python - Removing lines from file based on character/word ratio - unix/bash
I have 2 files and need to remove the lines that fall under a certain token ratio, e.g.
file 1:
    this foo bar question
    not parallel sentence because it's long
    hello world
file 2:
    c'est le foo bar question
    creme bulee
    bonjour tout le monde
The ratio is calculated as the total no. of words in file 1 / total no. of words in file 2, and a sentence pair is removed if it falls under the ratio.
The output is then a conjoined file in which each sentence from file1 and its counterpart from file2 are separated by a tab:
[out]:
    this foo bar question\tc'est le foo bar question
    hello world\tbonjour tout le monde
The files have the same number of lines. I have been doing it as follows, but how can I do the same in unix bash instead of using Python?
    import io

    # Calculate the ratio: total words in file1 / total words in file2.
    with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2:
        ratio = len(f1.read().split()) / float(len(f2.read().split()))

    # Check each line pair and write the pairs that pass to the output file.
    with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2, io.open('fileout', 'w', encoding='utf8') as fout:
        for l1, l2 in zip(f1, f2):
            # Keep the pair only when its word ratio exceeds the overall ratio.
            if len(l1.split()) / float(len(l2.split())) > ratio:
                fout.write(u"\t".join([l1.strip(), l2.strip()]) + u"\n")
Also, if the ratio calculation is based on characters instead of words, I can do it in Python as below, but how do I achieve the same in unix bash? Note that the only difference is counting with len(str.split()) versus len(str).
    import io

    # Calculate the ratio: total characters in file1 / total characters in file2.
    with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2:
        ratio = len(f1.read()) / float(len(f2.read()))

    # Check each line pair and write the pairs that pass to the output file.
    with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2, io.open('fileout', 'w', encoding='utf8') as fout:
        for l1, l2 in zip(f1, f2):
            # Keep the pair only when its character ratio exceeds the overall ratio.
            if len(l1) / float(len(l2)) > ratio:
                fout.write(u"\t".join([l1.strip(), l2.strip()]) + u"\n")
Here's a simple ratio calculator in awk.
    awk 'NR == FNR { a[NR] = NF; next } { print NF/a[FNR] }' file1 file2
This merely prints the ratio for each line. Extending it to print a line from the second file only when the ratio is in a particular range is easy.
    awk 'NR == FNR { a[NR] = NF; next } NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2' file1 file2
(This uses an awk shorthand: the general form is condition { action }, and if you omit the { action }, it defaults to { print }. If you omit the condition, the action is taken unconditionally.)
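To make the shorthand concrete, here is a minimal sketch of the three forms. The three-line sample input and the variable names are made up for illustration; they are not part of the question's data.

```shell
# Made-up sample input: lines with 1, 2 and 3 fields.
sample=$(printf 'one\ntwo words\nthree word line\n')

# Condition only: the action defaults to { print }.
cond_only=$(printf '%s\n' "$sample" | awk 'NF >= 2')

# Action only: it is applied to every input line.
action_only=$(printf '%s\n' "$sample" | awk '{ print NF }')

# Condition and action together.
both=$(printf '%s\n' "$sample" | awk 'NF >= 2 { print NF ": " $0 }')

printf '%s\n---\n%s\n---\n%s\n' "$cond_only" "$action_only" "$both"
```

The first command prints the two lines that have at least two fields, the second prints the field count of every line, and the third prints the count and the text for the lines that match.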
You could run a second pass on file1 with the same filter, or just run it again with the file names inverted.
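A rough sketch of the "file names inverted" idea, using the sample sentences from the question. The scratch directory, the helper name keep_second and the kept1/kept2 files are hypothetical, chosen only for this illustration. It relies on the range 0.5 to 2 being symmetric: a ratio r lies in that range exactly when 1/r does, so both passes keep the same line numbers and the results can be glued together with paste. It also assumes neither file contains an empty line, since that would mean dividing by zero.

```shell
# Recreate the question's sample data in a scratch directory (hypothetical paths).
dir=$(mktemp -d)
printf '%s\n' "this foo bar question" "not parallel sentence because it's long" "hello world" > "$dir/file1"
printf '%s\n' "c'est le foo bar question" "creme bulee" "bonjour tout le monde" > "$dir/file2"

# Hypothetical helper: prints the lines of its second argument whose
# word-count ratio against the first argument lies between 0.5 and 2.
keep_second() {
    awk 'NR == FNR { a[NR] = NF; next } NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2' "$1" "$2"
}

keep_second "$dir/file1" "$dir/file2" > "$dir/kept2"   # surviving lines of file2
keep_second "$dir/file2" "$dir/file1" > "$dir/kept1"   # names inverted: surviving lines of file1

# Both passes kept the same line numbers, so paste can pair them with a tab.
joined=$(paste "$dir/kept1" "$dir/kept2")
printf '%s\n' "$joined"

rm -rf "$dir"
```

On the sample data this keeps the first and third sentence pairs and drops the second one.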
Oh, wait, here is a complete solution.
    awk 'NR == FNR { a[NR] = NF; w[NR] = $0; next } NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2 { print w[FNR] "\t" $0 }' file1 file2
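For the character-based variant the question asks about, one possible sketch is to swap NF for length($0). The 0.5 and 2 thresholds are simply carried over from the word-based version as an assumption, the a[FNR] > 0 guard against empty lines is an addition not present above, and the scratch directory is hypothetical. Note that some awk implementations or locales count bytes rather than characters in length(), which matters for non-ASCII text.

```shell
# Recreate the question's sample data in a scratch directory (hypothetical paths).
dir=$(mktemp -d)
printf '%s\n' "this foo bar question" "not parallel sentence because it's long" "hello world" > "$dir/file1"
printf '%s\n' "c'est le foo bar question" "creme bulee" "bonjour tout le monde" > "$dir/file2"

# Same structure as the word-based solution, but each line is measured
# with length($0) instead of NF. The a[FNR] > 0 test (an added guard)
# skips pairs whose file1 line is empty, avoiding a division by zero.
chars_out=$(awk 'NR == FNR { a[NR] = length($0); w[NR] = $0; next }
    a[FNR] > 0 && length($0)/a[FNR] >= 0.5 && length($0)/a[FNR] <= 2 { print w[FNR] "\t" $0 }' \
    "$dir/file1" "$dir/file2")

printf '%s\n' "$chars_out"

rm -rf "$dir"
```

On the sample data the character ratios are 25/21, 11/39 and 21/11, so the same two pairs survive as in the word-based run.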