python - Removing lines from file based on character/word ratio - unix/bash


I have 2 files and need to remove the lines that fall under a certain token ratio, e.g.

file 1:

    this foo bar question
    not parallel sentence because it's long
    hello world

file 2:

    c'est le foo bar question
    creme bulee
    bonjour tout le monde

The ratio is calculated as the total number of words in file 1 divided by the total number of words in file 2, and a sentence pair is removed if its own ratio falls under that value.

Then output a conjoined file with the sentences from file1 and file2 separated by a tab:

[out]:

    this foo bar question\tc'est le foo bar question
    hello world\tbonjour tout le monde

The files have the same number of lines. I have been doing it as follows in Python; how can I do the same in Unix bash instead of using Python?

    import io

    # Calculate the global word ratio.
    with io.open('file1', 'r', encoding='utf8') as f1, \
         io.open('file2', 'r', encoding='utf8') as f2:
        ratio = len(f1.read().split()) / float(len(f2.read().split()))

    # Check each line pair and write the output file.
    with io.open('file1', 'r', encoding='utf8') as f1, \
         io.open('file2', 'r', encoding='utf8') as f2, \
         io.open('fileout', 'w', encoding='utf8') as fout:
        for l1, l2 in zip(f1, f2):
            if len(l1.split()) / float(len(l2.split())) > ratio:
                fout.write(u"\t".join([l1.strip(), l2.strip()]) + u"\n")

Also, if the ratio calculation is based on characters instead of words, I can do it in Python as below; how do I achieve the same in Unix bash? Note that the only difference is counting with len(str) instead of len(str.split()).

    import io

    # Calculate the global character ratio.
    with io.open('file1', 'r', encoding='utf8') as f1, \
         io.open('file2', 'r', encoding='utf8') as f2:
        ratio = len(f1.read()) / float(len(f2.read()))

    # Check each line pair and write the output file.
    with io.open('file1', 'r', encoding='utf8') as f1, \
         io.open('file2', 'r', encoding='utf8') as f2, \
         io.open('fileout', 'w', encoding='utf8') as fout:
        for l1, l2 in zip(f1, f2):
            if len(l1) / float(len(l2)) > ratio:
                fout.write(u"\t".join([l1.strip(), l2.strip()]) + u"\n")

Here's a simple ratio calculator in awk.

    awk 'NR == FNR { a[NR] = NF; next }
         { print NF/a[FNR] }' file1 file2

This merely prints the ratio for each line. Extending it to print the second file only when the ratio is in a particular range is easy.

    awk 'NR == FNR { a[NR] = NF; next }
         NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2' file1 file2

(This uses the awk shorthand: in the general form condition { action }, if you omit the { action } it defaults to { print }. If you omit the condition, the action is taken unconditionally.)
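To see the shorthand in isolation, here is a small sketch on made-up input. The `sample` function and its three lines are assumptions for illustration only, not part of the original answer.

```shell
# Hypothetical sample input: three lines with 4, 1 and 2 fields.
sample() { printf 'a b c d\ne\nf g\n'; }

# Condition only: the action defaults to { print }, so matching lines are printed.
sample | awk 'NF >= 2'
# a b c d
# f g

# Action only: with no condition, the action runs for every line.
sample | awk '{ print NF }'
# 4
# 1
# 2

# Both: the action runs only for lines where the condition holds.
sample | awk 'NF >= 2 { print NF ": " $0 }'
# 4: a b c d
# 2: f g
```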

You can run a second pass on file1 to do the same, or just run it again with the file names inverted.

Oh, wait, here is a complete solution.

    awk 'NR == FNR { a[NR] = NF; w[NR] = $0; next }
         NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2 { print w[FNR] "\t" $0 }' file1 file2
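The one-liners above use a fixed range and divide file2's count by file1's. If the threshold should instead be the global file1/file2 ratio, as in the question's Python code, and optionally be based on characters, something like the following sketch might work. The function name `ratio_filter` and its `words`/`chars` mode argument are made up for illustration. It holds both files in memory, and it keeps a pair when its per-line ratio exceeds the global ratio, mirroring the `>` comparison in the Python code; flip the comparison if the opposite filtering is wanted.

```shell
# ratio_filter FILE1 FILE2 [MODE]  (hypothetical helper, not from the original answer)
# MODE is "words" (default) or "chars".
ratio_filter() {
    awk -v mode="${3:-words}" '
        # Size of the current line: field count, or length without the newline.
        # Assumption: length($0) counts characters only in a multibyte-aware awk
        # under a UTF-8 locale; some awks count bytes instead. Unlike len(line)
        # in the Python version, the trailing newline is not counted.
        function size() { return (mode == "chars") ? length($0) : NF }

        # First file: remember each line and its size, and accumulate the total.
        NR == FNR { n1[FNR] = size(); w1[FNR] = $0; t1 += n1[FNR]; next }

        # Second file: same bookkeeping, plus the line count.
        { n2[FNR] = size(); w2[FNR] = $0; t2 += n2[FNR]; lines = FNR }

        END {
            if (t2 == 0) exit            # avoid dividing by zero
            ratio = t1 / t2              # global file1/file2 ratio
            for (i = 1; i <= lines; i++)
                if (n2[i] > 0 && n1[i] / n2[i] > ratio)
                    print w1[i] "\t" w2[i]
        }
    ' "$1" "$2"
}

# Possible usage:
#   ratio_filter file1 file2 > fileout
#   ratio_filter file1 file2 chars > fileout
```

Like the `NR == FNR` idiom it borrows, this assumes file1 is not empty; with an empty first file, awk would treat file2's lines as the first file.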
