python - Removing lines from file based on character/word ratio


August 15, 2010

I have two files and need to remove the lines that fall under a certain token ratio, e.g.

File 1:

    this foo bar question
    not parallel sentence because it's long
    hello world

File 2:

    c'est le foo bar question
    creme bulee
    bonjour tout le monde

The ratio is calculated as the total number of words in file 1 divided by the total number of words in file 2, and a sentence pair is removed if it falls under that ratio.

The output is then a joined file with the sentences from file1 and file2 separated by a tab:

[out]:

    this foo bar question\tc'est le foo bar question
    hello world\tbonjour tout le monde
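As a worked example of the ratio described above, the sketch below counts words in the sample lines exactly as they are printed in this post. It assumes each sample file splits into three lines, as the expected output suggests; the variable names are only illustrative.

```python
# Sample lines as shown in this post (the three-line split is an assumption).
file1 = [
    u"this foo bar question",
    u"not parallel sentence because it's long",
    u"hello world",
]
file2 = [
    u"c'est le foo bar question",
    u"creme bulee",
    u"bonjour tout le monde",
]

# Overall ratio: total words in file 1 / total words in file 2.
total1 = sum(len(line.split()) for line in file1)
total2 = sum(len(line.split()) for line in file2)
ratio = total1 / float(total2)

# Per-pair ratios, computed the same way for each pair of lines.
pair_ratios = [len(a.split()) / float(len(b.split()))
               for a, b in zip(file1, file2)]

print(total1, total2, ratio)
print(pair_ratios)
```

Each value in pair_ratios is what gets compared against the overall ratio when deciding whether a pair is kept.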

The files always have the same number of lines. I have been doing it as follows in Python, but how can I do the same in Unix bash instead of using Python?

    import io

    # Calculate the overall word ratio.
    with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2:
        ratio = len(f1.read().split()) / float(len(f2.read().split()))

    # Check each line pair and write the survivors to the output file.
    # Assumes no line in file2 is empty (otherwise this divides by zero).
    with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2, io.open('fileout', 'w', encoding='utf8') as fout:
        for l1, l2 in zip(f1, f2):
            if len(l1.split()) / float(len(l2.split())) > ratio:
                fout.write(u"\t".join([l1.strip(), l2.strip()]) + u"\n")

Also, if the ratio calculation is based on characters instead of words, I can do it like this in Python, but how do I achieve the same in Unix bash? Note that the only difference is counting len(str) instead of len(str.split()).

    import io

    # Calculate the overall character ratio.
    with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2:
        ratio = len(f1.read()) / float(len(f2.read()))

    # Check each line pair and write the survivors to the output file.
    # Note that len(l1) and len(l2) include the trailing newline.
    with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2, io.open('fileout', 'w', encoding='utf8') as fout:
        for l1, l2 in zip(f1, f2):
            if len(l1) / float(len(l2)) > ratio:
                fout.write(u"\t".join([l1.strip(), l2.strip()]) + u"\n")
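To make the difference between the two counts concrete, here is a small sketch on the first sample pair. The strings are the sample lines from this post with a newline appended, which is how they would look when read line by line from a file; the variable names are hypothetical.

```python
# Word count vs. character count for one sample line pair.
l1 = u"this foo bar question\n"       # as read from file1, newline included
l2 = u"c'est le foo bar question\n"   # as read from file2, newline included

# Word-based ratio: split() ignores the newline and any extra whitespace.
word_ratio = len(l1.split()) / float(len(l2.split()))

# Character-based ratio: len() counts every character, newline included.
char_ratio = len(l1) / float(len(l2))

# Stripping first gives the ratio of the visible characters only.
char_ratio_stripped = len(l1.strip()) / float(len(l2.strip()))

print(word_ratio)
print(char_ratio, char_ratio_stripped)
```

The two character ratios differ slightly, so it is worth deciding up front whether the trailing newline should count.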

Here's a simple ratio calculator in Awk.

    awk 'NR == FNR { a[NR] = NF; next }
        { print NF/a[FNR] }' file1 file2

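This merely prints the ratio for each line (the word count of the line in file2 divided by the word count of the corresponding line in file1). Extending it to print only the line from the second file when the ratio is in a particular range is easy.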

    awk 'NR == FNR { a[NR] = NF; next }
        NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2' file1 file2

(This uses an Awk shorthand: in the general form condition { action }, if you omit the { action } it defaults to { print }. If you omit the condition, the action is taken unconditionally.)

You could run a second pass on file1 to do the same, or just run it again with the file names inverted.

Oh, wait, here is a complete solution.

    awk 'NR == FNR { a[NR] = NF; w[NR] = $0; next }
        NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2 { print w[FNR] "\t" $0 }' file1 file2
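For comparison, the same filter can be sketched in Python. This is only an illustration of the logic in the Awk one-liner, not a replacement for it. The function name filter_pairs and its low/high parameters are made up for this example, the sample lines are the ones from the question (assuming three lines per file), and skipping empty lines is a choice made here that the Awk version does not make.

```python
def filter_pairs(lines1, lines2, low=0.5, high=2.0):
    """Yield tab-joined pairs whose word ratio (file2 / file1) is in [low, high]."""
    for l1, l2 in zip(lines1, lines2):
        n1, n2 = len(l1.split()), len(l2.split())
        if n1 == 0:
            # Choice made for this sketch: skip the pair instead of dividing by zero.
            continue
        if low <= n2 / float(n1) <= high:
            yield u"\t".join([l1.strip(), l2.strip()])


# Sample lines from the question (the three-line split is an assumption).
file1 = [
    u"this foo bar question",
    u"not parallel sentence because it's long",
    u"hello world",
]
file2 = [
    u"c'est le foo bar question",
    u"creme bulee",
    u"bonjour tout le monde",
]

kept = list(filter_pairs(file1, file2))
for line in kept:
    print(line)
```

On these sample lines the range 0.5 to 2 keeps the first and third pairs, which matches the output shown in the question.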
