Python: process a CSV file to remove Unicode characters greater than 3 bytes
I'm using Python 2.7.5 and trying to take an existing CSV file and process it to remove Unicode characters that take more than 3 bytes. (I'm sending this to Mechanical Turk, and it's an Amazon restriction.)
I've tried to use the top (amazing) answer in this question ("How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?"). I assume I can iterate through the CSV row by row and, wherever I spot Unicode characters of more than 3 bytes, replace them with a replacement character.
    # -*- coding: utf-8 -*-
    import csv
    import re

    re_pattern = re.compile(u'[^\u0000-\ud7ff\ue000-\uffff]', re.UNICODE)
    ifile = open('sourcefile.csv', 'rU')
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    ofile = open('outputfile.csv', 'wb')
    writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    #skip header row
    next(reader, None)
    for row in reader:
        writer.writerow([re_pattern.sub(u'\ufffd', unicode(c).encode('utf8')) for c in row])
    ifile.close()
    ofile.close()
I'm getting this error:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 264: ordinal not in range(128)
So it does iterate through some rows, but stops when it gets to the strange Unicode characters.
I'd really appreciate some pointers; I'm confused. I've replaced 'utf8' with 'latin1' and unicode(c).encode with unicode(c).decode, and I keep getting the same error.
Your input is still encoded data, not Unicode values. You need to decode to Unicode values first, but you didn't specify an encoding to use. You then need to encode again, back to encoded values, to write to the output CSV:
    writer.writerow([re_pattern.sub(u'\ufffd', unicode(c, 'utf8')).encode('utf8') for c in row])
Your error stems from the unicode(c) call; without an explicit codec to use, Python falls back to the default ASCII codec.
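To see the decode, substitute, re-encode round trip in isolation, here is a minimal sketch. The helper name replace_non_bmp and the sample byte string are made up for illustration, and it is written with Python 3 in mind (using bytes.decode() rather than the unicode() built-in), so treat it as a sketch under those assumptions rather than a drop-in fix.

```python
import re

# Matches anything outside the Basic Multilingual Plane,
# i.e. characters that need 4 bytes in UTF-8.
re_pattern = re.compile(u'[^\u0000-\ud7ff\ue000-\uffff]', re.UNICODE)

def replace_non_bmp(raw, encoding='utf8'):
    """Hypothetical helper: decode bytes, replace non-BMP characters, re-encode."""
    text = raw.decode(encoding)
    return re_pattern.sub(u'\ufffd', text).encode(encoding)

# "cafe" with an acute accent on the e, encoded as UTF-8.
# The 0xc3 byte is outside the ASCII range.
raw = b'caf\xc3\xa9'

try:
    raw.decode('ascii')   # roughly what an implicit ASCII default amounts to
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# With an explicit codec the bytes decode fine, and nothing needs replacing.
cleaned = replace_non_bmp(raw)
```

The point of the sketch is only that the codec has to be named explicitly on the way in and again on the way out; the regular expression itself is unchanged.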
If you use your file objects as context managers, there is no need to manually close them:
    import csv
    import re

    re_pattern = re.compile(u'[^\u0000-\ud7ff\ue000-\uffff]', re.UNICODE)

    def limit_to_bmp(value, patt=re_pattern):
        return patt.sub(u'\ufffd', unicode(value, 'utf8')).encode('utf8')

    with open('sourcefile.csv', 'rU') as ifile, open('outputfile.csv', 'wb') as ofile:
        reader = csv.reader(ifile, dialect=csv.excel_tab)
        writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        next(reader, None)  # header is not added to output file
        writer.writerows(map(limit_to_bmp, row) for row in reader)
I moved the replacement action to a separate function too, and used a generator expression to produce all rows on demand for the writer.writerows() function.
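For comparison, here is a rough sketch of the same pipeline on Python 3, which is an assumption outside the question's 2.7.5 setup. There the csv module works on text rather than bytes, so the decoding and encoding would move into open() (via its encoding argument) and the function only has to substitute. The sample rows and the io.StringIO objects standing in for the two files are invented for illustration.

```python
import csv
import io
import re

re_pattern = re.compile('[^\u0000-\ud7ff\ue000-\uffff]')

def limit_to_bmp(value, patt=re_pattern):
    # Text in, text out: no explicit decode/encode step on Python 3.
    return patt.sub('\ufffd', value)

# In-memory stand-ins for files opened with newline='' and encoding='utf8'.
source = io.StringIO('id\ttext\n1\tplain\n2\tsmile \U0001F600\n', newline='')
output = io.StringIO(newline='')

reader = csv.reader(source, dialect=csv.excel_tab)
writer = csv.writer(output, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
next(reader, None)  # skip the header row
# Generator expression: each row is cleaned only when the writer asks for it.
writer.writerows(map(limit_to_bmp, row) for row in reader)

result = output.getvalue()
```

With real files, the two StringIO objects would be replaced by open() calls inside a with statement, as in the Python 2 version above.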