linux - Using sed, tr, ... to fix the structure of a file
I have a file whose lines should look like this:

    u:<text>\td:<text>\ta:<text>\n

where <text> is any text without tab or newline characters, \t is a tab, and \n is a newline character. Unfortunately, some of the <text> fields contain a newline character, so the structure is broken. An example is this:

    u:uuu d:ddd a:aaa
    u:uuu d:ddd a:aaa
    u:uu
    u d:ddd a:aaa
    u:uuu d:ddd a:aaa

Here there is a newline character in the u field on the 3rd line, causing part of the content that should be on the 3rd line to end up on the 4th. How can I fix this structure with tools like sed or tr? I want to delete the newline characters that are not at the end of a record.

So for the example above, the fixed file should look like this:

    u:uuu d:ddd a:aaa
    u:uuu d:ddd a:aaa
    u:uuu d:ddd a:aaa
    u:uuu d:ddd a:aaa

Another important aspect of the solution is speed, since I have gigabytes of files to fix.
Given this input data (saved in the file data):

    u:uuu d:ddd a:aaa1
    u:uuu d:ddd a:aaa2
    u:uu
    u d:ddd a:aaa3
    u:uuu d:ddd a:aaa4
    u:uuu d:dd
    d a:aaa5
    u:uuu d:ddd a:aaa6

The sed script (saved in the file sed.script):
    /^u:.* d:.* a:.*/ { p; d; }
    /^u:.* d:.*/ { N; s/\n *//; p; d; }
    /^u:.*/ { N; s/\n *//; p; d; }

It can be run, and produces the output shown:
    $ sed -f sed.script data
    u:uuu d:ddd a:aaa1
    u:uuu d:ddd a:aaa2
    u:uuu d:ddd a:aaa3
    u:uuu d:ddd a:aaa4
    u:uuu d:ddd a:aaa5
    u:uuu d:ddd a:aaa6
    $

The first line of the script looks for u:, d: and a: on a single line, assumes the record is complete (and does not have a broken a: text field), prints the line, and deletes it (which skips the other actions in the script).

The second line looks for u: and d: only; the a: is presumably on the next line. It appends the next line of input, removes the embedded newline and any following spaces, and then prints and deletes as before.

The third line looks for u: alone and assumes that both d: and a: are on the next line. It appends the next line, removes the embedded newline and any following spaces, and then prints and deletes as before.

The patterns match a single space between fields, as in the sample shown here; if the real data is tab-separated, those spaces in the patterns would need to be tabs.
Extending this to handle breaks in the a: text field is non-trivial. It is also non-trivial to extend it to handle a record split in two fields:

    u:uu
    u d:dd
    d a:aaa7

Neither is formally impossible (especially if you choose to use Perl or Python instead of sed), but neither is simple. The double split is the simpler one to handle: inside the third line of the script, you'd have a second set of conditional actions based on whether the a: is found or not, and so on.
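The nested condition described above might look something like the following sketch. It assumes GNU sed, the space-separated layout of the sample data, and at most one break in the u: field plus one in the d: field; `fix_double_split` is a hypothetical helper name, not something from the original script.

```shell
# Hypothetical helper: joins records broken once in u:, once in d:, or both.
# Assumes GNU sed and single spaces between fields, as in the sample data.
fix_double_split() {
    sed '
        # Complete record: print it and start the next cycle.
        /^u:.* d:.* a:.*/ { p; d; }
        # u: and d: present, a: missing: join with the next line.
        /^u:.* d:.*/ { N; s/\n *//; p; d; }
        # Only u: present: join with the next line, then join once
        # more if a: has still not appeared (the double split).
        /^u:.*/ {
            N; s/\n *//
            / a:/! { N; s/\n *//; }
            p; d
        }
    ' "$@"
}

# Usage: fix_double_split data > data.fixed
```

Breaks in the a: field, or more than one break in the same field, are still not handled by this version.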
Handling multiple splits of a single field:

    u:u
    u
    u d:d
    d
    d a:aaa

would be tricky; it is doable, even in sed, but tricky.
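One way to sidestep the case analysis is to stop looking at individual fields and treat any line that begins with u: as the start of a new record, gluing every other line onto the record being built. The sketch below does that with awk rather than sed. It rests on one loudly stated assumption: a continuation line never itself begins with u: (if it can, this heuristic will split the record in the wrong place). `join_records` is a hypothetical helper name.

```shell
# Hypothetical helper: a line starting with "u:" opens a new record;
# any other line is a continuation and is appended to the current one.
# Assumption: continuation lines never start with "u:".
join_records() {
    awk '
        /^u:/ { if (have) print rec; rec = $0; have = 1; next }
              { sub(/^ +/, ""); rec = rec $0; have = 1 }  # drop leading spaces, then append
        END   { if (have) print rec }                     # flush the last record
    ' "$@"
}

# Usage: join_records data > data.fixed
```

Under that assumption, this copes with any number of breaks in any field, including a:, and it works in a single streaming pass that holds only one record in memory at a time, which matters for multi-gigabyte inputs.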