linux - Using sed, tr, ... to fix the structure of a file -

February 15, 2012

I have a file whose lines should look like this:

u:<text>\td:<text>\ta:<text>\n

where <text> is text without tab or newline characters, \t is a tab, and \n is a newline character. Unfortunately, some <text> fields contain a newline character, so the structure is broken. For example:

    u:uuu     d:ddd     a:aaa
    u:uuu     d:ddd     a:aaa
    u:uu
    u    d:ddd    a:aaa
    u:uuu     d:ddd     a:aaa

Here there is a newline character in field u on the 3rd line, so part of the content that should be on the 3rd line ends up on the 4th. How can I fix the structure with tools like sed or tr? I want to delete the newline characters that are not at the end of a record.

So for the example above, the fixed file should look like this:

    u:uuu     d:ddd     a:aaa
    u:uuu     d:ddd     a:aaa
    u:uuu     d:ddd     a:aaa
    u:uuu     d:ddd     a:aaa

Another important aspect of the solution is speed, since I have gigabytes of files to fix.

Given this input data (saved in the file data):

    u:uuu     d:ddd     a:aaa1
    u:uuu     d:ddd     a:aaa2
    u:uu
        u     d:ddd     a:aaa3
    u:uuu     d:ddd     a:aaa4
    u:uuu     d:dd
                  d     a:aaa5
    u:uuu     d:ddd     a:aaa6

The sed script (saved in the file sed.script):

    /^u:.* d:.* a:.*/ { p; d; }
    /^u:.* d:.*/ { N; s/\n *//; p; d; }
    /^u:.*/ { N; s/\n *//; p; d; }

can be run and produces the output shown:

    $ sed -f sed.script data
    u:uuu     d:ddd     a:aaa1
    u:uuu     d:ddd     a:aaa2
    u:uuu     d:ddd     a:aaa3
    u:uuu     d:ddd     a:aaa4
    u:uuu     d:ddd     a:aaa5
    u:uuu     d:ddd     a:aaa6
    $

The first line of the script looks for u:, d: and a: on a single line, assumes the record is complete (and not broken in the a: text field), prints the line, and deletes it (which skips the other actions in the script). The second line looks for u: and d: only; a: is presumably on the next line. It appends the next line of input, removes the embedded newline and the following spaces (if any), and prints and deletes as before. The third line looks for u: alone and assumes both d: and a: are on the next line. It appends the next line, removes the embedded newline and the following spaces (if any), and prints and deletes as before.
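To make the logic of the three rules easier to follow, here is a minimal Python sketch of the same idea. It is not part of the original answer: the function name `join_single_breaks` is made up for illustration, and the `[ \t]` in the pattern is an assumption so that the sketch works whether the fields are separated by spaces (as displayed here) or by tabs (as the question describes). Like the sed script, it joins at most one following line per record.

```python
import re
from typing import Iterable, Iterator

# Rule 1 of the sed script: u:, d: and a: all appear on one line.
# [ \t] is an assumption so that space- and tab-separated fields both match.
COMPLETE = re.compile(r"^u:.*[ \t]d:.*[ \t]a:")


def join_single_breaks(lines: Iterable[str]) -> Iterator[str]:
    """Mimic the three sed rules: join at most one continuation line per record."""
    it = iter(lines)
    for raw in it:
        line = raw.rstrip("\n")
        # Rules 2 and 3: the line starts a record, but a: (or d: and a:) is missing.
        if line.startswith("u:") and not COMPLETE.match(line):
            nxt = next(it, None)
            if nxt is not None:
                # Same effect as s/\n *//: drop the newline and any spaces after it.
                line += nxt.rstrip("\n").lstrip(" ")
        yield line


data = [
    "u:uuu     d:ddd     a:aaa1",
    "u:uu",
    "     u     d:ddd     a:aaa3",
    "u:uuu     d:dd",
    "               d     a:aaa5",
]
for record in join_single_breaks(data):
    print(record)
```

Rules 2 and 3 perform the same action, so the sketch folds them into one branch. A record that is split twice is still only joined once, which is the same limitation the sed script has.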

Extending this to handle breaks in the a: text field is non-trivial. It is also non-trivial to extend it to handle:

    u:uu
        u     d:dd
                  d     a:aaa7

Neither is formally impossible (especially if you choose to use Perl or Python instead of sed), but it is not simple. The double split is the simpler one to handle: inside the third line, you'd have a second set of conditional actions based on whether a: was found or not, etc.
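If Python is an option, one possible approach is sketched below. It rests on an assumption that the question does not state: every genuine record starts with `u:` at the beginning of a line, and no fragment that follows a stray newline happens to start with `u:`. Under that assumption, any line that does not start with `u:` is glued onto the record before it, which covers breaks in the u:, d: and a: fields alike. The name `repair_records` is hypothetical, and stripping leading spaces from each fragment simply copies what the sed script does.

```python
from typing import Iterable, Iterator, Optional


def repair_records(lines: Iterable[str], start: str = "u:") -> Iterator[str]:
    """Yield repaired records; a new record begins only at a line starting with `start`.

    Assumption: no continuation fragment begins with `start`.
    """
    record: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(start):
            # A new record begins, so the previous one is complete.
            if record is not None:
                yield record
            record = line
        elif record is None:
            # Text before the first record: keep it rather than lose it.
            record = line
        else:
            # Continuation fragment: drop leading spaces, as the sed script does.
            record += line.lstrip(" ")
    if record is not None:
        yield record


double_split = ["u:uu\n", "     u\td:dd\n", "               d\ta:aaa7\n"]
for rec in repair_records(double_split):
    print(repr(rec))
```

The generator keeps only one record in memory at a time, so it can be fed an open file object directly and should cope with multi-gigabyte inputs, although it will not match the raw speed of sed.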

Handling multiple splits in a single field:

    u:u
       u
        u
                   d:d
                      d
                       d
                              a:aaa

would be tricky: doable in sed, but tricky.
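Outside sed, multiple splits are less of a problem. As a rough sketch (again assuming that only genuine records start with `u:`), a single regular expression can delete every newline that is not followed by `u:` or by the end of the text, together with any spaces after it. The names `BROKEN_NEWLINE` and `repair_text` are made up for this example. Note that it works on a whole string in memory, so for very large files it would have to be applied piece by piece or replaced by a line-by-line loop.

```python
import re

# A newline is "broken" if what follows is neither "u:" nor the end of the text.
# The trailing " *" also removes spaces after the newline, like s/\n *// in sed.
BROKEN_NEWLINE = re.compile(r"\n(?!u:|\Z) *")


def repair_text(text: str) -> str:
    """Remove newlines (and following spaces) that do not end a record."""
    return BROKEN_NEWLINE.sub("", text)


# The u: and d: fields are each split twice; the second record is intact.
broken = "u:u\n    u\n     u\td:d\n   d\n  d\ta:aaa\nu:uuu\td:ddd\ta:aaa\n"
print(repr(repair_text(broken)))
```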
