sql - How to count columns where values differ? -

August 15, 2011

i have large table , need check similar rows. don't need column values same, similar. rows must not "distant" (determined query on other table), no value may different (i have done queries these conditions) , other values must same. must expect ambiguity, 1 or 2 different values shouldn't break "similarity" (well, better performance accepting "completely equal" rows, simplification cause errors; option).

the way going solve through pl/pgsql: make loop iterating through results of previous queries. each column, have if testing whether differs; if yes, increment difference counter , go on. @ end of each loop, compare value threshold , see if should keep row "similar" or not.

such pl/pgsql-heavy approach seems slow in comparison pure sql query, or sql query pl/pgsql functions involved. easy test rows x equivalent rows if knew rows should different, difference can occur @ of 40 rows. is there way how solve single query? if not, there faster way examine rows?

edit: mentioned table, in fact group of 6 tables linked 1:1 relationship. don't feel explaining what, that's a different question. extrapolating doing on 1 table situation easy me. simplified (but not oversimplified - should demonstrate difficulties have there) , made example demonstrating need. null , else should count "different". no need make script testing - need find out whether possible in way more efficient thought about.

the point don't need count rows (as usual), but columns.

edit2: previous fiddle - wasn't short, let here archiving reasons.

edit3: simplified example here - not null integers, preprocessing omitted. current state of data:

select * foo;      id | bar1 | bar2 | bar3 | bar4 | bar5      ----+------+------+------+------+------       1 |    4 |    2 |    3 |    4 |   11        2 |    4 |    2 |    4 |    3 |   11        3 |    6 |    3 |    3 |    5 |   13

when run select similar_records( 1 );, should row 2 (2 columns different values; within limit), not 3 (4 different values - outside limit of 2 differences @ most).

to find rows differ on given maximum number of columns:

with cte (    select id          ,unnest(array['bar1', 'bar2', 'bar3', 'bar4', 'bar5']) col  -- more          ,unnest(array[bar1::text, bar2::text, bar3::text                      , bar4::text, bar5::text]) val -- more      foo    ) select b.id, count(a.val <> b.val or null) cols_different   (select * cte id =  1) join   (select * cte id <> 1) b using (col) group  b.id having count(a.val <> b.val or null) < 3 -- max. diffs allowed order  2;

i ignored other distracting details in question.

demonstrating 5 columns. add more required.

if columns can null may want use is distinct from instead of <>.

this using unorthodox, handy parallel unnest(). both arrays must have same number of elements work. details:

is there zip() function in postgresql combines 2 arrays?

sql fiddle (building on yours).

Search This Blog

O9

sql - How to count columns where values differ? -

Comments

Post a Comment

Popular posts from this blog

javascript - Jquery show_hide, what to add in order to make the page scroll to the bottom of the hidden field once button is clicked -

Error while updating a record in APEX screen -

ios - Xcode 5 "No such file or directory" -