Finding equivalent elements in SQL (regrouping)

16th January 2013

Graph of equivalent elements

Regrouping is something that I've had to do many times during my career in data analytics, unfortunately a lot of relationships aren't defined through a single unique identifier, at least not at first. This code snippet is designed to flush out the transitive relationships defined between pairs of elements and assign all elements in the graph a single unique identifier. Get ready for some SQL looping action...

The code below is simplistic and I'm sure there are better more efficient ways to do this so please let me know in the comments. The example is written in T-SQL (Microsoft SQL Server) and makes use of analytics functions. If you're using another dialect of SQL, it is possible to do the same thing using group by statements or equivalent analytics functions but they're not detailed here.

Before I delve into the code, I'll take a minute to explain what the algorithm does and why it's so useful to have something equivalent in your toolbox.

Calculating two way comparisons is logically straightforward, A=B?, A>B?, A<=B… in each we deal with a pair of elements and from the comparison we get a binary answer, True or False. As such, coding comparisons between pairs is conceptually straightforward and a natural choice. Capturing the results is easy:

Input1 Input2 Comparitor Result
A B = True
C D = True

However, what if we add another relationship comparing A and E:

Input1 Input2 Comparitor Result
A E = True

And this is where this piece of code becomes useful. In a procedural language, we can capture relationships between these elements in a data structure designed for it, for example a tree or graph. In SQL, we're stuck with tables, and so while it's easy to see the relationships between A and B, and A and E, it is not easy to see the relationship between A and E. Using a regrouping technique, we can store a single 'equivalence' identifier that can be used to identify all equivalent elements in the dataset.

To illustrate how the code works I've included an example set of elements, shown in the graph below and re-created in the table #pairs:

Graph of equivalent elements


create table #pairs (
    Input1 nvarchar(1),     
    Input2 nvarchar(1)
)

insert into #pairs values ('A', 'B')
insert into #pairs values ('B', 'C')
insert into #pairs values ('C', 'D')
insert into #pairs values ('C', 'F')
insert into #pairs values ('D', 'E')

insert into #pairs values ('G', 'H')
insert into #pairs values ('G', 'I')

Let's assign an identifier to each pair of elements using the row_number function, it's important that we use a numerical identifier for the pair, as we will rely upon a comparison later to update the pair number as part of the regrouping.


select *, row_number() over (order by Input1, Input2) as pair
into #pairs_indexed
from #pairs

We need to transpose the pairs so that each pair occurs over two rows, one for each element. This is most easily done using a union all.


select Input1 as Id, Pair
into #pairs_transposed
from #pairs_indexed
union all
select Input2 as Id, Pair
from #pairs_indexed

Now we start the looping, this is the real code snippet. The loop will continue indefinitely until we call a break to stop looping. This should only be used when it is guaranteed that the break statement will be called in an acceptable amount of time, otherwise your database will be going forever.

  1. For each element Id, what is the lowest pair number
  2. Map the existing pair numbers for all instances of the Id to the lowest pair number across the Id
  3. If there is no mapping (all existing pair numbers are equal to the lowest pair number) then exit, we're done
  4. Otherwise update all of the existing pair numbers with the lowest pair number across the Id
  5. Repeat from 1.

Note that steps 1 and 2 are run as a single query


DECLARE @rowcount INT
WHILE (1=1)
BEGIN

if OBJECT_ID('tempdb..#minimums') is not null drop table #minimums

-- step 2: Map pair numbers to minimum pair numbers
select pair, min(min_pair) as min_pair
into #minimums
from (     
     -- step 1: For each element Id, what is the lowest pair number
     select id, pair, min(pair) over(partition by id) as min_pair
     from #pairs_transposed
) x
group by pair

-- step 3: Are there any left where pair <> min_pair, if not, we're done
select @rowcount = count(*)
from #minimums
where pair <> min_pair

if @rowcount = 0 
begin
break
end

-- step 4: Update pairs with minimum pairs
update ex
set pair = min_pair
from #pairs_transposed ex
inner join #minimums mins
on ex.pair = mins.pair

END
GO

You can run each statement by itself and watch as the algorithm converges on a solution. You'll see that the algorithm follows a breadth-first approach and so it will take longer to converge when the chains of equivalent elements are deep rather than wide.

Try it out and let me know what improvements you can make.

By Niall Napier

Back to the blog

comments powered by Disqus

This Article's Tags

View the whole tag cloud