Its been quite a while since I last wrote and I feel terrible about it. So, I have decided that, starting from today, I will try to publish at least one post every week. Through these weekly posts, I will discuss the most interesting algorithmic problem I solved during the week. Some of them, like this one, will be accompanied with an introduction to a new data structure or algorithm I had to learn to solve the problem. Now, lets start disjointset data structure without further ado.
Lets first analyze the problem and understand why we need disjointset data structure. Here is an abridged version of the problem statement (
link to original problem statement):
There are n \((1 \leq n \leq 50000 )\) students in a university and for some given m \(( 0 \leq m \leq \frac{n(n  1)}{2} )\) pairs of students you know that the students in each pair follow the same religion. Find the maximum number of religions that the students in the university believe in assuming that one student can follow exactly one religion.
After reading the statement, we can make the following observations:

If a and b follow the same religion and b and c follow the same religion, then a and c follow the same religion

All the n students of the university can be divided into different sets such that all students in the same set follow the same religion. Initially (i.e. before scanning the input), each student is in a different set.

For a pair \( (a, b) \) in the input we need to find sets \(A\) and \(B\) such that \(a \in A\) and \(b \in B\) and combine them i.e. we have to find \( C = A \cup B\) and replace both \(A\) and \(B\) with \(C\). (Note that since a student cannot follow more than one religion, the sets \(A\) and \(B\) are either equal or disjoint.)
So, we need a need a data structure that can, given \(a\) and \(b\), find sets \(A, B\) such that \(a \in A\) and \(b \in B\) and then find \(C = A \cup B\) efficiently. This is where disjointset data structure comes in.
A disjointset data structure, also called a unionfind set or mergefind set, is a data structure that keeps track of a set of elements partitioned into a number of disjoint sets.
The most efficient implementation is using rooted trees and a couple of heuristic and that’s the only implementation I am going to discuss here although I urge you to explore the linked list implementation and understand why it is slow. Each set in the disjointset forest is represented by a tree in which each node points to its parent and the root of each tree is called the representative of the set. A representative of a set is an element of the set that is used as an identifier for the set i.e every set can be identified by its representative. This data structure should be able to perform these operations efficiently:

makeset(x)
: creates a new set with \(x\) as its only member

find(a)
: finds the set that contains \(a\) i.e returns the representative of the set that contains \(a\)

union(a, b)
: merges (union) the sets containing a and b i.e it computes \(A \cup B\) where \(a \in B\) and \(b \in B\). This is achieved by finding the representative of \(a\) and \(b\) and then making one of them the parent of the other.
Assuming that you explored the linked list implementation, you will notice that this is not faster than that in the worst case. To make it faster we need to use two heuristics: union by rank and path compression.
Union by rank In the
union()
operation above there are no specific instructions on which set should be the parent. It is shown that optimal performance is achieved when the set with the higher
rank is made the parent.
Rank can be defined as an upper bound on the number of edges between in the longest path between the root and a leaf.
Path compression During the
find()
operation, we travel from a given node to the root of the tree. During this traversal we can set the parent for each node in the path to be the representative element thereby reducing the height of the tree and making the subsequent
find()
operations faster. Note that the application of this heuristic does not change the rank because the rank is just an upper bound (which is useful for proving time complexity).
It’s now time to implement this data structure. Here’s the pseudo code:
Let parent[1 ... n] and rank[1 ... n] be arrays such that
parent[x] stores the parent of node x and,
rank[x] stores the rank of node x.
PROCEDURE makeset(x):
1. parent[x] := x
2. rank[x] := 0
PROCEDURE find(x):
1. if (parent[x] != x) // path compression
2. parent[x] := find(parent[x])
3. return parent[x]
PROCEDURE union(x, y):
1. xRoot := find(x) // find the set containing x
2. yRoot := find(y) // find the set containing y
3. if (xRoot == yRoot)
4. return
5.
6. // use union by rank heuristic to merge
7. if (rank[xRoot] < rank[yRoot])
8. parent[xRoot] := yRoot
9. else if rank[xRoot] > rank[yRoot]
10. parent[yRoot] := xRoot
11. else
12. parent[yRoot] := xRoot
13. rank[xRoot] := rank[xRoot] + 1
It can be proved that the time complexity for union and find operations is \(\mathcal{O}(\alpha(n))\) where \(\alpha(n)\) is the very slowly growing inverse
Ackermann function. The value of \(\alpha(n)\) is less than 5 for all practical values of \(n\).
Now that we know how the data structure works and how to implement it, solving the problem is straightforward. But, there’s a problem. The problem requires us to calculate the number of disjoint sets that have formed. To do this we just need to use a variable to keep track of the number of sets. Whenever
makeset()
is called, the variable has to be incremented by 1 and whenever
union()
is called the variable is decremented by 1 (Why?). Take it as a challenge to add this feature to your implementation. Here’s the algorithm (without the set counter) to solve the problem:
PROCEDURE UbiquitousReligions():
1. // add each student to an individual set
2. for i = 1 to n:
3. makeset(i) // add set {i} to the forest
4.
5. // process each of the m pairs
6. for each pair (a, b):
7. union(a, b) // a & b follow the same religion
8.
9. // the answer is the number of disjoint sets
10. answer := getNumberOfSets()
11.
12. return answer
As always, I will leave out the implementation specific details and encourage you to implement the data structure and the above algorithm yourself instead of copying someone’s code off the internet. If you don’t know where or how to start you can have a look at
my solution to the problem (in Java). Also, if you want to submit your solution at UVa Online Judge be mindful of the input and output specifications.
Have a nice day!