Question #1

A. First, we will put all of the \(S(u_i,u_j)\) values into an array that keeps track of the two people and their similarity level. If there is \(n\) people, the number of connections will be on the order of \(n^2\).

Now, we can use counting sort to sort all of these values from the largest to smallest. To use counting sort, we have to make sure these two assumptions are satisfied first:

1. The values must be an integer (we know this because of how the problem is set up).

2. We must know the range of the numbers.

We can find the range of these numbers in linear time by looking through every single value, and keeping track of the max number (at the time). So, in linear time, we can find the max value. Then, we can use counting sort which also costs linear time, so in total, the time for the sort is \(n^2\) (to find the max)\(+\) \(n^2\)(to do the sort) = \(O(n^2)\). Although it looks like it’s not, the efficiency is still linear.

Now, we take the largest value in our list and place those users into the same group. WE will use a graph data structure ot keep track of which nodes are connected to each other. This data structure will be explained in part b. The runtime for using this data structure is \(O(log^*(n))\).

So when we find the max value from the list, we put these two values into the same group, and update the data structre. THen, we find the next highest value, and check if they are in the same group. If they are, move on to the next one, and if they aren’t, put them in the same group.

We also have a counter that starts at \(n\) (number of users), and decrements each time, the two nodes are connected. WE stop once that counter equals \(k\), because that will mean we have \(k\) groups. So that means this program runs \(n-k\) times, before we get \(k\) groups, and then we can stop.

Runtime:

To sort this last of satisfaction levels: \(O(n^2)\).

Using the data structure to keep track of which nodes are connected: \(O(log^*n)\). We need to do this \(n-k\) times.

Total Runtime: \(O(n^2+(n-k)log^*n) = O(n^2)\)

B. Explanation of the data structure sto keep track of which nodes are connected:

Now to keep track of the users that are in a connected component together, we will use a grpah. INitially, each user will have its own individual node and it will have one edge that points to itself. Then each node will have a rep value. If two nodes ahve the same rep value, it is part of the same connected compoenent. To find a node’s rep value, you have to follow the edges that leave the node, and when you searched all the edges, the node you get to, will be the rep value.

So in the initial case, all the nodes point to themselves, so node A has rep value A. So all the nodes have different rep values, and they all start in their own connected components. Now when you want to join to connected components you take the pointer from one node and point it to the other one.

Example:

If you want to put \(A\) and \(B\) in the same component.

Make A’s pointer point to B, instead of itself.

Before the change: \(Rep(A) = A\), \(Rep(B) = B\)

After the change: \(Rep(A) = B\), \(Rep(B) = B\)

So now, they are in the same component.

Now we can create an optimization for this data structure. Assume A and B are in the same group and C and D are in a different group.

\(Rep(A)=Rep(B)=B\)

\(Rep(C)=Rep(D)=C\)

Now, we want ot join A and D:

Normally we would just make the \(Rep(D)\) point to A and that would make all the Rep values B. But the problem with this is that if you want to check the \(Rep(D)\) you would have to search through a whole tree which could take a good amount of time so instead we will perform an optimization.

What we do is we find the \(Rep(A)\), which is B. We save the \(Rep(D)\) value, which is C. Then we make both \(D\) and \(Rep(D)\) (assuming they are different) point directly to \(B\). So if we do this every time then when you are looking for the rep value of any node it only takes one move which makes this a very efficient algorithm.

Correctness:

We know this algo will give us k groups because we first start with n groups (each person is it own group) Then we connect two nodes which decreases the number of groups by 1. Continue this process until you are only left with \(k\). So that is \(n-k\) iterations.

Now that means we can only connect \(n-k\) users together if we want exactly \(k\) groups. So we order all of these similarity values and order them. Then we make sure the n-k highest ones are grouped together. We can’t connect any more nodes so the \((n-k-1)th\) highest one will be largest value given those two people aren’t already in the same group. We can’t add anymore edges in this graph or we will have less than \(k\) groups. And since we took out the highest values we are sure to minimize the the similarity of the most similar pair of users that have been assigned to different groups.