ROUGH DRAFT authorea.com/6902

# Question #1

A. First, we will put all of the $$S(u_i,u_j)$$ values into an array that keeps track of the two people and their similarity level. If there is $$n$$ people, the number of connections will be on the order of $$n^2$$.

Now, we can use counting sort to sort all of these values from the largest to smallest. To use counting sort, we have to make sure these two assumptions are satisfied first:

1. The values must be an integer (we know this because of how the problem is set up).

2. We must know the range of the numbers.

We can find the range of these numbers in linear time by looking through every single value, and keeping track of the max number (at the time). So, in linear time, we can find the max value. Then, we can use counting sort which also costs linear time, so in total, the time for the sort is $$n^2$$ (to find the max)$$+$$ $$n^2$$(to do the sort) = $$O(n^2)$$. Although it looks like it’s not, the efficiency is still linear.

Now, we take the largest value in our list and place those users into the same group. WE will use a graph data structure ot keep track of which nodes are connected to each other. This data structure will be explained in part b. The runtime for using this data structure is $$O(log^*(n))$$.

So when we find the max value from the list, we put these two values into the same group, and update the data structre. THen, we find the next highest value, and check if they are in the same group. If they are, move on to the next one, and if they aren’t, put them in the same group.

We also have a counter that starts at $$n$$ (number of users), and decrements each time, the two nodes are connected. WE stop once that counter equals $$k$$, because that will mean we have $$k$$ groups. So that means this program runs $$n-k$$ times, before we get $$k$$ groups, and then we can stop.

Runtime:

To sort this last of satisfaction levels: $$O(n^2)$$.

Using the data structure to keep track of which nodes are connected: $$O(log^*n)$$. We need to do this $$n-k$$ times.

Total Runtime: $$O(n^2+(n-k)log^*n) = O(n^2)$$

B. Explanation of the data structure sto keep track of which nodes are connected:

Now to keep track of the users that are in a connected component together, we will use a grpah. INitially, each user will have its own individual node and it will have one edge that points to itself. Then each node will have a rep value. If two nodes ahve the same rep value, it is part of the same connected compoenent. To find a node’s rep value, you have to follow the edges that leave the node, and when you searched all the edges, the node you get to, will be the rep value.

So in the initial case, all the nodes point to themselves, so node A has rep value A. So all the nodes have different rep values, and they all start in their own connected components. Now when you want to join to connected components you take the pointer from one node and point it to the other one.

Example:

If you want to put $$A$$ and $$B$$ in the same component.

Make A’s pointer point to B, instead of itself.

Before the change: $$Rep(A) = A$$, $$Rep(B) = B$$

After the change: $$Rep(A) = B$$, $$Rep(B) = B$$

So now, they are in the same component.

Now we can create an optimization for this data structure. Assume A and B are in the same group and C and D are in a different group.

$$Rep(A)=Rep(B)=B$$

$$Rep(C)=Rep(D)=C$$

Now, we want ot join A and D:

Normally we would just make the $$Rep(D)$$ point to A and that would make all the Rep values B. But the problem with this is that if you want to check the $$Rep(D)$$ you would have to search through a whole tree which could take a good amount of time so instead we will perform an optimization.

What we do is we find the $$Rep(A)$$, which is B. We save the $$Rep(D)$$ value, which is C. Then we make both $$D$$ and $$Rep(D)$$ (assuming they are different) point directly to $$B$$. So if we do this every time then when you are looking for the rep value of any node it only takes one move which makes this a very efficient algorithm.

Correctness:

We know this algo will give us k groups because we first start with n groups (each person is it own group) Then we connect two nodes which decreases the number of groups by 1. Continue this process until you are only left with $$k$$. So that is $$n-k$$ iterations.

Now that means we can only connect $$n-k$$ users together if we want exactly $$k$$ groups. So we order all of these similarity values and order them. Then we make sure the n-k highest ones are grouped together. We can’t connect any more nodes so the $$(n-k-1)th$$ highest one will be largest value given those two people aren’t already in the same group. We can’t add anymore edges in this graph or we will have less than $$k$$ groups. And since we took out the highest values we are sure to minimize the the similarity of the most similar pair of users that have been assigned to different groups.

# Question #2

A.

To solve this problem we can just solve the traveling salesman problem. In addition, because this is just a MST we do not need this to be a cycle. So we can delete the edge with the highest value, and that will destroy the cycle, but we will end up with a MST since each node has a max degree of 2.

So we can solve the traveling salesman problem using dynamic programming.

Subproblems will be subsets of the vertices $$S$$ from $$(1,2,....,n)$$ that includes the origin and a specific node j. Let $$C(S,j)$$ be the length of the shortest path visiting each node in S starting at 1 and stopping at $$j$$.

So $$C(S,j)$$ = $$min(C(S-j,i)+d_{ij})$$ for each $$i$$, where $$i$$ are all the neighbors of $$j$$.

$$d_{ij}$$, is the distance from $$i$$ to $$j$$. You must find a path to a neighbor $$i$$, and then add it to the cost $$d_{ij}$$.

This will give us solution of Traveling Salesman Problem so you get the lowest cost tour of this graph. So all we have to do is delete the most expensive edge in the tour. That will give us the minimum spanning tree with degree 2.

Showing that it is NP-Complete:

If there are at most 2 connections for each node in a spanning tree in $$G$$ then $$G$$ must also have a Rudrata Path. The Rudrata path will be exactly the graph of $$G$$ because a spanning tree hits all the nodes so if we follow this path we find the Rudrata path.

If you can’t find a graph with at most 2 connections per node and is a spanning tree in $$G$$ then a Rudrata path does not exist in $$G$$. Assume that there is not a spanning tree in $$G$$ with at most 2 connections per node but there is a Rudrata path in $$G$$. A Rudrata Path visits all the nodes in $$G$$ and has no cycles. Thus the degree of each node in the Rudrata path is at most 2 which would also be a spanning tree with at most 2 connections per node which is a contradiction.

B.

If we have the solution for TSP l then we can find the solution to the 2 degree MST easily because $$TSP=MST$$ with a degree of 2.

For $$d>2$$:

Look at all the vertices that only have one vertex coming out of it. Then check all the edges leaving said vertex and then find the shortest one if it is shorter than the one already in the MST, swap them and that will decrease the total cost. Then continue to look at all the vertices with 1 edge and continue this process. Before you add an edge, make sure the vertex you are going to add it to only has a $$d-1$$ degree so when you add the edge the degree will be a max of $$d$$.

Then for the nodes with more than one edge coming out of it we have to look at cuts.

Consider a graph where there is a node E, that has outward edges to 3 other different nodes A,B,C. Compare the edge that exists in the MST and is going into a node A,B, or C, with the ones that are going into A,B, or C, but are not in the MST. If there exists an edge not in the MST which is cheaper and it wont break the degree constraint, then swap them (them being the edge in the MST and the one that isnt).

# Question #3

1.

Each person is a vertex and there exists an edge between two vertices if there is a conflict between those two people. The weight of each edge is 1. We want to split all the vertices into two clusters. The cost function is the number of edges that go from one cluster to another.

2.

The local search algorithm is to pick a random solution and then create two values for each node. A random solution splits the nodes into two different clusters. The first value is the number of edges that go to the other node in its own cluster. The second value is the number of edges that go a node in the other cluster. If the first value $$>$$ the second value, that means switching this node to the other table would increase the number of edges that cross the clusters and that is what we want to do.

So basically with the local search, we look for a vertex that would increase the number of edges between the partitions if we swapped them.

The algorithm runs in polynomial time because when we check if we can move a vertex, we basically have to look at every single edge. For each vertex, we need to look at the first and second value which are derived by the number of edges. So for each iteration we have to look at all the edges and if this is a dense graph that will be $$n^2$$ operations where $$n$$ is the number of vertices.

Now we know that all the edge weights are 1, and since we only iterate through this program if we are going to increase the cost, that means every iteration must increase the cost by 1.

So that means if there is a total of $$n^2$$ edges, the total cost this problem could have is $$n^2$$. So in the worst case our random split starts with 0 cost and every single iteration only increases the cost by 1. Then this means the maximum number of iterations we can have is $$n^2$$.

So total runtime is $$(n^2*n^2)=O(n^4)$$ which is polynomial.

To show that it provides a 2-approximation solution:

First we know that the sum of all the edges weights >= optimal value. Since the cost is derived from the number of edges, it can’t exceed the sum of the edge weights. Also we know that if we reached a locally optimal solution, that means if you swap two vertices, the total cost must decrease.

Let’s call our two sets be call $$A$$ and $$B$$. If we are a locally optimum solution and we move the node u from A to B, we get:

$$\sum\nolimits_{u \in A, (u,v) \in E}w_{u,v} >= \sum\nolimits_{u \in B, (u,v) \in E}w_{u,v}$$

$$\sum\nolimits_{u \in A, (u,v) \in E}w_{u,v} >= \sum\nolimits_{u \in B, (u,v) \in E}w_{u,v}+\sum\nolimits_{u \in A, (u,v) \in E}w_{u,v}$$

$$\sum\nolimits_{u \in A, (u,v) \in E}w_{u,v} >= \frac{1}{2}\sum\nolimits_{u:(u,v) \in E}w_{u,v}$$

Similarly, for any $$v' \in A$$:

$$\sum\nolimits_{u \in B, (u,v') \in E}w_{u,v'} >= \frac{1}{2}\sum\nolimits_{u:(u,v') \in E}w_{u,v}$$

Summing over all vertices, we obtain the desired result:

$$2c(B) = 2\sum\nolimits_{u \in B, v \in A, (u,v) \in E}w_{u,v} >=\sum\nolimits_{e \in E}w_e >= OPT$$

3.

For the greedy algorithm, start off with both sets A and B empty. Then pick a random vertex and place it in set A. Then pick the next vertex and put it in B so that it will maximize the number of edges between A and B. After this, choose and vertex that will maximize the number of edges between the two sets and place it in the set that maximizes the number of edges. Continue this operation until you run out of vertices.

This is a polynomial time algorithm. There are a total number of $$n$$ iterations, one for each time you pick a vertex. Then to pick a vertex you must look at every single edge and worst case there will be $$n^2$$ number of edges. So that means in total there should be $$n^3$$ operations, which is polynomial.

This is a 2-approximation algorithm. Let the total number of edges in the graph be $$E$$. Let $$n$$ be the number of nodes. For every single edge in the graph, it has two nodes. Let’s label the first node as the $$Start$$ node and the other node at the $$End$$ node.

When the $$End$$ node gets added to the graph problem, the edge is created. Now let $$e_i$$ be the number of edges that the node $$i$$, is considered the $$End$$ node. Since every single edge has one $$End$$ node, that means $$\sum\limits_{i=1}^n e_i = E$$, which is the total number of edges.

Now since we are looking at a greedy algorithm whenever node $$i$$ is added to the sets, the number of edges that are added to this cost will be $$\frac{e_i}{2}$$. This happens because this is a greedy choice. The greedy choice will always choose the set that will create the highest cost. In the worst case, both sets have split $$e_i$$ evenly, so no matter which set you choose you only increase the cost by $$\frac{e_i}{2}$$. But that is the smallest because it will be impossible for both sets to contain less than the average number of nodes that correspond to $$e_i$$ per set.

So the number of edges that cross A and B = sum of the number of edge