\(\Rightarrow \sum_{x_i\in X}\left|x_i-x\right|=\sum_{x_i\in X,\ x_i>x}\left(x_i-x\right)-\sum_{x_i\in X,\ x_i<x}\left(x_i-x\right)\)
Here the first summation runs over the points with \(x_i>x\) and the second over the points with \(x_i<x\).
For this sum to be minimized, its derivative with respect to \(x\) must vanish. Each term of the first summation contributes \(-1\) to the derivative and each term of the second contributes \(+1\), so the derivative is zero exactly when the number of points greater than \(x\) equals the number of points less than \(x\), which is the defining property of the median.
Therefore \(x^{\ast}\) must be the median of the terms.
Hence Proved.
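As a quick numerical sanity check (not part of the proof), the following Python snippet verifies on a small hypothetical sample that the median attains the smallest sum of absolute distances among the sample points; `total_abs_distance` is just an illustrative helper name.

```python
# Numerical check: the median minimizes the sum of absolute distances.
import statistics

xs = [1.0, 2.0, 4.0, 7.0, 11.0]  # hypothetical sample


def total_abs_distance(x, points):
    """Sum of |x_i - x| over all points."""
    return sum(abs(xi - x) for xi in points)


median = statistics.median(xs)
best = min(xs, key=lambda c: total_abs_distance(c, xs))

print(total_abs_distance(median, xs))  # 15.0
print(best == median)                  # True: the median attains the minimum
```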
Exercise 2
part 1
Assumption: the data is sorted. Therefore each cluster consists of consecutive points.
We define UnitCost(i,j), with \(i\le j\), to be the cost of representing the points \(x_i,\dots,x_j\) by their representative \(\mu_{ij}\).
\(\mu_{ij}=\frac{1}{j-i+1}\sum_{l=i}^{j}x_l\)
Therefore,
UnitCost(i,j)=\(\sum_{l=i}^{j}\mathrm{distance}\left(x_l,\mu_{ij}\right)=\sum_{l=i}^{j}\left\|x_l-\mu_{ij}\right\|_2^2\)
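A minimal Python sketch of this definition could look as follows; it uses 0-based indices \(i\le j\) into the sorted list, and `unit_cost` is a hypothetical helper name. The direct summation takes \(O\left(n\right)\) time per call, which is what the running-time analysis below assumes.

```python
def unit_cost(points, i, j):
    """Squared-error cost of representing points[i..j] by their mean mu_ij."""
    segment = points[i:j + 1]
    mu = sum(segment) / len(segment)            # mu_ij
    return sum((x - mu) ** 2 for x in segment)  # sum of squared distances to mu_ij


# Example: cost of grouping the first three of five sorted points.
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
print(unit_cost(xs, 0, 2))  # mean 7/3, cost ~= 4.667
```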
Now, let \(C\left(i,l'\right)\) be the cost of optimally partitioning the first \(i\) points into \(l'\) clusters. Then the optimal clustering cost is given by the recurrence
\(C\left(i,l'\right)=\min_{l'-1\le j\le i-1\ }\left\{C\left(j,l'-1\right)+UnitCost\left(j+1,i\right)\right\}\)
If UnitCost() is precomputed, the recurrence can be evaluated in \(O\left(n^2l\right)\) time. The precomputation of UnitCost() takes \(O\left(n^3\right)\) time, because we need to compute the cost for \(O\left(n^2\right)\) pairs and each computation takes \(O\left(n\right)\) time.
Running time: \(O\left(n^3+n^2l\right)=O\left(n^3\right)\), since \(l\le n\).
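A self-contained Python sketch of this dynamic program is given below (0-based indexing; `cluster_1d` is a hypothetical name, not prescribed by the exercise). It precomputes the \(O\left(n^2\right)\) unit costs in \(O\left(n^3\right)\) time and then fills in the recurrence in \(O\left(n^2l\right)\) time, returning only the optimal cost; recovering the actual partition would additionally require storing the minimizing split points.

```python
def cluster_1d(points, num_clusters):
    """Minimum total within-cluster squared error for partitioning the sorted
    1-D points into `num_clusters` clusters of consecutive points."""
    xs = sorted(points)
    n = len(xs)

    # Precompute UnitCost(i, j) for all O(n^2) pairs, O(n) per pair.
    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            seg = xs[i:j + 1]
            mu = sum(seg) / len(seg)
            cost[i][j] = sum((x - mu) ** 2 for x in seg)

    INF = float("inf")
    # C[i][m] = optimal cost of splitting the first i points into m clusters.
    C = [[INF] * (num_clusters + 1) for _ in range(n + 1)]
    C[0][0] = 0.0
    for i in range(1, n + 1):
        for m in range(1, num_clusters + 1):
            for j in range(m - 1, i):
                # Last cluster covers xs[j..i-1]; the first j points form m-1 clusters.
                cand = C[j][m - 1] + cost[j][i - 1]
                if cand < C[i][m]:
                    C[i][m] = cand
    return C[n][num_clusters]


print(cluster_1d([1.0, 2.0, 4.0, 7.0, 11.0], 2))  # optimal split {1,2,4} | {7,11}
```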
The value \(\sum_{i=1}^{l}\sum_{x_j\in C_i}\left(x_j-\mu_i\right)^2\) is smallest when the number of clusters is largest; equivalently, the fewer elements each cluster contains, the lower this value becomes. Therefore, before forming the clusters we first need to determine how many clusters to form.
We have the restriction that each cluster should contain \(k\) elements, so the maximum number of clusters satisfying this condition is \(floor\left(\frac{n}{k}\right)\). Therefore we set the number of clusters \(l\) equal to this value.
\(l=floor\left(\frac{n}{k}\right)\)
To find the optimal clustering we need to assign elements of the input 1-D array into l clusters in such a way that the sum of squares of within-cluster distances is minimized.
We can use dynamic programming to solve this problem.
Polynomial run-time algorithm (a code sketch follows the listing):
Input: \(n\) points \(\{x_1,x_2,\dots,x_n\}\) and \(k\)
Output: A partition of \(X\) into \(l\) clusters \(P=\{C_1,C_2,\dots,C_l\}\)
1. \(l=floor\left(\frac{n}{k}\right)\)
2. Pick a set of \(l\) cluster centers \(\{c_1,c_2,\dots,c_l\}\)
3. repeat
4. Let cluster \(C_i=\left\{x\in X\ \middle|\ i=\arg\min_{j=1,\dots,l}\mathrm{distance}\left(x,c_j\right)\right\}\)
5. for \(i=1,\dots,l\) do
6. \(c_i=\frac{1}{\left|C_i\right|}\sum_{x\in C_i}x\)
7. until Convergence
In the above algorithm, we can compute the
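For concreteness, here is a minimal Python sketch of the listed steps (a Lloyd's-style iteration on 1-D points). The random initialization, the `max_iter` safeguard, and the equality test used as the convergence check are assumptions not specified in the listing, and, like the listing, the sketch only fixes the number of clusters to \(floor\left(\frac{n}{k}\right)\) rather than enforcing the per-cluster size directly.

```python
import random


def iterative_clustering(points, k, max_iter=100):
    """Partition 1-D points into l = floor(n / k) clusters by alternating
    assignment and center-update steps until the centers stop changing."""
    n = len(points)
    l = n // k                                  # step 1: l = floor(n / k)
    centers = random.sample(points, l)          # step 2: pick l cluster centers
    for _ in range(max_iter):                   # step 3: repeat
        # Step 4: assign every point to its nearest center.
        clusters = [[] for _ in range(l)]
        for x in points:
            i = min(range(l), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        # Steps 5-6: recompute each center as the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:              # step 7: until convergence
            break
        centers = new_centers
    return clusters


print(iterative_clustering([1.0, 2.0, 4.0, 7.0, 11.0], k=2))
```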
part 2
I do not know the solution.