
Now the Lemma follows from the claim: $\sigma_{k+1}(A) \le \|A - B\|_2$. To see this, suppose not, and let $v^{(1)}, v^{(2)}, \ldots, v^{(k)}, v^{(k+1)}$ be the top $k+1$ singular vectors of $A$. Since $B$ has rank $k$, its null space has dimension at least $n - k$ and therefore contains a unit vector $v$ in the span of $v^{(1)}, \ldots, v^{(k+1)}$. For this $v$ we have $|Av| \ge \sigma_{k+1}(A)$, and so
$$|Bv| \ge |Av| - \|A - B\|_2 \ge \sigma_{k+1}(A) - \|A - B\|_2 > 0,$$
contradicting $Bv = 0$ and hence the hypothesis that the rank of $B$ is $k$.
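As a purely illustrative numerical check (our own sketch, not part of the text), one can verify the claim directly: for arbitrary rank-$k$ matrices $B$, the spectral norm $\|A - B\|_2$ dominates $\sigma_{k+1}(A)$, and the truncated SVD attains the bound with equality. All names and parameters below are our choices.

```python
import numpy as np

# Minimal numerical sketch of the claim (illustration only, not from the text):
# for ANY rank-k matrix B, sigma_{k+1}(A) <= ||A - B||_2, and the truncated
# SVD A_k attains this bound with equality.
rng = np.random.default_rng(0)
m, n, k = 8, 6, 3

A = rng.standard_normal((m, n))
sigma = np.linalg.svd(A, compute_uv=False)   # sigma[k] is sigma_{k+1}(A)

# A few arbitrary rank-k matrices B = X Y with X: m x k, Y: k x n.
for _ in range(5):
    B = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
    assert sigma[k] <= np.linalg.norm(A - B, 2) + 1e-9   # spectral norm

# Best rank-k approximation: ||A - A_k||_2 equals sigma_{k+1}(A) exactly.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k, 2), sigma[k])
```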
3.2 Clustering based on deterministic assumptions
We started earlier with a random generative model of the data, $A$. We used Random Matrix theory to show a bound on $\|A - EA\|$. Then we argued that $\hat{A}$, the best rank-$k$ approximation to $A$, is in fact close to $EA$ in spectral norm, and used this to cluster “most” points correctly. However, the “clean-up” of the misclassified points presents a technical hurdle which is often overcome by extra assumptions and involved technical arguments. Here we attempt to present a simple algorithm which classifies all points correctly at once. We start by making certain assumptions on the model; these assumptions are purely geometric: we do not assume any probabilistic model. Under these assumptions, we prove that a simple algorithm correctly classifies all the points. A new feature of this proof is the use of the “$\sin\Theta$” theorem from Numerical Analysis to argue that not only are the singular values of $\hat{A}$ and $EA$ close, but the spaces spanned by these two matrices are close too. However, our result currently does not subsume earlier results under the probabilistic model. [See discussion below.]
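The subspace closeness that the $\sin\Theta$ theorem guarantees can be observed numerically. The sketch below (a hedged illustration; all matrices, seeds, and parameters are our own choices, not from the text) perturbs a matrix with a well-separated top-$k$ spectrum and measures the principal angles between the top-$k$ singular subspaces before and after the perturbation.

```python
import numpy as np
from scipy.linalg import subspace_angles

# Hedged illustration of the "sin Theta" phenomenon (all parameters are our
# own choices): a perturbation E that is small relative to sigma_k moves the
# top-k left singular subspace only slightly, as measured by principal angles.
rng = np.random.default_rng(1)
m, n, k = 60, 40, 3

M = 5.0 * rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # rank-k signal
E = 0.1 * rng.standard_normal((m, n))                                # small noise

U, _, _ = np.linalg.svd(M, full_matrices=False)
Uhat, _, _ = np.linalg.svd(M + E, full_matrices=False)

# The sine of the largest principal angle between the two top-k subspaces
# is what Davis-Kahan / Wedin-style sin Theta theorems bound.
angles = subspace_angles(U[:, :k], Uhat[:, :k])
print(np.sin(angles).max())   # small when ||E||_2 << sigma_k(M)
```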
We are given $m$ points in $\mathbb{R}^n$ (as the rows of an $m \times n$ matrix $A$) and an integer $k$, and we want to cluster (partition) the points into $k$ clusters. As in generative models, we assume that there is an underlying (desirable) partition of $\{1, 2, \ldots, m\}$ into $T_1, T_2, \ldots, T_k$ which forms a “good” clustering, and the objective is to find precisely this clustering (with not a single “misclassified” point). For $r = 1, 2, \ldots, k$, define
$$\mu_r = \frac{1}{|T_r|} \sum_{i \in T_r} A^{(i)}$$
as the center (mean) of the points in the cluster. Let $C$ be the $m \times n$ matrix with $C^{(i)} = \mu_r$ for all $i \in T_r$. We will now state the assumptions under which we will prove that spectral clustering works. [We write assumptions of the form $a \in \Omega(b)$ below to mean that there is some constant $c > 0$ such that if the assumption $a \ge cb$ holds, then the assertions/algorithms work as claimed. Similarly for $a \in O(b)$.] We first assume
Assumption 0:
$$\|A - C\| = \Delta \le O(\sigma_k(C)/\log n).$$
[This is not a major assumption; see discussion below.] We note that $\|A - C\|^2$ can be viewed as the maximum, over all directions, of the total squared displacement of the points from their respective centers. So $\Delta$ being small says that the displacements of the $A^{(i)}$ from their respective centers are not “biased” towards any particular direction, but are instead spread out. [This is the intuition leading to the Wigner-type bound on the largest singular value of a random matrix.]
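To make the setup concrete, here is a minimal sketch on a hypothetical planted-cluster instance of our own construction (the cluster sizes, noise level, and seed are all assumptions): it builds the center matrix $C$ from a partition of the rows of $A$ and compares $\Delta = \|A - C\|$ against $\sigma_k(C)/\log n$ as in Assumption 0.

```python
import numpy as np

# Sketch of the setup under Assumption 0, on a hypothetical planted-cluster
# instance of our own construction: build C from the partition T_1, ..., T_k
# and compare Delta = ||A - C||_2 with sigma_k(C) / log n.
rng = np.random.default_rng(2)
k, n, per_cluster = 3, 20, 50
m = k * per_cluster

centers = 10.0 * rng.standard_normal((k, n))
labels = np.repeat(np.arange(k), per_cluster)    # T_r = {i : labels[i] == r}
A = centers[labels] + rng.standard_normal((m, n))

# mu_r = (1/|T_r|) sum_{i in T_r} A^{(i)};  row i of C is mu_{r(i)}.
mu = np.stack([A[labels == r].mean(axis=0) for r in range(k)])
C = mu[labels]

delta = np.linalg.norm(A - C, 2)                      # Delta = ||A - C||
sigma_k = np.linalg.svd(C, compute_uv=False)[k - 1]   # sigma_k(C)
print(delta, sigma_k / np.log(n))   # Assumption 0: Delta = O(sigma_k(C)/log n)
```

On well-separated planted clusters like this one, $\Delta$ stays comparable to $\sigma_k(C)/\log n$, which is why Assumption 0 is not a major restriction in such settings.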
Our main assumptions on the model are stated below.