# Bisecting Kmeans Cluster Indices in Apache Spark

# Apache Spark Bisecting Kmeans Dendogram from Linkage Matrix

I am attempting to produce a dendogram of Bisecting Kmeans Clustering results in Spark. I have found a few variations of this question online, like here and there is a JIRA request here. But I have not been able to find any others with a working solutions.

To try and achieve this I have compiled Spark MLlib 2.2.0 with yu-iksw's toLinkageMatrix function for Spark and made some changes to the log output to produce more information about the Bisecting Clustering Selection Process. I have uploaded this Jar with a sample SBT build for test purposes so that anyone interested in helping out who does not want to rebuild the Spark MLlib from source can run their own tests. You can see in my sbt build on my github repo, the mllib and mllib-local jars are in the /lib folder.

To plot my test linkage matrix output I am using jupyter-notebook and manually passing the Spark Linkage outputs to a scipy-dendogram. The jupyter notebooks are also in my github repo here .

In a nutshell my test outputs using the Iris Dataset appear valid when I am using 3-4 clusters, but when I try 5 or more clusters the linkage matrix fails to produce valid cluster indices. I have tried a few different approaches to solve this problem, like altering the toLinkageMatrix selection process and toArray functions which it calls on, to no avail.

I have a decent conceptual understanding of Bisecting K-Means clustering but am having a tough time tracking down where exactly/why the linkage matrix is failing in Spark. You can see my complete spark code if you look at my spark-notebook HTMLs, also on my github repo.

The complete Spark 2.2.0 source used to compile is also in my repo here

The main source files that were altered are here
```
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
```

and here
```
mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
```

Here are my cluster outputs using both 3 and 10 clusters.

Note: I also re-wrote the scipy dendogram test function to illustrate exactly where and why the z-linkage failure occurs when plotting.

## Iris Data, 3 Cluster Outputs

```
COST
84.20375254574043
CLUSTER CENTERS
Sepal_Length Sepal_Width Petal_Length Petal_Width
5.01 3.37 1.56 0.29
5.95 2.77 4.45 1.45
6.85 3.07 5.74 2.07
ADJACENCY MATRIX
FromNodeID toNodeID distance
0 1 2.540618378626947
0 2 2.540618378626947
2 3 1.044390196577994
2 4 1.044390196577994
LOG OUTPUTS
Feature dimension: 4.
Number of points: 150.
Initial cost: 681.3705999999911.
The minimum number of points of a divisible cluster is 1.
Dividing 1 clusters on level 1.
Dividing 1 clusters on level 2.
The divisible clusters needed for this iteration were : d = 1, cost =681.3705999999911, size = 150
The divisible clusters needed for this iteration were : d = 3, cost =123.79587628866193, size = 97
LINKAGE MATRIX
node1 node2 distance tree_size
1 2 1.044390196577994 2
0 3 2.540618378626947 3
SCIPY DENDOGRAM
```

## Iris Data, 10 Cluster Outputs

```
COST
27.981293071222368
CLUSTER CENTERS (rounded)
Sepal_Length Sepal_Width Petal_Length Petal_Width
4.68 3.08 1.45 0.2
5 2.4 3.2 1.03
5.07 3.46 1.44 0.28
5.4 3.89 1.51 0.27
5.6 2.66 4.05 1.25
6.01 2.71 4.95 1.79
6.4 2.97 4.55 1.41
6.49 2.9 5.37 1.8
6.61 3.16 5.57 2.29
7.48 3.13 6.3 2.05
ADJACENCY MATRIX
fromNodeID toNodeID distance
0 4 1.7986418455383477
0 5 1.7986418455383477
1 6 0.32116390552184665
1 7 0.32116390552184665
2 0 0.48992124227033834
2 1 0.48992124227033834
3 2 2.540618378626947
3 10 2.540618378626947
8 12 0.36029111643410283
8 13 0.36029111643410283
LOG OUTPUTS
Feature dimension: 4.
Number of points: 150.
Initial cost: 681.3705999999911.
The minimum number of points of a divisible cluster is 1.
Dividing 1 clusters on level 1.
Dividing 2 clusters on level 2.
Dividing 4 clusters on level 3.
Dividing 2 clusters on level 4.
d =The divisible clusters needed for this iteration were :
d = 1, cost =681.3705999999911, size = 150
The divisible clusters needed for this iteration were :
d = 3, cost =123.79587628866193, size = 97
The divisible clusters needed for this iteration were :
d = 4, cost =13.72863636363627, size = 22
The divisible clusters needed for this iteration were :
d = 13, cost =10.73588235294028, size = 34
LINKAGE MATRIX
node1 node2 distance tree_size
2 3 0.32116390552184665 2
7 8 0.34473006617374347 2
5 6 0.36029111643410283 2
17 10 0.48992124227033834 4
4 12 0.5802085628837165 3
11 9 0.839611851358609 3
14 15 1.044390196577994 6
0 1 1.7986418455383477 2
13 16 2.540618378626947 10
SCIPY DENDOGRAM FAILURE TEST FUNCTION OUTPUTS
[2. 3. 0.32116391 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
2.0 >= 10+0
= False
Checking .... if indice B >= # of clusters + iteration we are on
3.0 >= 10+0
= False
-------------------------------------------------
[7. 8. 0.34473007 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
7.0 >= 10+1
= False
Checking .... if indice B >= # of clusters + iteration we are on
8.0 >= 10+1
= False
-------------------------------------------------
[5. 6. 0.36029112 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
5.0 >= 10+2
= False
Checking .... if indice B >= # of clusters + iteration we are on
6.0 >= 10+2
= False
-------------------------------------------------
[17. 10. 0.48992124 4. ]
Checking.... if indice A >= # of clusters + iteration we are on
17.0 >= 10+3
= True
Checking .... if indice B >= # of clusters + iteration we are on
10.0 >= 10+3
= False
-------------------------------------------------
[ 4. 12. 0.58020856 3. ]
Checking.... if indice A >= # of clusters + iteration we are on
4.0 >= 10+4
= False
Checking .... if indice B >= # of clusters + iteration we are on
12.0 >= 10+4
= False
-------------------------------------------------
[11. 9. 0.83961185 3. ]
Checking.... if indice A >= # of clusters + iteration we are on
11.0 >= 10+5
= False
Checking .... if indice B >= # of clusters + iteration we are on
9.0 >= 10+5
= False
-------------------------------------------------
[14. 15. 1.0443902 6. ]
Checking.... if indice A >= # of clusters + iteration we are on
14.0 >= 10+6
= False
Checking .... if indice B >= # of clusters + iteration we are on
15.0 >= 10+6
= False
-------------------------------------------------
[0. 1. 1.79864185 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
0.0 >= 10+7
= False
Checking .... if indice B >= # of clusters + iteration we are on
1.0 >= 10+7
= False
-------------------------------------------------
[13. 16. 2.54061838 10. ]
Checking.... if indice A >= # of clusters + iteration we are on
13.0 >= 10+8
= False
Checking .... if indice B >= # of clusters + iteration we are on
16.0 >= 10+8
= False
-------------------------------------------------
```