Spark — Running K Means

Ganesh Walavalkar
May 30, 2023

The most frequently asked question: how do I run an ML algorithm on a huge set of data?

There are many ways to do that; however, Spark is by far the most popular choice for this kind of problem.

What is k-means? Well, read that somewhere else. Here is my post about all clustering algorithms.

k-means clusters, taken from Wikipedia (link)
1. Some housekeeping first

Let us check our deployments and pods

$ kubectl get deployments

*** output ***
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
spark-master   1/1     1            1           53d
spark-worker   2/2     2            2           53d

$ kubectl get pods -o wide

*** output ***
NAME                            READY   STATUS    RESTARTS      AGE   IP            NODE       NOMINATED NODE   READINESS GATES
spark-master-5b75d56678-npgrf   1/1     Running   2 (12h ago)   53d   10.244.0.18   minikube   <none>           <none>
spark-worker-899c4d88b-ghklm    1/1     Running   4 (12h ago)   53d   10.244.0.24   minikube   <none>           <none>
spark-worker-899c4d88b-krx9l    1/1     Running   4 (12h ago)   53d   10.244.0.23   minikube   <none>           <none>

2. Log in to spark-master-5b75d56678-npgrf to check if numpy is installed

While running the k-means example, I discovered that the program failed because numpy was not installed. When you try the other examples available in /opt/spark/examples/src/main/python/mllib, additional libraries may be missing as well. For example, naive_bayes_example.py needs NaiveBayes and MLUtils. Each program dictates which libraries are required.
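
An easy way to check is to open a Python shell inside the master pod (for example via kubectl exec -it spark-master-5b75d56678-npgrf -- python3) and try the import:

import numpy
print(numpy.__version__)  # raises ModuleNotFoundError if numpy is not installed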

For k-means, we need numpy, as can be seen from these imports:

from numpy import array
from math import sqrt
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel
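
To see how these pieces fit together, here is a minimal sketch in the spirit of the mllib example, continuing from the imports above but using a small in-memory RDD instead of the input file the example reads:

sc = SparkContext(appName="KMeansSketch")

# Six 3-d points forming two well-separated clusters
points = sc.parallelize([
    array([0.0, 0.0, 0.0]), array([0.1, 0.1, 0.1]), array([0.2, 0.2, 0.2]),
    array([9.0, 9.0, 9.0]), array([9.1, 9.1, 9.1]), array([9.2, 9.2, 9.2]),
])

# Train a k-means model with k=2
model = KMeans.train(points, 2, maxIterations=10)

# Within Set Sum of Squared Errors: distance from each point to its center
def error(point):
    center = model.clusterCenters[model.predict(point)]
    return sqrt(sum((point - center) ** 2))

wssse = points.map(error).reduce(lambda a, b: a + b)
print("Within Set Sum of Squared Error = " + str(wssse))
print("Cluster centers: " + str(model.clusterCenters))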

As mentioned earlier, I ran the program without inspecting the code and discovered that numpy was required. Alternatively, you can inspect the code first and ensure all the necessary packages are installed. The latter approach is better.

I installed numpy with the following command:

# python3 -m pip install -U numpy

*** output ***
Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (1.24.3)
Collecting numpy
Using cached numpy-1.24.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Downloading numpy-1.24.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
|████████████████████████████████| 17.3 MB 4.2 MB/s

Note: it says ‘Requirement already satisfied’ because I had already installed it once; just ignore that.

3. Submit the job

In a previous post, I showed how to submit a job from the container itself. Here, however, we will submit it from the server.

$ kubectl exec --stdin --tty spark-master-5b75d56678-npgrf -- spark-submit --master local[8] /opt/spark/examples/src/main/python/ml/kmeans_example.py

*** output ***
23/05/30 06:44:12 INFO CodeGenerator: Code generated in 2.855479 ms
23/05/30 06:44:12 INFO TorrentBroadcast: Destroying Broadcast(25) (from destroy at ClusteringMetrics.scala:402)
Silhouette with squared euclidean distance = 0.9997530305375207
23/05/30 06:44:12 INFO BlockManagerInfo: Removed broadcast_25_piece0 on spark-master-5b75d56678-npgrf:43131 in memory (size: 516.0 B, free: 413.8 MiB)
Cluster Centers:
[9.1 9.1 9.1]
[0.1 0.1 0.1]
23/05/30 06:44:12 INFO BlockManagerInfo: Removed broadcast_26_piece0 on spark-master-5b75d56678-npgrf:43131 in memory (size: 24.1 KiB, free: 413.8 MiB)
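
For reference, the submitted kmeans_example.py (the DataFrame-based ml version) boils down to roughly the following. This is a paraphrased sketch, and the bundled data path may differ by Spark version; it shows where the Silhouette line and the cluster centers in the output above come from:

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# Load the small sample dataset shipped with Spark (libsvm format)
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train a k-means model with k=2
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Evaluate clustering quality with the Silhouette score
predictions = model.transform(dataset)
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Print the cluster centers
print("Cluster Centers:")
for center in model.clusterCenters():
    print(center)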

I submitted the job using kubectl, which again is not optimal, but it serves to show that the Spark cluster (1 master + 2 workers) can run a machine learning algorithm on a minikube cluster, and hence, by extension, on any Kubernetes cluster, which was the purpose of this post. One caveat: --master local[8] runs the job in local mode on the master pod with 8 threads; to actually distribute the work across the two workers, point --master at the standalone cluster URL (typically spark://<master-host>:7077).

For a better example of submitting Spark jobs, please visit: submitting applications.
