Spark on Minikube

Ganesh Walavalkar
Apr 7, 2023

Kubernetes is by far the most commonly used orchestration engine for containers. Minikube is a local Kubernetes distribution, focused on deploying Kubernetes on a single machine and making it easy to learn. Spark is one of the most commonly used compute engines for data processing. This blog post outlines how to deploy Spark on Minikube. Of course, this is not meant for production deployments; the subtleties of scale and performance will be discussed in later posts.

This post borrows heavily from Michael Herman’s post “Deploying Spark on Kubernetes” and his repository “spark-kubernetes”.

Before proceeding with this post, ensure minikube is installed on the node. Details on how to do that are available here.
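
For reference, on a Linux x86-64 machine, minikube can typically be installed with the following commands from the minikube documentation (adjust for other platforms):

$ curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
$ sudo install minikube-linux-amd64 /usr/local/bin/minikube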

Now please follow the steps in the sequence given here:

1. Install docker using this link. Please use apt/apt-get to install docker; do not use snap. If snap is used to install docker, there is a possibility that the certificate is parsed incorrectly by AppArmor. For more information, please review this thread.
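
As a rough sketch, one apt-based route (assuming Ubuntu and the docker.io package from the Ubuntu archive) looks like this:

$ sudo apt-get update
$ sudo apt-get install -y docker.io
$ sudo usermod -aG docker $USER   # log out and back in for the group change to apply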

2. Assuming that minikube was successfully installed, start minikube with the following command:

$ minikube start

Memory and CPU limits can be imposed on this command as desired (a sketch follows); however, for this experiment the suggestion is to start with the command as given. Since docker is used for this experiment, the startup messages will show docker as the driver.
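
If limits are desired, a minimal sketch with explicit (and arbitrary) values would be:

$ minikube start --driver=docker --cpus=4 --memory=8g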

3. Download the code from this location: https://github.com/wganesh/sparkonminikube, and unzip the code. The code can also be pulled with git.
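
For example, assuming git is available (the .git URL simply follows from the repository location above):

$ git clone https://github.com/wganesh/sparkonminikube.git
$ cd sparkonminikube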

4. Next, build the docker image. The first command points the local docker CLI at minikube's docker daemon, so the resulting image is available inside the cluster:

$ eval $(minikube docker-env)
$ docker build -f docker/Dockerfile -t spark-hadoop:3.2.0 ./docker

5. Verify that the docker image was built correctly:

$ docker image ls spark-hadoop

*** Output ***
REPOSITORY TAG IMAGE ID CREATED SIZE
spark-hadoop 3.2.0 ac039deb4e90 4 seconds ago 1.13GB

6. Create the Spark master deployment, start the Spark master service, and create the Spark worker deployment with the following commands:

$ kubectl create -f ./kubernetes/spark-master-deployment.yaml
$ kubectl create -f ./kubernetes/spark-master-service.yaml
$ kubectl create -f ./kubernetes/spark-worker-deployment.yaml
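
Optionally, wait for the rollouts to complete before verifying (the deployment names are the ones shown in the next step):

$ kubectl rollout status deployment/spark-master
$ kubectl rollout status deployment/spark-worker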

7. Verify that all three steps were successful:

$ kubectl get deployments

*** Output ***
NAME READY UP-TO-DATE AVAILABLE AGE
spark-master 1/1 1 1 4h44m
spark-worker 2/2 2 2 4h43m

$ kubectl get pods

*** Output ***
NAME READY STATUS RESTARTS AGE
spark-master-5b75d56678-npgrf 1/1 Running 0 4h44m
spark-worker-899c4d88b-ghklm 1/1 Running 0 4h43m
spark-worker-899c4d88b-krx9l 1/1 Running 0 4h43m

8. To configure the ingress object, which is essential for accessing the Spark web UI on port 8080 (or whichever port is configured), first enable the ingress addon and then create the ingress object:

$ minikube addons enable ingress
$ kubectl apply -f ./kubernetes/minikube-ingress.yaml
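
Once applied, the ingress object can be confirmed with:

$ kubectl get ingress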

9. Update the /etc/hosts file to route requests for the hostname “spark-kubernetes” to the minikube instance, i.e. the IP address behind “spark-kubernetes” should be resolvable by the host OS when the name is typed in a browser.

$ echo "$(minikube ip) spark-kubernetes" | sudo tee -a /etc/hosts
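
To confirm the entry took effect, the name can be resolved from the shell (getent consults /etc/hosts):

$ getent hosts spark-kubernetes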

10. Run the following command to get the IP address of the spark-master pod. In this scenario it is 10.244.0.5:

$ kubectl get pods -o wide

*** Output ***
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
spark-master-5b75d56678-npgrf 1/1 Running 0 8m17s 10.244.0.5 minikube <none> <none>
spark-worker-899c4d88b-ghklm 1/1 Running 0 7m41s 10.244.0.7 minikube <none> <none>
spark-worker-899c4d88b-krx9l 1/1 Running 0 7m41s 10.244.0.6 minikube <none> <none>

11. After this step, the Spark web UI should be available at http://spark-kubernetes/.
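
If a browser is not handy, a quick check from the terminal should return the master UI page (the exact HTML will vary by Spark version):

$ curl -s http://spark-kubernetes/ | head -n 5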

12. Finally, test Spark with PySpark; some prior knowledge of Spark and Python is expected here. Substitute the pod name and IP address from the previous steps:

$ kubectl exec spark-master-5b75d56678-npgrf -it -- \
    pyspark --conf spark.driver.bindAddress=10.244.0.5 --conf spark.driver.host=10.244.0.5

Console (type the following commands):
>>> words = 'the quick brown fox jumps over the lazy dog the quick brown fox jumps over the lazy dog'
>>> sc = SparkContext.getOrCreate()
>>> seq = words.split()
>>> data = sc.parallelize(seq)
>>> counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
>>> dict(counts)
{'quick': 2, 'the': 4, 'brown': 2, 'fox': 2, 'jumps': 2, 'over': 2, 'lazy': 2, 'dog': 2}
>>> sc.stop()
>>> quit()
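
Rather than copying the pod name and IP by hand, both can be looked up with kubectl. The sketch below assumes the master pods carry a component=spark-master label, which may differ in the actual manifests:

$ MASTER_POD=$(kubectl get pods -l component=spark-master -o jsonpath='{.items[0].metadata.name}')
$ MASTER_IP=$(kubectl get pod $MASTER_POD -o jsonpath='{.status.podIP}')
$ kubectl exec $MASTER_POD -it -- pyspark \
    --conf spark.driver.bindAddress=$MASTER_IP --conf spark.driver.host=$MASTER_IP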

A few additional screenshots help to figure out whether the entire deployment is working as expected:

a. Kubernetes (minikube) workload

b. Kubernetes (minikube) pods status

c. Ingress object

If there is more red/yellow on these screens than green, something hasn’t gone as planned.
