Spark — Reading data from file

Ganesh Walavalkar
May 30, 2023

If you are not familiar with running Spark on Minikube, I recommend reading my previous post, Spark on Minikube.

In this post we will read data from an external file and count the words. This demonstrates how to read from an external file and also how to aggregate the results.
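Before we run the bundled example, here is a rough sketch of the same idea in PySpark using the DataFrame API, so you can see what "read a file and aggregate" looks like in code. The file name demo.txt and the application name are placeholders for illustration; the example we actually run later in this post uses the lower-level RDD API instead.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

# Build a SparkSession (sketch only; defaults to whatever master is configured).
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# Read the external file; each line becomes a row with a single 'value' column.
lines = spark.read.text("demo.txt")

# Split each line on whitespace, explode into one word per row, then aggregate.
counts = (lines
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count())

counts.show(truncate=False)
spark.stop()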

Just as in the previous post, let us check a few things first:

1. Docker image

$ docker image ls spark-hadoop

*** output ***
REPOSITORY TAG IMAGE ID CREATED SIZE
spark-hadoop 3.2.0 ac039deb4e90 7 weeks ago 1.13GB

2. Kubernetes deployments and pods

$ kubectl get deployments

*** output ***
NAME READY UP-TO-DATE AVAILABLE AGE
spark-master 1/1 1 1 53d
spark-worker 2/2 2 2 53d

$ kubectl get pods -o wide

*** output ***
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
spark-master-5b75d56678-npgrf 1/1 Running 2 (10h ago) 53d 10.244.0.18 minikube <none> <none>
spark-worker-899c4d88b-ghklm 1/1 Running 4 (10h ago) 53d 10.244.0.24 minikube <none> <none>
spark-worker-899c4d88b-krx9l 1/1 Running 4 (10h ago) 53d 10.244.0.23 minikube <none> <none>

3. Now the fun part. Log in to the master pod and create a file

Note that I am using this technique just for demonstration. For those who are not familiar with Kubernetes, it is also a good learning exercise.

1 $ kubectl exec --stdin --tty spark-master-5b75d56678-npgrf -- /bin/bash

2 root@spark-master-5b75d56678-npgrf:/# cd opt/spark/examples/

root@spark-master-5b75d56678-npgrf:/opt/spark/examples#

Using the first command (1), I logged into the master pod and started bash. Note that the pod name can be obtained from the output of the ‘kubectl get pods’ command shown earlier.

Once logged in, you can change the directory to /opt/spark/examples, which is where most of the examples live. The master pod behaves just like any other Linux machine, but be careful while modifying anything here, as it can jeopardize future updates to the pods.

# touch demo.txt
# echo 'The rarest of all human qualities is consistency.' > demo.txt
# cat demo.txt
The rarest of all human qualities is consistency.

Execute all the above commands inside spark-master-5b75d56678-npgrf to create the file.

4. Run word count program

The examples directory contains many examples of SQL, Streaming, and Machine Learning, written in Java, Python, and Scala.

To run the word count program, run the following command:

# spark-submit --master local[8] src/main/python/wordcount.py demo.txt

*** output ***
...
23/05/30 04:48:58 INFO DAGScheduler: Job 0 finished: collect at /opt/spark-3.2.0-bin-hadoop2.7/examples/src/main/python/wordcount.py:38, took 1.329756 s
The: 1
rarest: 1
of: 1
all: 1
human: 1
qualities: 1
is: 1
consistency.: 1
...

Of course the output shown here is truncated; you will see the entire output on your screen. I wish the example had not kept the ‘.’ at the end of the sentence, however it is just an example…
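For what it is worth, the bundled example simply splits each line on spaces, which is why ‘consistency.’ keeps its period. A small variation that normalizes tokens before counting could look roughly like this (my own sketch, not part of the Spark distribution):

import re
import sys
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-normalized").getOrCreate()

# Read the input file into an RDD of lines, like the bundled example does.
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

# Lowercase and split on non-word characters so trailing punctuation
# such as 'consistency.' is stripped before counting.
counts = (lines
          .flatMap(lambda line: re.split(r"\W+", line.lower()))
          .filter(lambda w: w)  # drop empty tokens produced by the split
          .map(lambda w: (w, 1))
          .reduceByKey(add))

for word, count in counts.collect():
    print(f"{word}: {count}")

spark.stop()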

The job is submitted here using the ‘local’ Spark master; in production no one would do that. However, this gives a nice peek into how the containers run.
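To make the difference concrete, here is a rough sketch of what pointing the driver at the standalone master instead of local[8] might look like. The spark://spark-master:7077 URL is an assumption about how the master Service from the previous post is named and exposed, and the input path is a placeholder.

from pyspark.sql import SparkSession

# Sketch only: 'spark-master:7077' assumes the standalone master is reachable
# under that Service name and default port inside the cluster.
spark = (SparkSession.builder
         .appName("wordcount-on-cluster")
         .master("spark://spark-master:7077")
         .getOrCreate())

# With a real cluster master the executors run on the worker pods, so the
# input must live somewhere every executor can reach (a shared volume or
# object storage), not in a file created on the master pod alone.
lines = spark.read.text("/path/visible/to/all/executors/demo.txt")
print(lines.count())

spark.stop()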

There are several things to do here:

a. The file is far too small for Spark to do a meaningful word count on. It could be increased substantially.

b. When the file size increases, we will need a persistent volume, something like Hitachi Storage. In this case, of course, object storage would be preferred (see the sketch after this list).

c. The way we have called spark-submit here is utterly wrong; it should be invoked from an application running in a VM or in a container.
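As mentioned in point b, once the input outgrows the pod’s filesystem, reading it from object storage is the natural next step. A rough sketch, assuming an S3-compatible store and that the hadoop-aws connector jars are already on the Spark classpath; the endpoint, bucket, and credentials below are placeholders:

from pyspark.sql import SparkSession

# Sketch only: endpoint, bucket, and keys are placeholders, and the
# hadoop-aws / aws-sdk jars must already be available to Spark.
spark = (SparkSession.builder
         .appName("wordcount-object-storage")
         .config("spark.hadoop.fs.s3a.endpoint", "http://object-store.example:9000")
         .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# Every executor can read the same input, unlike a file created on one pod.
lines = spark.read.text("s3a://demo-bucket/demo.txt")
print(lines.count())
spark.stop()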

The purpose of this post is just to demonstrate the basics; we will discuss best practices for implementing AI/ML use cases elsewhere.
