Spark Cluster Sizing

Ganesh Walavalkar
Sep 24, 2021

Ask

Given a) the data volume, in terms of historical and incremental data, and b) the data velocity, specify the infrastructure in terms of CPU and memory.

Assumptions (or given)

  1. Data in warehouse: ~400TB, per month increment: ~30TB
  2. Daily data processing: ~ 300TB (New: 30TB + Historical: 270TB)
  3. To process about a TB of data, it takes 10 minutes.
    Please note that this is extremely subjective. The "10 minutes" figure is best used as a multiplier: to process 300TB of data each day, a single machine would take ~50 hours (nightly build); if we employ 10 machines, this reduces to 50/10 ~ 5 hours (see the short sketch after this list).
    This also leaves us room for data science related load, input/output from external data sources, and weekend processing, along with good room for error handling and reprocessing if required.
  4. Variety is trivial and will not impact the calculations; veracity of the data is reasonable and will not impact the calculations either
  5. Data is local; not much effort is required to source the data from outside.
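The time arithmetic behind assumptions #2 and #3 can be sketched in a few lines of Python. The figures are taken straight from the assumptions above; the script itself is only illustrative.

```python
# Rough time estimate from the assumptions above (illustrative only).
DAILY_DATA_TB = 300        # new 30TB + historical 270TB
MINUTES_PER_TB = 10        # assumption #3; extremely workload-dependent

single_machine_hours = DAILY_DATA_TB * MINUTES_PER_TB / 60   # 50 hours
machines = 10
cluster_hours = single_machine_hours / machines              # 5 hours

print(f"1 machine: ~{single_machine_hours:.0f} h, {machines} machines: ~{cluster_hours:.0f} h")
```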

Expectation

Please provide infrastructure specifications in terms of CPU and Memory. Please specify the final cost of the cluster.

Response

Calculations — I (Starting with business requirements)

  1. Spark executors per node (each consuming a core) — 5 cores, plus an additional core per worker node for OS & execution management = 6 cores total (see note #2 under Additional Notes in the Appendix)
  2. Spark executor memory per node — 24GB (per executor) * 5 executors per node = 120GB. Add 5GB (typically 2GB) per node for OS & execution management = 125GB total
  3. Say it takes 10 to 15 mins for each "core & 25GB" combination to process a TB of reasonably complex data (this is extremely subjective, as already stated under assumption #3 above; however, please consider this a starting point to continue our discussion)
  4. To process 300 TB of data — 300TB*15 mins = 4500 mins or 75 hours of processing is required.
  5. To complete the nightly processing within 6 to 7 hours, 12 servers are required. Since there is a reasonable buffer, the cluster could be started with 10 servers, each with 12C/24T and 256GB RAM.
  6. Required disk space
    Existing: 400TB
    Incremental: 360TB (30TB *12)
    Total: ~800TB
  7. From #5 above, if the cluster is started with 10 servers, then 80TB per server can be provisioned for the 800TB of total disk space; the arithmetic for this section is collected in the sketch after the notes below.

Notes:

  1. With a YARN multiplier of 0.5, the number of cores required per server is 12 (6 cores from calculation #1 * 2)
  2. With a YARN multiplier of 0.5, the memory required per server is 256GB (125GB per node from calculation #2 * 2)
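The whole of Calculation I can be reproduced with a short Python sketch. The numbers below mirror the list and notes above (15 min per TB per "core & 25GB" combination, 0.5 YARN multiplier, 24GB per executor core); nothing here is new, it simply collects the arithmetic in one place.

```python
# Calculation I in code: from the nightly window back to server count,
# per-server resources, and disk (all figures from the list and notes above).
MINUTES_PER_TB = 15          # upper end of the 10-15 min "per core & 25GB" estimate
DAILY_DATA_TB = 300
NIGHTLY_WINDOW_HOURS = 6.5   # target of 6 to 7 hours

processing_hours = DAILY_DATA_TB * MINUTES_PER_TB / 60        # 75 hours
servers_needed = processing_hours / NIGHTLY_WINDOW_HOURS      # ~11.5 -> 12 servers

# Per-server resources, doubled for the 0.5 YARN multiplier (see Notes)
spark_cores = 5                                   # one core per executor
cores_per_server = (spark_cores + 1) * 2          # +1 for OS/mgmt -> 12 cores (12C/24T)
mem_per_server_gb = (24 * spark_cores + 5) * 2    # 125GB * 2 -> 250GB, provision 256GB

# Disk: existing 400TB plus twelve 30TB monthly increments, spread over 10 servers
total_disk_tb = 400 + 30 * 12                     # 760TB, rounded up to ~800TB
disk_per_server_tb = 800 / 10                     # 80TB per server

print(servers_needed, cores_per_server, mem_per_server_gb, disk_per_server_tb)
```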

Based on the business requirements, we calculated above what the cluster sizing should be in terms of CPU and memory.

This same calculation can also be done starting from the number of servers. For example, we can start with a 10-server cluster and find out how much data it can process.

The problem with starting from servers is not knowing a good starting point for the calculation. For example, to process 300TB of data, should we start at 20 servers or 30 servers? This approach can lead to multiple iterations.

To verify the calculations done above, the next section starts the sizing from a fixed number of servers with specific parameters.

Calculations — II (Starting with nodes)

  1. Number of nodes — 10
  2. Cores per node — 12
    Number of cores available for Spark application at 0.5 YARN multiplier — 6
    Reduce a single core for management+OS, remaining cores — 5
  3. Memory per node — 256GB
    Memory available for Spark application at 0.5 YARN multiplier — 128GB
    Reduce 8GB (on the higher side, however easy for calculation) for management+OS; remaining memory — 120GB, i.e. (120/5) 24GB per core
  4. Total available cores for the cluster — 50 (5*10) * 0.9 = 45 (consider a 0.1 efficiency loss)
  5. Total available memory for the cluster — 1.2TB (120GB*10) * 0.9 = 1.08TB (consider a 0.1 efficiency loss); both figures are reproduced in the sketch below
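A minimal sketch of the per-node and cluster-level availability above, using the same inputs (10 nodes, 12 cores, 256GB per node, 0.5 YARN multiplier, 10% efficiency loss):

```python
# Calculation II in code: usable resources per node and for the 10-node cluster.
nodes = 10
cores_per_node = 12
mem_per_node_gb = 256
yarn_multiplier = 0.5     # only half the raw capacity is assumed available to YARN
efficiency = 0.9          # 10% loss for scheduling overhead, stragglers, etc.

spark_cores = cores_per_node * yarn_multiplier - 1      # 6 - 1 for OS/mgmt = 5
spark_mem_gb = mem_per_node_gb * yarn_multiplier - 8    # 128 - 8 = 120
mem_per_core_gb = spark_mem_gb / spark_cores            # 24GB per core

cluster_cores = spark_cores * nodes * efficiency              # 45
cluster_mem_tb = spark_mem_gb * nodes * efficiency / 1000     # ~1.08TB

print(spark_cores, mem_per_core_gb, cluster_cores, cluster_mem_tb)
```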

If you consider 15 mins to process 1TB of data per core and 25GB (a reasonable assumption), then this cluster can process 4TB per node per hour, that is, 40TB per hour for the entire cluster. To process 300TB it would take approximately 8 hours, which ties back to Calculation I.
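The same check in code, using the working figure of 4TB per node per hour from the paragraph above:

```python
# Throughput check: ties Calculation II back to Calculation I.
MINUTES_PER_TB = 15
tb_per_node_per_hour = 60 / MINUTES_PER_TB        # 4TB/node/hour (the figure used above)
cluster_tb_per_hour = tb_per_node_per_hour * 10   # 40TB/hour for 10 nodes
hours_for_300_tb = 300 / cluster_tb_per_hour      # 7.5, i.e. roughly 8 hours
print(hours_for_300_tb)
```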

Putting it together

  1. 10 servers are recommended with a 12C/24T, 256GB configuration; for example, consider this combination
    PowerEdge R540 Rack Server
    Intel® Xeon® Silver 4214R 2.4G, 12C/24T, 9.6GT/s, 16.5M Cache, Turbo, HT (100W) DDR4–2400
    64GB RDIMM, 3200MT/s, Dual Rank
    This is available for ~ $11K.
  2. For storage, separate disk storage servers such as the "PowerVault MD1420" can be considered. However, for this example, five "16TB 7.2K SATA 6Gbps 512e 3.5in Hot-Plug Hard Drive" disks per node are recommended, which gives 800TB across the cluster, close to a petabyte.
    The incremental cost is ~ $5K per server
  3. Total cost per server = $16K, cost of 10 servers = $160K
    Depending on the assumptions and the business case there will be significant deviations from the above cost; however, this can serve as a framework for making an informed decision (see the cost recap sketch after this list).
  4. The cluster can be arranged as shown in the cluster layout diagram.
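To recap the storage and cost arithmetic from items 1 to 3 above, here is a minimal sketch; the prices are the rough list prices quoted there, not firm quotes.

```python
# Storage and cost recap for items 1 to 3 (rough list prices, not quotes).
nodes = 10
server_cost = 11_000        # PowerEdge R540 example configuration
disks_per_node = 5
disk_size_tb = 16
disks_cost = 5_000          # five 16TB SATA drives per node

raw_storage_tb = nodes * disks_per_node * disk_size_tb    # 800TB
total_cost = nodes * (server_cost + disks_cost)           # ~$160K

print(f"{raw_storage_tb}TB raw storage, ~${total_cost:,}")
```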

Appendix

Streaming and Machine Learning

  1. For streaming, the data disk size could be smaller; however, memory should be on the higher side. The calculation can be driven by how many topics are expected, how many subscribers/publishers are expected, and the volume and velocity of the messages (see the illustrative sketch after this list).
  2. For advanced analytics, aka machine learning, high-memory machines along with GPUs such as the NVIDIA Tesla T4 can be used. Please note that this technology is evolving rapidly.
    Pre-built machine learning clusters are available as well.
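As an illustration of how such a streaming estimate might be set up, here is a back-of-the-envelope sketch. Every input below (topic count, message rate, message size, retention) is a made-up placeholder, not a figure from this article.

```python
# Hypothetical streaming sizing sketch; every input below is a placeholder.
topics = 20
messages_per_sec_per_topic = 5_000
avg_message_kb = 2
retention_hours = 24

ingest_mb_per_sec = topics * messages_per_sec_per_topic * avg_message_kb / 1024
retained_tb = ingest_mb_per_sec * 3600 * retention_hours / 1024 / 1024

print(f"~{ingest_mb_per_sec:.0f} MB/s ingest, ~{retained_tb:.1f} TB retained over {retention_hours} h")
```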

Additional Notes

  1. Rack, KVM switches, power strips, cooling etc. are not considered
    Rack: APC NetShelter SX — Rack — black — 42U — 19-inch
    KVM switch: 8-Port NetDirector™ Rack Console KVM
    Power: 1.8kW Single-Phase 120V Basic PDU
    Rack cooling: ASR-9010-FAN-V2 CISCO ASR 9010 FAN TRAY
  2. Why 6 cores? Why 12C/24T? Why not buy 16C/32T CPUs?
    Currently Spark works best with 8–16 cores. 12C/24T is right at the center.
  3. Why use < 8 disks per node?
    We recommend having 4–8 disks per node, configured without RAID (just as separate mount points)
  4. Business applications: ~25 (±5), ranging from medium (5 Spark applications) to high (13 Spark applications) complexity, can be accommodated in this cluster. These numbers are based on observations.
  5. YARN utilization is considered at 0.5 or 50%
  6. There is a well written article about which global parameters should be set: Calculate your Spark application settings
    After going through the above calculations, I came up with the following values:
    Parameter                              Value
    spark.executor.instances               59
    spark.yarn.executor.memoryOverhead     4096
    spark.executor.memory                  24G
    spark.yarn.driver.memoryOverhead       4096
    spark.driver.memory                    24G
    spark.executor.cores                   5
    spark.driver.cores                     2
    These numbers are in line with the calculations in the previous discussion; see the configuration sketch after this list for one way to apply them.
  7. There are various Spark deployment modes, as listed here: https://spark.apache.org/docs/latest/cluster-overview.html; for this discussion, YARN cluster mode is selected.
  8. Cluster mode — Throughout the discussion, the components shown in the cluster mode overview diagram below are considered
Cluster mode overview (source)
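For completeness, here is one way the parameter values from note #6 could be applied in a PySpark application. This is only a sketch: in YARN cluster mode the driver settings are normally supplied via spark-submit --conf rather than set in code, and the application name is hypothetical.

```python
# Applying the values from note #6 (taken verbatim from the table above).
# In YARN cluster mode, driver settings are normally passed via
# spark-submit --conf instead of being set here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nightly-batch")   # hypothetical application name
    .config("spark.executor.instances", "59")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "24g")
    .config("spark.yarn.executor.memoryOverhead", "4096")
    .config("spark.driver.cores", "2")
    .config("spark.driver.memory", "24g")
    .config("spark.yarn.driver.memoryOverhead", "4096")
    .getOrCreate()
)
```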
