How to install Apache Spark on Rocky Linux 9

In this Linux tutorial, you will learn how to install Apache Spark on Rocky Linux 9 or other Red Hat based Linux distributions.

What is Apache Spark?:

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Apache Spark has its architectural foundation in the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. The DataFrame API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged, even though the RDD API is not deprecated. The RDD technology still underlies the Dataset API.

Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark’s RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.

Inside Apache Spark the workflow is managed as a directed acyclic graph (DAG). Nodes represent RDDs while edges represent the operations on the RDDs. (Source: Wikipedia)

Environment Specification:

We are using a minimal Rocky Linux 9 virtual machine with the following specifications.

  • CPU – 3.4 GHz (2 cores)
  • Memory – 2 GB
  • Storage – 20 GB
  • Operating System – Rocky Linux release 9.1 (Blue Onyx)
  • Hostname – spark-01.centlinux-com.preview-domain.com
  • IP Address – 192.168.88.136/24

Update your Rocky Linux Server:

Using an SSH client, log in to your Rocky Linux server as the root user.

Set a fully qualified domain name (FQDN) and local name resolution for your Linux machine. In /etc/hosts, the FQDN conventionally comes first after the IP address, followed by any short aliases.

# hostnamectl set-hostname spark-01.centlinux-com.preview-domain.com
# echo "192.168.88.136 spark-01.centlinux-com.preview-domain.com spark-01" >> /etc/hosts
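Because the echo command above appends a new line every time it runs, repeating the step leaves duplicate entries in /etc/hosts. The following is a minimal sketch of an idempotent alternative; the function name add_hosts_entry is hypothetical and not part of any standard tool.

```shell
# Append "IP FQDN SHORTNAME" to a hosts file only if the FQDN is not listed yet.
# Arguments: hosts file, IP address, FQDN, short name.
add_hosts_entry() {
    local hosts_file="$1" ip="$2" fqdn="$3" short="$4"
    # -F: fixed string, -w: whole-word match, -q: no output
    if grep -qFw -- "$fqdn" "$hosts_file" 2>/dev/null; then
        echo "entry for $fqdn already present in $hosts_file"
    else
        printf '%s %s %s\n' "$ip" "$fqdn" "$short" >> "$hosts_file"
        echo "added $fqdn to $hosts_file"
    fi
}
```

As root, it could be called as add_hosts_entry /etc/hosts 192.168.88.136 spark-01.centlinux-com.preview-domain.com spark-01.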

Execute the following commands to refresh your DNF cache and update the software packages on your Rocky Linux server.

# dnf makecache
# dnf update -y

If the above commands update your Linux kernel, reboot your Linux operating system with the new kernel before installing Apache Spark.

# reboot

After reboot, check the Linux operating system and Kernel versions.

# uname -r
5.14.0-162.18.1.el9_1.x86_64

# cat /etc/rocky-release
Rocky Linux release 9.1 (Blue Onyx)

Install Apache Spark Prerequisites:

Apache Spark is written in the Scala programming language, and Scala runs on the Java Virtual Machine (JVM), so Spark requires a Java runtime. The pre-built Spark package used in this tutorial bundles its own Scala libraries; a separate Scala installation, covered in the next section, is mainly useful for developing Spark applications.

You also need a few other packages to download and extract the Scala and Apache Spark software.

You can install all of these packages with a single dnf command.

# dnf install -y wget gzip tar java-17-openjdk

After installation, verify the active Java version.

# java --version
openjdk 17.0.6 2023-01-17 LTS
OpenJDK Runtime Environment (Red_Hat-17.0.6.0.10-3.el9_1) (build 17.0.6+10-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-17.0.6.0.10-3.el9_1) (build 17.0.6+10-LTS, mixed mode, sharing)
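The Spark 3.3 documentation lists Java 8, 11, and 17 as supported, so it can be worth checking the major version in a script before going further. The sketch below assumes the Java 9+ output format shown above (Java 8 does not accept --version); both function names are hypothetical.

```shell
# Print the Java major version from the first line of `java --version`
# output read on stdin, e.g. "openjdk 17.0.6 2023-01-17 LTS".
java_major_version() {
    awk 'NR == 1 { split($2, v, "."); print v[1] }'
}

# Succeed only for the Java releases listed as supported by Spark 3.3.
spark_supports_java() {
    case "$1" in
        8|11|17) return 0 ;;
        *) return 1 ;;
    esac
}
```

A possible use is: major=$(java --version | java_major_version); spark_supports_java "$major" || echo "Java $major is not listed for Spark 3.3".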

Install Scala Programming Language:

We have already written a complete tutorial on installing Scala on Rocky Linux 9, but we repeat the essential steps here for the sake of completeness.

Download the Coursier setup file by executing the following wget command.

# wget https://github.com/coursier/launchers/raw/master/cs-x86_64-pc-linux.gz
--2023-03-07 21:38:25--  https://github.com/coursier/launchers/raw/master/cs-x86_64-pc-linux.gz
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/coursier/launchers/master/cs-x86_64-pc-linux.gz [following]
--2023-03-07 21:38:26--  https://raw.githubusercontent.com/coursier/launchers/master/cs-x86_64-pc-linux.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20759374 (20M) [application/octet-stream]
Saving to: ‘cs-x86_64-pc-linux.gz’

cs-x86_64-pc-linux. 100%[===================>]  19.80M   508KB/s    in 40s

2023-03-07 21:39:08 (502 KB/s) - ‘cs-x86_64-pc-linux.gz’ saved [20759374/20759374]

Unzip the downloaded Coursier setup file using the gunzip command.

# gunzip cs-x86_64-pc-linux.gz

Rename the extracted file to cs for convenience and grant execute permission on it.

# mv cs-x86_64-pc-linux cs
# chmod +x cs

Execute the Coursier setup file to start the installation of the Scala programming language.

# ./cs setup
Checking if a JVM is installed
Found a JVM installed under /usr/lib/jvm/java-17-openjdk-17.0.6.0.10-3.el9_1.x86_64.

Checking if ~/.local/share/coursier/bin is in PATH
  Should we add ~/.local/share/coursier/bin to your PATH via ~/.profile, ~/.bash_profile? [Y/n] Y

Checking if the standard Scala applications are installed
  Installed ammonite
  Installed cs
  Installed coursier
  Installed scala
  Installed scalac
  Installed scala-cli
  Installed sbt
  Installed sbtn
  Installed scalafmt

Source ~/.bash_profile once to set up the environment for your current session.

# source ~/.bash_profile

Check the version of Scala software.

# scala -version
Scala code runner version 3.2.2 -- Copyright 2002-2023, LAMP/EPFL

Install Apache Spark on Linux:

Apache Spark is free, open-source software and is available for download from its official website.

Copy the download link for Apache Spark and use it with the wget command to download this open-source analytics engine.

# wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
--2023-03-07 22:00:35--  https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 299360284 (285M) [application/x-gzip]
Saving to: ‘spark-3.3.2-bin-hadoop3.tgz’

spark-3.3.2-bin-had 100%[===================>] 285.49M   898KB/s    in 4m 28s

2023-03-07 22:05:04 (1.06 MB/s) - ‘spark-3.3.2-bin-hadoop3.tgz’ saved [299360284/299360284]
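Apache publishes a SHA-512 checksum alongside each release archive, and comparing it with the downloaded file guards against a corrupted or tampered download. The sketch below assumes you have copied the expected hex digest from the official download page; verify_sha512 is a hypothetical helper name.

```shell
# Compare a file's SHA-512 digest with an expected hex string.
# The expected value may be upper- or lower-case and may contain whitespace.
verify_sha512() {
    local file="$1" expected="$2" actual
    actual=$(sha512sum "$file" | awk '{ print $1 }')
    # Normalise the expected digest: strip whitespace, lower-case hex letters.
    expected=$(printf '%s' "$expected" | tr -d '[:space:]' | tr 'A-F' 'a-f')
    if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}
```

For example, verify_sha512 spark-3.3.2-bin-hadoop3.tgz "<digest copied from the download page>" should print OK before you extract the archive.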

Use the tar command to extract Apache Spark, and then use the mv command to move it to the /opt directory.

# tar xf spark-3.3.2-bin-hadoop3.tgz
# mv spark-3.3.2-bin-hadoop3 /opt/spark

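Create a file in the /etc/profile.d directory to set up the environment for Apache Spark at session startup. Use single quotes so that $PATH is written literally and expanded at login, rather than being replaced with the current root PATH when the file is created.

# echo 'export SPARK_HOME=/opt/spark' >> /etc/profile.d/spark.sh
# echo 'export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin' >> /etc/profile.d/spark.sh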
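Whether $PATH is expanded when the file is written or when it is read depends on shell quoting: inside double quotes the shell substitutes the current value immediately, while single quotes or a quoted here-document delimiter keep it literal. The following sketch writes the whole file in one step with a quoted here-document; write_spark_profile is a hypothetical function name.

```shell
# Write the Spark environment file. The quoted 'EOF' delimiter keeps
# $PATH and $SPARK_HOME literal so they are expanded at login time.
write_spark_profile() {
    local target="${1:-/etc/profile.d/spark.sh}"
    cat > "$target" <<'EOF'
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
EOF
}
```

Using > instead of >> also means that running the step twice overwrites the file rather than appending duplicate lines.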

Create a user spark and grant ownership of Apache Spark software to this user.

# useradd spark
# chown -R spark:spark /opt/spark

Configure Linux Firewall:

Apache Spark uses a master-worker architecture (workers were called slaves in older Spark releases, and this tutorial keeps that name for its service files). The Spark master distributes tasks among Spark worker services, which can run on the same node or on other Apache Spark nodes.

Allow the default service ports of the Apache Spark master and Apache Spark worker nodes in the Linux firewall.

# firewall-cmd --permanent --add-port=6066/tcp
# firewall-cmd --permanent --add-port=7077/tcp
# firewall-cmd --permanent --add-port=8080-8081/tcp
# firewall-cmd --reload
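By default, port 7077 is the master's cluster port, 6066 is its REST submission port, and 8080 and 8081 are the master and worker web interfaces. The same firewall rules can be applied in a loop, as in this sketch; open_spark_ports and the FIREWALL_CMD override are hypothetical names introduced here so that the commands can be previewed without changing the firewall.

```shell
# Open each given port permanently, then reload the firewall.
# Set FIREWALL_CMD=echo to print the arguments instead of applying them.
open_spark_ports() {
    local fw="${FIREWALL_CMD:-firewall-cmd}" port
    for port in "$@"; do
        $fw --permanent --add-port="$port" || return 1
    done
    $fw --reload
}
```

As root, open_spark_ports 6066/tcp 7077/tcp 8080-8081/tcp applies the rules, and firewall-cmd --list-ports then shows which ports are open.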

Create Systemd Services:

Create a systemd service for the Spark Master using the vi text editor.

# vi /etc/systemd/system/spark-master.service

Add the following directives to this file.

[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/bin/bash /opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh

[Install]
WantedBy=multi-user.target

Enable and start Spark Master service.

# systemctl enable --now spark-master.service
Created symlink /etc/systemd/system/multi-user.target.wants/spark-master.service → /etc/systemd/system/spark-master.service.

Check the status for Spark Master service.

# systemctl status spark-master.service
● spark-master.service - Apache Spark Master
     Loaded: loaded (/etc/systemd/system/spark-master.service; enabled; vendor >
     Active: active (running) since Tue 2023-03-07 22:13:13 PKT; 4min 13s ago
   Main PID: 6856 (java)
      Tasks: 29 (limit: 10904)
     Memory: 175.8M
        CPU: 5.490s
     CGroup: /system.slice/spark-master.service
             └─6856 /usr/lib/jvm/java-17-openjdk-17.0.6.0.10-3.el9_1.x86_64/bin>

Mar 07 22:13:11 spark-01.centlinux-com.preview-domain.com systemd[1]: Starting Apache Spark Master>
Mar 07 22:13:11 spark-01.centlinux-com.preview-domain.com bash[6850]: starting org.apache.spark.de>
Mar 07 22:13:13 spark-01.centlinux-com.preview-domain.com systemd[1]: Started Apache Spark Master.

Create a systemd service for the Spark Slave (worker) using the vi text editor.

# vi /etc/systemd/system/spark-slave.service

Add the following directives to this file. The master URL must point to the host where the Spark Master is running.

[Unit]
Description=Apache Spark Slave
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/bin/bash /opt/spark/sbin/start-slave.sh spark://192.168.88.136:7077
ExecStop=/bin/bash /opt/spark/sbin/stop-slave.sh

[Install]
WantedBy=multi-user.target

Enable and start Spark Slave service.

# systemctl enable --now spark-slave.service
Created symlink /etc/systemd/system/multi-user.target.wants/spark-slave.service → /etc/systemd/system/spark-slave.service.

Check the status of the Spark Slave service.

# systemctl status spark-slave.service
● spark-slave.service - Apache Spark Slave
     Loaded: loaded (/etc/systemd/system/spark-slave.service; enabled; vendor p>
     Active: active (running) since Tue 2023-03-07 22:16:33 PKT; 34s ago
    Process: 6937 ExecStart=/bin/bash /opt/spark/sbin/start-slave.sh spark://19>
   Main PID: 6950 (java)
      Tasks: 33 (limit: 10904)
     Memory: 121.0M
        CPU: 5.022s
     CGroup: /system.slice/spark-slave.service
             └─6950 /usr/lib/jvm/java-17-openjdk-17.0.6.0.10-3.el9_1.x86_64/bin>

Mar 07 22:16:30 spark-01.centlinux-com.preview-domain.com systemd[1]: Starting Apache Spark Slave.>
Mar 07 22:16:30 spark-01.centlinux-com.preview-domain.com bash[6937]: This script is deprecated, u>
Mar 07 22:16:31 spark-01.centlinux-com.preview-domain.com bash[6944]: starting org.apache.spark.de>
Mar 07 22:16:33 spark-01.centlinux-com.preview-domain.com systemd[1]: Started Apache Spark Slave.
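The "This script is deprecated" line in the journal appears because start-slave.sh has been superseded by start-worker.sh since Spark 3.1. A unit file based on the newer scripts might look like the sketch below; the service name spark-worker.service, the function write_worker_unit, and the extra ordering after spark-master.service are assumptions made for illustration, not part of the original setup.

```shell
# Write a systemd unit that starts a Spark worker with start-worker.sh.
# Arguments: target unit file, master URL (e.g. spark://192.168.88.136:7077).
write_worker_unit() {
    local target="$1" master_url="$2"
    cat > "$target" <<EOF
[Unit]
Description=Apache Spark Worker
After=network.target spark-master.service

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/bin/bash /opt/spark/sbin/start-worker.sh ${master_url}
ExecStop=/bin/bash /opt/spark/sbin/stop-worker.sh

[Install]
WantedBy=multi-user.target
EOF
}
```

After writing the file, for example with write_worker_unit /etc/systemd/system/spark-worker.service spark://192.168.88.136:7077, run systemctl daemon-reload before enabling the new service, and disable spark-slave.service so that only one worker service runs on the node.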

Access Apache Spark Server:

To access the Spark Master dashboard, open the URL http://spark-01.centlinux-com.preview-domain.com:8080 in a web browser.

[Screenshot: Spark Master dashboard]

Similarly, you can access the Spark Slave (worker) dashboard by opening the URL http://spark-01.centlinux-com.preview-domain.com:8081 in a web browser.

[Screenshot: Spark Slave dashboard]
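Beyond the dashboards, a quick command-line check is to confirm that the default ports accept connections and then submit the SparkPi example that ships with Spark. This is a sketch under a few assumptions: it relies on bash's /dev/tcp feature and the coreutils timeout command, and the function names port_open and spark_smoke_test are hypothetical.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
port_open() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Report the state of each default port, then run the bundled SparkPi
# example against the standalone master on the given host.
spark_smoke_test() {
    local host="$1" port
    for port in 7077 8080 8081; do
        if port_open "$host" "$port"; then
            echo "port $port: open"
        else
            echo "port $port: closed"
        fi
    done
    /opt/spark/bin/run-example --master "spark://$host:7077" SparkPi 10
}
```

Running spark_smoke_test 192.168.88.136 as the spark user should end with a line reporting an approximation of Pi, and the finished application then appears on the Spark Master dashboard.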

Conclusion – Install Apache Spark on Linux:

In this Linux tutorial, you have learned how to install Apache Spark on Rocky Linux 9 or other Red Hat based Linux distributions.