
Installing and Configuring Apache Spark on Ubuntu / Debian

Apache Spark is an open-source distributed computing framework designed to deliver faster computational results. It is an in-memory computing engine, i.e. the data is processed in memory.

Spark supports various APIs for streaming, graph processing, SQL, and machine learning (MLlib). It also supports Java, Python, Scala, and R as its primary languages. Spark is mostly installed in Hadoop clusters, but you can also install and configure Spark in standalone mode.

In this article, we will see how to install Apache Spark on Debian and Ubuntu-based distributions.

Install Java and Scala on Ubuntu

To install Apache Spark on Ubuntu, you need to have Java and Scala installed on your machine. Most modern distributions come with Java installed by default, and you can verify it using the following command.

$ java -version
Check the Java version in Ubuntu

If there is no output, you can install Java using our article on how to install Java on Ubuntu, or simply run the following commands to install Java on Ubuntu and Debian-based distributions.

$ sudo apt update
$ sudo apt install default-jre
$ java -version
Install Java on Ubuntu


Next, you can install Scala from the apt repository by running the following commands to search for and install the scala package.

$ sudo apt search scala  ⇒ Search for the package
$ sudo apt install scala ⇒ Install the package
Install Scala on Ubuntu

To verify the Scala installation, run the following command.

$ scala -version 

Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
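
If you want a quick sanity check beyond the version string, you can evaluate a one-line Scala expression directly with the scala runner (the printed message below is just an illustration):

$ scala -e 'println("Scala is working")'

Scala is working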

Install Apache Spark on Ubuntu

Now, go to the official Apache Spark download page and grab the latest version (i.e. 3.1.1 at the time of writing this article). Alternatively, you can use the wget command to download the file directly in the terminal.

$ wget https://apachemirror.wuchna.com/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
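
Optionally, you can verify the integrity of the download before extracting it. Apache publishes a SHA-512 checksum for each release on the download page (older releases are kept on archive.apache.org); compute the hash locally and compare it against the published value:

$ sha512sum spark-3.1.1-bin-hadoop2.7.tgz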

Now, open your terminal, switch to the directory where the downloaded file is placed, and run the following command to extract the Apache Spark tar file.

$ tar -xvzf spark-3.1.1-bin-hadoop2.7.tgz

Finally, move the extracted Spark directory to the /opt directory.

$ sudo mv spark-3.1.1-bin-hadoop2.7 /opt/spark

Set environment variables for Spark

Now you need to set a few environment variables in your ~/.profile file before starting up Spark.

$ echo "export SPARK_HOME=/opt/spark" >> ~/.profile
$ echo 'export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin' >> ~/.profile
$ echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile

To make sure these new environment variables are available in the current shell and to Apache Spark, run the following command so the latest changes take effect.

$ source ~/.profile
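
You can quickly confirm that the variables were picked up; which should now resolve spark-shell from under /opt/spark:

$ echo $SPARK_HOME
/opt/spark
$ which spark-shell
/opt/spark/bin/spark-shell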

All the Spark-related scripts to start and stop the services are located under the sbin folder, which you can see by listing the Spark directory.

$ ls -l /opt/spark
Spark Binaries
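
To see the start/stop scripts themselves, list the sbin directory (the exact set of scripts varies slightly between Spark versions):

$ ls /opt/spark/sbin

start-all.sh  start-master.sh  start-worker.sh  stop-all.sh  stop-master.sh  stop-worker.sh  ...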

Launch Apache Spark in Ubuntu

Run the following commands to start the Spark master service and worker service.

$ start-master.sh
$ start-worker.sh spark://localhost:7077
Start the Spark Service
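
Before opening the web UI, you can verify from the terminal that both daemons are actually running. The master and worker are ordinary JVM processes, so they show up in the process list (jps from a full JDK works as well):

$ ps -ef | grep org.apache.spark.deploy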

Once the services are started, go to your browser and type the following URL to access the Spark page. From the page, you can see that the master and worker services are started.

http://localhost:8080/
OR
http://127.0.0.1:8080
Spark website
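
On a headless server, a quick curl from the terminal confirms that the UI is responding (8080 is the default port for the standalone master; it can differ if another service already occupies it):

$ curl -sI http://localhost:8080 | head -n 1

HTTP/1.1 200 OK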

You can also check whether spark-shell works fine by launching the spark-shell command.

$ spark-shell
Spark shell
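
Inside the shell, you can run a short Scala expression to confirm that Spark actually executes work; summing the numbers 1 to 100 through an RDD should return 5050 (res0 is simply the shell's automatic result name):

scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0

scala> :quit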

That's it for this article. We will be back with another interesting article very soon.
