
Installing and Configuring Apache Spark on Ubuntu / Debian

Apache Spark is an open-source distributed computing framework designed to deliver faster computational results. It is an in-memory computing engine, i.e. the data is processed in memory.

Spark supports various APIs for streaming, graph processing, SQL, and machine learning (MLlib). It also supports Java, Python, Scala, and R as its primary languages. Spark is mostly installed in Hadoop clusters, but you can also install and configure Spark in standalone mode.

In this article, we will see how to install Apache Spark on Debian and Ubuntu-based distributions.

Install Java and Scala on Ubuntu

To install Apache Spark on Ubuntu, you need to have Java and Scala installed on your machine. Most modern distributions come with Java installed by default, and you can verify it using the following command.

$ java -version
Check the Java version in Ubuntu

If there is no output, you can install Java using our article on how to install Java on Ubuntu, or simply run the following commands to install Java on Ubuntu and Debian-based distributions.

$ sudo apt update
$ sudo apt install default-jre
$ java -version
Install Java on Ubuntu


Next, you can install Scala from the apt repository by running the following commands to search for and install the scala package.

$ sudo apt search scala  ⇒ Search for the package
$ sudo apt install scala ⇒ Install the package
Install Scala on Ubuntu

To verify the Scala installation, run the following command.

$ scala -version 

Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
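
If you want a quick sanity check beyond the version string, you can evaluate a one-line Scala expression directly with the scala runner (the printed message below is just an illustration):

$ scala -e 'println("Scala is working")'

Scala is working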

Install Apache Spark on Ubuntu

Now, go to the official Apache Spark download page and grab the latest version (i.e. 3.1.1 at the time of writing this article). Alternatively, you can use the wget command to download the file directly in the terminal.

$ wget https://apachemirror.wuchna.com/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
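
Optionally, you can verify the integrity of the download before extracting it. Apache publishes a SHA-512 checksum for each release on the download page (older releases are kept on archive.apache.org); compute the hash locally and compare it against the published value:

$ sha512sum spark-3.1.1-bin-hadoop2.7.tgz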

Now, open your terminal, switch to the directory where the downloaded file is placed, and run the following command to extract the Apache Spark tar file.

$ tar -xvzf spark-3.1.1-bin-hadoop2.7.tgz

Finally, move the extracted Spark directory to the /opt directory.

$ sudo mv spark-3.1.1-bin-hadoop2.7 /opt/spark

Set environment variables for Spark

Now you need to set a few environment variables in your ~/.profile file before starting up Spark.

$ echo "export SPARK_HOME=/opt/spark" >> ~/.profile
$ echo 'export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin' >> ~/.profile
$ echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile

To make sure these new environment variables are available in the current shell and to Apache Spark, run the following command so the latest changes take effect.

$ source ~/.profile
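
You can quickly confirm that the variables were picked up; which should now resolve spark-shell from under /opt/spark:

$ echo $SPARK_HOME
/opt/spark
$ which spark-shell
/opt/spark/bin/spark-shell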

All the Spark-related scripts to start and stop the services are located under the sbin folder, which you can see by listing the Spark directory.

$ ls -l /opt/spark
Spark Binaries
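
To see the start/stop scripts themselves, list the sbin directory (the exact set of scripts varies slightly between Spark versions):

$ ls /opt/spark/sbin

start-all.sh  start-master.sh  start-worker.sh  stop-all.sh  stop-master.sh  stop-worker.sh  ...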

Launch Apache Spark in Ubuntu

Run the following commands to start the Spark master service and worker service.

$ start-master.sh
$ start-worker.sh spark://localhost:7077
Start the Spark Service
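
Before opening the web UI, you can verify from the terminal that both daemons are actually running. The master and worker are ordinary JVM processes, so they show up in the process list (jps from a full JDK works as well):

$ ps -ef | grep org.apache.spark.deploy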

Once the services are started, go to your browser and type the following URL to access the Spark page. From the page, you can see that the master and worker services are started.

http://localhost:8080/
OR
http://127.0.0.1:8080
Spark website
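
On a headless server, a quick curl from the terminal confirms that the UI is responding (8080 is the default port for the standalone master; it can differ if another service already occupies it):

$ curl -sI http://localhost:8080 | head -n 1

HTTP/1.1 200 OK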

You can also check whether spark-shell works fine by launching the spark-shell command.

$ spark-shell
Spark shell
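
Inside the shell, you can run a short Scala expression to confirm that Spark actually executes work; summing the numbers 1 to 100 through an RDD should return 5050 (res0 is simply the shell's automatic result name):

scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0

scala> :quit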

That's it for this article. We will be back with another interesting article very soon.
