Apache Spark is an open-source distributed computing framework designed to deliver faster computational results. It is an in-memory computational engine, meaning that data is processed in memory.
Spark supports various APIs for streaming, graph processing, SQL, and MLlib. It also supports Java, Python, Scala, and R as primary languages. Spark is mostly installed in Hadoop clusters, but you can also install and configure Spark in standalone mode.
In this article, we will see how to install Apache Spark on Debian- and Ubuntu-based distributions.
Install Java and Scala on Ubuntu
To install Apache Spark on Ubuntu, you need Java and Scala installed on your machine. Most modern distributions ship with Java by default, and you can confirm this with the following command.
$ java -version
If the command prints no version information, you can install Java by following our article on how to install Java on Ubuntu, or run the following commands on Ubuntu and Debian-based distributions.
$ sudo apt update
$ sudo apt install default-jre
$ java -version
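The check-then-install step above can be sketched as a small script. The helper name `have_cmd` is an illustrative choice and not part of any package; the script only reports what it finds and leaves the actual installation to you.

```shell
# have_cmd is a hypothetical helper: it succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd java; then
    # Java is already present, so just show which version is installed.
    java -version
else
    echo "java not found; install it with: sudo apt install default-jre"
fi
```

The same helper can be reused later for `scala` or any other command you want to confirm before continuing.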
Next, install Scala from the apt repository. The following commands search for the package and then install it.
$ sudo apt search scala     # Search for the package
$ sudo apt install scala    # Install the package
To confirm the Scala installation, run the following command.
$ scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
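If you want to use the installed version in a script, the number can be pulled out of the banner shown above. This is a minimal sketch: `scala_version_from` is a hypothetical helper name, and the banner string is simply the sample output from this article.

```shell
# scala_version_from is a hypothetical helper: it extracts "X.Y.Z" from a version banner.
scala_version_from() {
    printf '%s\n' "$1" | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p'
}

# Sample banner copied from the output shown in this article.
banner='Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL'
scala_version_from "$banner"   # prints 2.11.12
```

On a real machine you could pass `"$(scala -version 2>&1)"` instead of the sample string; the `2>&1` is there in case the banner is written to standard error.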
Install Apache Spark on Ubuntu
Now go to the official Apache Spark download page and grab the latest version (3.1.1 at the time of writing). Alternatively, you can use the wget command to download the file directly from the terminal.
$ wget https://apachemirror.wuchna.com/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
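Because the version number appears in the URL, the archive name, and the extracted directory name, it can help to keep it in one place. The sketch below assumes the variable names `SPARK_VERSION`, `HADOOP_VERSION`, `SPARK_DIR`, `SPARK_TGZ`, and `MIRROR`, which are illustrative only; the values are the ones used in this article.

```shell
SPARK_VERSION=3.1.1     # Spark version used in this article
HADOOP_VERSION=2.7      # Hadoop build used in this article
SPARK_DIR="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"   # extracted directory name
SPARK_TGZ="${SPARK_DIR}.tgz"                                     # archive file name
MIRROR="https://apachemirror.wuchna.com/spark"                   # mirror used in this article

# Print the download URL; it matches the wget command above.
echo "${MIRROR}/spark-${SPARK_VERSION}/${SPARK_TGZ}"
```

To move to a different release, you would change only the first two variables and pass the printed URL to wget.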
Now open your terminal, change to the directory where the downloaded file is located, and run the following command to extract the Apache Spark tar file.
$ tar -xvzf spark-3.1.1-bin-hadoop2.7.tgz
Finally, move the extracted Spark directory to the /opt directory.
$ sudo mv spark-3.1.1-bin-hadoop2.7 /opt/spark
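As an alternative sketch, the extract and move steps can be combined by unpacking straight into the target directory and dropping the archive's top-level folder with tar's `--strip-components` option. The helper name `extract_into` is hypothetical.

```shell
# extract_into is a hypothetical helper: it unpacks archive $1 into directory $2,
# removing the archive's single top-level folder from every extracted path.
extract_into() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2" --strip-components=1
}

# Usage (writing to /opt usually needs root, so the same steps are shown with sudo):
# sudo mkdir -p /opt/spark
# sudo tar -xzf spark-3.1.1-bin-hadoop2.7.tgz -C /opt/spark --strip-components=1
```

Either approach should leave the Spark files directly under /opt/spark, which is what the rest of this article expects.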
Set environment variables for Spark
Now you need to set a few environment variables in your .profile file before starting Spark.
$ echo "export SPARK_HOME=/opt/spark" >> ~/.profile
$ echo 'export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin' >> ~/.profile
$ echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile
To make these new environment variables available in the current shell and to Apache Spark, run the following command so that the latest changes take effect.
$ source ~/.profile
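Running the echo commands more than once appends the same lines to .profile again. A minimal sketch of one way to avoid that is shown below; `append_once` is a hypothetical helper that adds a line only when that exact line is not already in the file.

```shell
# append_once is a hypothetical helper: it adds line $1 to file $2
# only if that exact line is not already present.
append_once() {
    touch "$2"
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Usage:
# append_once 'export SPARK_HOME=/opt/spark' ~/.profile
# append_once 'export PYSPARK_PYTHON=/usr/bin/python3' ~/.profile
```

The single quotes in the usage lines keep the text literal, so any variable reference such as $PATH would be written to the file unexpanded.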
All Spark-related scripts to start and stop services are in the sbin folder.
$ ls -l /opt/spark
Launch Apache Spark in Ubuntu
Run the following commands to start the Spark master service and worker service.
$ start-master.sh
$ start-workers.sh spark://localhost:7077
Once the services have started, open your browser and enter the following URL to access the Spark web page. The page shows that the master and worker services have started.
http://localhost:8080/ OR http://127.0.0.1:8080
You can also check that the Spark shell works by launching it with the spark-shell command.
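For a quick non-interactive check, a single Scala expression can be piped into the shell. This is only a sketch: `smoke_test` is a hypothetical helper, `SPARK_SHELL` is an illustrative override variable that Spark itself does not read, and it assumes the shell exits when its input ends.

```shell
# smoke_test is a hypothetical helper: it pipes one Scala expression into the Spark shell.
# SPARK_SHELL is an illustrative override; by default the spark-shell on PATH is used.
smoke_test() {
    printf 'println(21 * 2)\n' | "${SPARK_SHELL:-spark-shell}"
}

# Usage, once Spark is installed and on PATH:
# smoke_test
```

If everything is set up, the number 42 should appear among the shell's startup messages.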
That's it for this article. We will bring you another interesting article very soon.