Installing and using Spark from Jupyter in three steps on Windows 10
In this brief story you will find all the steps required to install Spark from your command prompt and call it from your Jupyter notebook, so you can manipulate your dataframes easily.
I have found many blog posts about this topic, but most of them were incoherent or incomplete.
1- Installing Anaconda
As you may know, to use Jupyter notebook you will need to install Anaconda. When you install Anaconda, you get conda, Anaconda Navigator, Jupyter, Python and many other very useful and powerful scientific packages.
To install Anaconda on:
- Windows : https://docs.anaconda.com/anaconda/install/windows/
- Mac : https://docs.anaconda.com/anaconda/install/mac-os/
- Linux : https://docs.anaconda.com/anaconda/install/linux/
To check the version of your installed Python, just type in your Command Prompt:
python --version
You must know that Pyspark needs a sufficiently recent Python; for Spark 3.x, Python 3.6 or later is recommended. Thus, to upgrade, just create a new environment in Anaconda Prompt and install the required version using the next command:
# on Windows:
conda create --name myenv python=3.6
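Then activate the new environment before using it (myenv is just an example name):
conda activate myenv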
2- Installing Spark on Windows
To use Spark or Pyspark in your Jupyter notebook you will need to install the next three main elements:
2.1- Java
2.2- Apache Spark
2.3- winutils.exe
2.1- Installing Java
Many applications and websites require Java to be installed, otherwise they do not work. Java is fast, secure and reliable. For Spark 3.x, the required Java version is 8 or 11.
To verify whether Java is installed on your computer, you can use one of the next two methods:
* Method 1: Select Start → Control Panel → Add/Remove Programs
If Java is installed, you should find its name there, so check whether Java is listed among the installed programs, like in the picture below.
* Method 2: In your Command Prompt, type the next command to check the installed Java version. However, be sure that you are in the folder that contains java.exe (or that Java is on your PATH), otherwise it will not work, so go to your java.exe folder first using:
cd your java path
java -version # after finding your Java.exe type this
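If Java is present, the output looks roughly like the following (the exact version and build numbers will differ on your machine):
java version "1.8.0_281"
Java(TM) SE Runtime Environment (build 1.8.0_281-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.281-b09, mixed mode)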
If Java is not installed, you will get the following message:
'java' is not recognized as an internal or external command, operable program or batch file.
All you need to do is visit the next site and follow the indicated instructions; it is easy to install: https://www.java.com/it/download/
After installing, close and re-open the Command Prompt to re-check the version (remember, Spark 3.x requires Java 8 or 11).
2.2- Apache Spark
Go to the next site and download Spark pre-built for Hadoop: https://www.apache.org/dyn/closer.lua/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
Here it is not required to run any installer; all you need to do is unzip the files from the downloaded spark-3.0.1-bin-hadoop2.7.tgz archive. You can create a folder called Spark (for example) in your workspace and put all unzipped files there.
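If you prefer the command line, recent Windows 10 builds ship a tar utility that can extract the archive directly (a sketch, assuming the file was saved to your Downloads folder and that the C:\Spark folder already exists):
cd %USERPROFILE%\Downloads
tar -xzf spark-3.0.1-bin-hadoop2.7.tgz -C C:\Spark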
The folder path will then look like C:\Users\your_name\Desktop\Spark\spark-3.0.1-bin-hadoop2.7, or you can create this Spark folder directly under C:\, in which case your path would be C:\Spark\spark-3.0.1-bin-hadoop2.7
I would recommend copying this folder to wherever your Pyspark projects live, and make sure that the folder path and name do not contain spaces.
Now you need to add Spark_Home to your system environment variables:
1- Right click on Computer and choose Properties
2- Choose the Advanced system settings then click on Environment Variables.
3- Add the variable Spark_Home under the System Variables section:
3.1- Click New and in the name field type Spark_Home
3.2- In the value field type: C:\Spark\spark-3.0.1-bin-hadoop2.7 or C:\Users\your_name\Desktop\Spark\spark-3.0.1-bin-hadoop2.7 (depending on your path). Alternatively, you can set it from the Command Prompt, as sketched below.
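A minimal sketch with the setx command (assuming Spark was unzipped under C:\Spark; setx writes a user-level variable, and you must open a new Command Prompt before it becomes visible):
setx Spark_Home "C:\Spark\spark-3.0.1-bin-hadoop2.7"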
To verify if Pyspark is installed successfully, open Anaconda Prompt and type:
cd %Spark_Home%\bin and then type pyspark; you should get a message like the one below (in the picture)
Here our Spark version is 3.0.1; to exit the shell, type exit()
2.3- Installing winutils.exe
In order to use Spark on Windows you need winutils.exe, and installing it is the final step of this guide. You need to download the winutils.exe that matches the Hadoop version your installed Spark was built for; you can download it from the next link: https://github.com/steveloughran/winutils.
You must download the version of Hadoop that your Spark was pre-built for. How? In this guide the Hadoop version was hadoop2.7, so you need to download the Hadoop folder from the next link: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1. The winutils.exe file lives under the hadoop-2.7.1\bin folder.
The second part of this step consists of configuring the Spark installation so that it can find the winutils.exe file: you only need to copy the hadoop-2.7.1 folder (it is better to rename it to just hadoop, dropping the -2.7.1) from the Downloads folder to the Spark_Home.
Another simple method is to create a hadoop\bin folder inside Spark_Home and then copy and paste the winutils.exe file there, as sketched below.
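From the Command Prompt, that second method would look roughly like this (a sketch, assuming winutils.exe was saved to your Downloads folder and that Spark_Home is C:\Spark\spark-3.0.1-bin-hadoop2.7):
mkdir C:\Spark\spark-3.0.1-bin-hadoop2.7\hadoop\bin
copy %USERPROFILE%\Downloads\winutils.exe C:\Spark\spark-3.0.1-bin-hadoop2.7\hadoop\bin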
The Spark_Home used in this guide was mentioned before as C:\spark\spark-3.0.1-bin-hadoop2.7
Finally, all you need is to create another system environment variable (the same way as in step 2.2):
- The name of this variable is: Hadoop_Home
- The value is: C:\spark\spark-3.0.1-bin-hadoop2.7\hadoop, or simply %Spark_Home%\hadoop
With the second form, if Spark_Home is changed you do not need to update Hadoop_Home.
Here is how the full path to winutils.exe should look: C:\spark\spark-3.0.1-bin-hadoop2.7\hadoop\bin\winutils.exe
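As with Spark_Home, you can also set this variable from the Command Prompt (a minimal sketch, assuming the same path as above; open a new prompt afterwards so the change is picked up):
setx Hadoop_Home "C:\spark\spark-3.0.1-bin-hadoop2.7\hadoop"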
3- Using Pyspark and Spark from Jupyter notebook:
3.1- Open your “Anaconda Prompt”; I would recommend creating a separate environment using:
conda create -n spark python=3.6
conda activate spark
3.2- Install findspark using the next instructions:
python -m pip install findspark
To open your Jupyter notebook, just type the next instruction in “Anaconda Prompt”:
jupyter notebook
3.3- Create a new notebook using New → Python 3 and type the next code to verify whether Spark was successfully installed or not:
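A minimal version of that check, relying on the findspark package installed in step 3.2 and the Spark_Home variable set earlier, would be:
import findspark
findspark.init()  # make the Spark installation under Spark_Home importable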
If you did not get any error, everything is perfect and you have installed Spark correctly.
If you would like to test more, use the next script:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.sql("select 'Spark' as IS_Here")
df.show()
You should get the following output:
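df.show() renders the dataframe as a small ASCII table, roughly:
+-------+
|IS_Here|
+-------+
|  Spark|
+-------+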
Pyspark is the Python API of Spark; it is very powerful and can be used in many cases, from data analysis to the application and exploration of machine learning algorithms.
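For example, here is a small sketch of the kind of dataframe manipulation you can now run from your notebook (the data and column names below are made up purely for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# a tiny made-up dataset, just to illustrate the API
data = [("Alice", 34), ("Bob", 45), ("Carol", 29)]
df = spark.createDataFrame(data, ["name", "age"])

# filter rows and compute a simple aggregate
df.filter(df.age > 30).show()
df.groupBy().avg("age").show()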