In this post, we will walk through how to integrate Apache Spark with Jupyter Notebook on Windows.
Jupyter Notebook is a popular, user-friendly application that lets you try out PySpark code interactively before running the actual job on a cluster, so in this blog we are going to show you how to integrate PySpark with Jupyter Notebook.
What you will learn:
Install and configure Anaconda on Windows.
- Step by Step Guide To Install Anaconda (Jupyter Notebook)
Set Up Winutils For Hadoop and Spark.
- Download and set up winutils.exe
Install Spark On Windows
- Download Spark Binaries
- Create Folders For Spark
- Set Environment Variables For Spark
Integrate Spark With Jupyter Notebook
- Install the findspark Module
- Run the Spark Code In Jupyter Notebook
Prerequisites:
- Anaconda software installed
- Minimum 4 GB RAM
- Minimum 500 GB hard disk
Before jumping into the installation process, you have to install Anaconda, which is the first requirement listed in the prerequisites.
“Installing Anaconda On Windows”
Step 1: Click on this link to install Anaconda on Windows.
“Installing PySpark On Windows”
Step 1: Create a directory named pyspark on the D drive.
Step 2: Download Spark and extract the downloaded file using the 7-Zip extractor.
Link To Download Spark: https://spark.apache.org/downloads.html
Extract the downloaded file into the pyspark folder created in Step 1, so that Spark's bin folder ends up at D:\pyspark\bin (the path used in Step 11).
Step 3: Create a folder named hadoop on the D drive, and create a subfolder named bin inside the hadoop folder.
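If you prefer to script the folder creation from Steps 1 and 3 instead of using File Explorer, a minimal Python sketch could look like the following. The function name create_spark_folders is our own, and the sketch assumes you pass the drive root (for example "D:/") yourself.

```python
from pathlib import Path


def create_spark_folders(drive_root):
    """Create <root>/pyspark and <root>/hadoop/bin and return both paths."""
    root = Path(drive_root)
    spark_dir = root / "pyspark"          # Step 1: folder for the Spark binaries
    hadoop_bin = root / "hadoop" / "bin"  # Step 3: folder that will hold winutils.exe
    for folder in (spark_dir, hadoop_bin):
        # parents=True also creates "hadoop"; exist_ok=True makes reruns harmless
        folder.mkdir(parents=True, exist_ok=True)
    return spark_dir, hadoop_bin
```

On the machine used in this post you would call create_spark_folders("D:/"); running it twice does no harm because existing folders are left untouched.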
Step 4: Download winutils.exe
Download the winutils.exe file from the link below and save it in the D:\hadoop\bin folder created in Step 3.
“Set Environment Variables For PySpark”
Step 5: Open environment variables.
Step 6: Click on the “Environment Variables” button.
Step 7: A new window opens. Click on the New button shown in the image below.
Step 8: Once you click on the New button, you will get the window below. Fill in the details as follows.
- Variable name: SPARK_HOME
- Variable value: D:\pyspark
We have now successfully set the user variable for PySpark.
Now we will set the “system variables” for Spark.
Step 9: Select the Path variable and click Edit, as shown in the image below.
Step 10: Once you click on the Edit button, a new window opens. Click on the New button.
Step 11: After clicking on the New button, type the path D:\pyspark\bin and then click on the OK button.
So we have successfully set the user and system environment variables for pyspark.
“Set the Environment Variable For Hadoop (winutils.exe)”
Step 12: Set Hadoop Home
Just as we set the PySpark environment variable, we have to do the same for Hadoop (winutils.exe): create a HADOOP_HOME variable that points to the D:\hadoop folder from Step 3.
Step 13: Click on the Edit button as shown in the image below.
Step 14: Set Hadoop Home (Bin)
After clicking the Edit button, you will get a new window as shown in the image below. Click on the New button and type the path D:\hadoop\bin.
We have completed setting up the environment variables for Hadoop (winutils.exe) and PySpark.
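Before moving on, it can help to confirm that the variables are visible to Python. The sketch below is one way to do that, not part of the original setup: check_spark_env is a hypothetical helper, and it assumes the variable behind "Set Hadoop Home" is named HADOOP_HOME. Open a new Anaconda Prompt or restart Jupyter first, since already running programs do not see newly added variables.

```python
import ntpath
import os


def check_spark_env(env):
    """Return a list of problems found in the Spark/Hadoop variables of `env`."""
    problems = []

    def norm(path):
        # Compare Windows paths case-insensitively and without trailing slashes
        return ntpath.normcase(ntpath.normpath(path))

    # Windows separates PATH entries with ';'
    path_entries = {norm(p) for p in env.get("PATH", "").split(";") if p}

    for name in ("SPARK_HOME", "HADOOP_HOME"):  # HADOOP_HOME is an assumed name
        home = env.get(name)
        if not home:
            problems.append(f"{name} is not set")
            continue
        bin_dir = ntpath.join(home, "bin")
        if norm(bin_dir) not in path_entries:
            problems.append(f"{bin_dir} is not on PATH")
    return problems


# Check the variables of the current session
print(check_spark_env(dict(os.environ)) or "environment looks consistent")
```

The helper only compares strings; it does not verify that the folders or winutils.exe actually exist on disk.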
“Integrate PySpark With Jupyter Notebook”
Step 15: Open the Windows Start menu and search for “Anaconda Prompt”.
Step 16: Download and install the findspark module using the command below.
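findspark is published on PyPI, so one common way to install it from the Anaconda Prompt is python -m pip install findspark; treat that as an assumption about your setup rather than the only option. The small sketch below (findspark_status is our own helper name) checks from Python whether the module is already importable and prints that command if it is not.

```python
import importlib.util


def findspark_status():
    """Return a message saying whether findspark can be imported here."""
    if importlib.util.find_spec("findspark") is None:
        # Suggested command; assumes pip is available in the Anaconda environment
        return "findspark is missing; run: python -m pip install findspark"
    return "findspark is installed"


print(findspark_status())
```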
Step 17: Now open Jupyter notebook and type the following code.
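A minimal sketch of what such a notebook cell might look like, assuming findspark is installed and SPARK_HOME is set as in Step 8. The function name, app name and sample rows are illustrative, not taken from the original post.

```python
import importlib.util
import os


def spark_smoke_test(app_name="jupyter-spark-check"):
    """Start a local Spark session, show a tiny DataFrame and return its row count."""
    import findspark

    # Locates the Spark installation (via SPARK_HOME) and adds pyspark to sys.path
    findspark.init()

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()
    try:
        df = spark.createDataFrame([(1, "spark"), (2, "jupyter")], ["id", "name"])
        df.show()
        return df.count()
    finally:
        spark.stop()  # release the local Spark resources


# Only start Spark when the earlier steps appear to be complete
ready = importlib.util.find_spec("findspark") is not None and "SPARK_HOME" in os.environ
if ready:
    print("rows:", spark_smoke_test())
else:
    print("findspark or SPARK_HOME is missing; finish Steps 8 and 16 first")
```

If the cell prints a two-row table followed by the row count, Spark is reachable from the notebook.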
As you can see from the above screenshot, we have successfully installed Spark and integrated it with Jupyter Notebook.
We hope this post helped you learn how to integrate Spark and PySpark with Jupyter Notebook.
Keep visiting our website AcadGild for further updates on data science and other technologies.