Spark Integration With Jupyter Notebook In 10 Minutes

In this post, we will discuss how to integrate Apache Spark with Jupyter Notebook on Windows.

Jupyter Notebook is a popular application that lets you run PySpark code interactively before running the actual job on a cluster. It is also user-friendly, so in this blog we are going to show you how to integrate PySpark with Jupyter Notebook.

What you will learn:

Install and configure Anaconda on Windows.

  • Step by Step Guide To Install Anaconda (Jupyter notebook)

Setup Winutils For Hadoop and Spark.

  • Download and setup winutils.exe

Install Spark On Windows 

  • Download Spark Binaries 
  • Create Folders For Spark
  • Set Environment Variables For Spark

Integrate Spark With Jupyter Notebook

  • Install the findspark module
  • Run the Spark Code In Jupyter Notebook 

System Prerequisites:

  • Installed Anaconda software 
  • Minimum 4 GB RAM
  • Minimum 500 GB Hard Disk

Before jumping into the installation process, you have to install Anaconda, which is the first requirement listed in the prerequisites section.

“Installing Anaconda On Windows”

Ajit Khutal

Step 1: Click on this link to install anaconda on windows.

“Installing PySpark On Windows”

Step 1: Create a directory named pyspark on the D drive.

Step 2: Download Spark and extract the downloaded file using the 7-Zip extractor.

Link To Download Spark: https://spark.apache.org/downloads.html

Extract the downloaded file into the pyspark folder created in Step 1.
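If you would rather not use 7-Zip, the extraction can also be done from Python. The following is a minimal sketch, assuming the download is a gzip-compressed tar archive (.tgz); the function name extract_spark_archive and the paths in the usage comment are hypothetical and not part of the original steps.

```python
import tarfile
from pathlib import Path


def extract_spark_archive(archive_path, target_dir):
    """Extract a .tgz Spark download into target_dir.

    Returns the sorted names of the entries found directly under target_dir.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return sorted(entry.name for entry in target.iterdir())


# Hypothetical usage; the archive name depends on the version you downloaded:
# print(extract_spark_archive(r"D:\Downloads\spark-<version>.tgz", r"D:\pyspark"))
```

Spark archives usually unpack into a versioned subfolder. If that happens, either move its contents up into D:\pyspark or make sure SPARK_HOME later points to the folder that directly contains bin.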

Step 3: Create a folder named hadoop on the D drive and create a subfolder named bin inside the hadoop folder.

Step 4: Download winutils.exe

Download the winutils.exe file from the link below and store it in the D:\hadoop\bin folder created in Step 3.

Link: Winutils.exe
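A quick way to confirm that the file landed in the right place is a small check like the one below. This is only a sketch; the helper names winutils_path and has_winutils are made up for illustration, and D:\hadoop is the folder from Step 3.

```python
from pathlib import Path


def winutils_path(hadoop_home):
    """Return the expected location of winutils.exe under a Hadoop home folder."""
    return Path(hadoop_home) / "bin" / "winutils.exe"


def has_winutils(hadoop_home):
    """Return True if winutils.exe exists as a file under <hadoop_home>/bin."""
    return winutils_path(hadoop_home).is_file()


# On the machine set up in this post you would call:
# print(has_winutils(r"D:\hadoop"))
```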

“Set Environment Variables For PySpark”

Step 5: Open the environment variables settings.

Step 6: Click on the “Environment Variables” button.

Step 7: A new window opens. Click on the New button shown in the image below.

Step 8: Once you click on the New button, you will get the window below. Fill in the details as shown in the image.

  • Variable name: SPARK_HOME
  • Variable value: D:\pyspark

We have now set the user variable for PySpark.

Now we will set the “system variables” for Spark.

Step 9: Click on Path and then on Edit, as shown in the image below.

Step 10: Once you click on the Edit button, a new window opens. Click on the New button.

Step 11: After clicking on the New button, type the path D:\pyspark\bin and then click on the OK button.

Path=D:\pyspark\bin

We have now set both the user and system environment variables for PySpark.
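To double-check the result, you can inspect the variables from Python. The snippet below is a rough sketch; check_spark_env is a hypothetical helper name, and it only looks at SPARK_HOME and PATH as seen by the current process.

```python
import os


def check_spark_env(env=None):
    """Report whether SPARK_HOME is set and its bin folder is on PATH."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME", "")
    # Normalise paths so case and trailing separators do not cause false negatives.
    expected_bin = os.path.normcase(os.path.normpath(os.path.join(spark_home, "bin")))
    path_entries = {
        os.path.normcase(os.path.normpath(entry))
        for entry in env.get("PATH", "").split(os.pathsep)
        if entry
    }
    return {
        "spark_home_set": bool(spark_home),
        "bin_on_path": bool(spark_home) and expected_bin in path_entries,
    }


print(check_spark_env())
```

Newly created environment variables are only visible to programs started afterwards, so reopen Anaconda Prompt or Jupyter Notebook before running the check.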

“Set the Environment Variable For Hadoop (winutils.exe)”

Step 12: Set Hadoop Home

Just as we set the environment variable for PySpark, we have to do the same for Hadoop (winutils.exe). Spark looks for winutils.exe under %HADOOP_HOME%\bin, so the user variable is conventionally named HADOOP_HOME and, with the folders from Step 3, points to D:\hadoop.

User Variable

System Variable

Step 13: Click on the Edit button as shown in the image below.

Step 14: Set Hadoop Home (bin)

After clicking the Edit button, you will get a new window as shown in the image below. Click on the New button and type the path D:\hadoop\bin.

We have now finished setting up the environment variables for Hadoop (winutils.exe) and PySpark.
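If you cannot change the Windows settings, the same values can be set for a single Python session instead. The sketch below assumes the variable names SPARK_HOME and HADOOP_HOME and the D:\pyspark and D:\hadoop folders used in this post; apply_session_env is a hypothetical helper, and the changes last only as long as the process.

```python
import os


def apply_session_env(spark_home, hadoop_home, env=None):
    """Set SPARK_HOME and HADOOP_HOME and put their bin folders first on PATH.

    Only the given mapping (os.environ by default) is modified, so the
    Windows settings themselves stay untouched.
    """
    env = os.environ if env is None else env
    env["SPARK_HOME"] = spark_home
    env["HADOOP_HOME"] = hadoop_home
    bins = [os.path.join(spark_home, "bin"), os.path.join(hadoop_home, "bin")]
    existing = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    # Prepend the bin folders and drop any duplicates further down the list.
    env["PATH"] = os.pathsep.join(bins + [p for p in existing if p not in bins])
    return env


# In a notebook cell, before starting Spark:
# apply_session_env(r"D:\pyspark", r"D:\hadoop")
```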

“Integrate PySpark With Jupyter Notebook”

Step 15: Open the Windows Start menu and search for “Anaconda Prompt”.

Step 16: Download and install the findspark module using the command below.
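The exact command from the original screenshot is not preserved here. findspark is published on PyPI, so it is typically installed from Anaconda Prompt with pip install findspark. Afterwards, a small check like the following (a sketch; is_installed is a hypothetical helper) shows whether the module is visible to the Python environment you are running.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be found by the current Python environment."""
    return importlib.util.find_spec(module_name) is not None


print("findspark installed:", is_installed("findspark"))
```

If this prints False inside Jupyter Notebook but the install succeeded, the notebook is probably using a different Python environment than the prompt.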

Step 17: Now open Jupyter notebook and type the following code.
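The code from the original screenshot is not available, so here is a minimal sketch of what such a notebook cell might look like. It assumes findspark is installed and SPARK_HOME is set as described above; the helper names, the app name, and the sample rows are made up for illustration.

```python
def start_local_spark(app_name="jupyter-spark-check"):
    """Locate Spark through findspark and return a local SparkSession."""
    # Imported inside the function so the helper can be defined
    # even before findspark and Spark are fully set up.
    import findspark

    findspark.init()  # uses SPARK_HOME to make pyspark importable

    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # run Spark locally on all available cores
        .appName(app_name)
        .getOrCreate()
    )


def run_smoke_test(spark):
    """Build a tiny DataFrame, display it, and return its row count."""
    df = spark.createDataFrame([(1, "spark"), (2, "jupyter")], ["id", "name"])
    df.show()
    return df.count()


# In a notebook cell, once the setup above is complete:
# spark = start_local_spark()
# print(run_smoke_test(spark))
```

If the cell displays the small table without errors, Spark is reachable from the notebook.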

As you can see from the above screenshot, we have successfully installed Spark and integrated it with Jupyter Notebook.

In Conclusion

We hope this post helped you learn how to integrate Spark and PySpark with Jupyter Notebook.

Keep visiting our website AcadGild for further updates on data science and other technologies.

Ajit Khutal

Ajit Khutal has been working with AcadGild as an Associate Big Data Analyst with expertise in Big Data technologies like Hadoop, Spark, Kafka, and NiFi. He is a Python enthusiast and has been involved in implementing many analytics projects across domains such as e-commerce, banking, and education.
