Build Your First Application in Spark

In this blog, we will implement a first Spark application by running the word count program, and then create a histogram showing the count of each word using the Matplotlib package in Python.

We recommend that readers refer to our previous blogs on Spark installation and RDD operations.

We use the file first_app as the input file for building our first application.

Creating RDD from the input file, first_app

We take the file first_app and create an RDD from it using SparkContext’s textFile method.

Counting the number of lines from RDD

In this step, we display the number of lines in the RDD created in the previous step.
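The two steps above can be sketched as follows. This is a minimal sketch, assuming the PySpark shell, where the SparkContext is already available as `sc`, and assuming first_app sits in the working directory. The helper names `load_lines` and `count_lines` are illustrative and do not come from the original screenshots.

```python
def load_lines(sc, path="first_app"):
    """Return an RDD of strings, one element per line of the file at `path`.

    `sc` is the SparkContext (predefined as `sc` in the PySpark shell).
    textFile is lazy: the file is not read until an action runs on the RDD.
    """
    return sc.textFile(path)


def count_lines(lines):
    """Run the count action, which returns the number of elements as an int."""
    return lines.count()
```

In the shell, `myfile = load_lines(sc)` is equivalent to `myfile = sc.textFile("first_app")`, and `count_lines(myfile)` is equivalent to `myfile.count()`.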

Applying Map transformation to create new RDD with total character count

In this step, we count the number of characters in the myfile RDD using a map transformation and store the result in a Python object named num_char.
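One way to write this step is sketched below. It assumes that num_char holds the RDD of per-line character counts produced by map; the original script is not shown, so the helper name `line_lengths` is hypothetical.

```python
def line_lengths(lines):
    """Transformation: map every line to its number of characters.

    Returns a new RDD of ints. Like all transformations, map is lazy,
    so nothing is computed until an action runs on the result.
    """
    return lines.map(len)
```

In the shell this would be used as `num_char = line_lengths(myfile)`, which is the same as `num_char = myfile.map(len)`. Note that textFile strips line terminators, so the newline at the end of each line is not counted.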

Displaying the number of characters in the num_char RDD

In this step, we display the total number of characters computed from the num_char RDD.
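A possible way to get the total is sketched here, assuming num_char is an RDD of per-line counts and that the total is obtained with the reduce action. The helper name `total_characters` is illustrative.

```python
from operator import add


def total_characters(lengths):
    """Action: sum an RDD of per-line character counts.

    reduce is an action, so the return value is a plain Python int,
    not an RDD.
    """
    return lengths.reduce(add)
```

Calling `total_characters(num_char)` in the shell is equivalent to `num_char.reduce(lambda a, b: a + b)`. The distinction matters: the output of map is still an RDD, while the output of reduce is an ordinary number.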


Splitting the words from myfile RDD

In this step, we need to extract all the words in myfile RDD by using the given regular expression.

The script below will display the words. Refer to the screenshot below, where all the words are split.
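The regular expression used in the original script is not reproduced here, so the sketch below assumes a simple pattern, `\W+`, that splits on runs of non-word characters. The names `WORD_SPLIT`, `split_line`, and `split_words` are hypothetical.

```python
import re

# Assumption: the original pattern is not shown; \W+ splits on any run of
# characters that are not letters, digits, or underscores.
WORD_SPLIT = re.compile(r"\W+")


def split_line(line):
    """Split one line into words, dropping the empty strings re.split can yield."""
    return [word for word in WORD_SPLIT.split(line) if word]


def split_words(lines):
    """Transformation: flatMap flattens the per-line word lists into one RDD of words."""
    return lines.flatMap(split_line)
```

In the shell, `split_words(myfile).collect()` would return the words as a Python list. flatMap is used rather than map because map would produce one list per line instead of one element per word.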

Filtering words with length greater than 3

We need to filter out the short words, keeping only the words whose length is greater than 3, and store the result in the RDD filtered_word.
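A minimal sketch of the filter step, following the heading and keeping words longer than 3 characters. The helper names `is_long_word` and `keep_long_words` are illustrative, and the threshold is exposed as a parameter only for clarity.

```python
def is_long_word(word, min_length=3):
    """Return True when the word has more than `min_length` characters."""
    return len(word) > min_length


def keep_long_words(words):
    """Transformation: keep only the elements for which the predicate is True."""
    return words.filter(is_long_word)
```

Given an RDD of words, `filtered_word = keep_long_words(words)` is the same as `words.filter(lambda w: len(w) > 3)`.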

Setting split words with count 1

We apply the map transformation to filtered_word so that every word is paired with the count 1. This creates a new RDD, filtered_word1, containing pairs of the form (word, 1).

Refer to the script below. It displays a list of key-value pairs, each consisting of a word and the count 1.
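The pairing step could look like the sketch below. The helper names `to_pair` and `pair_words` are hypothetical; only the RDD names come from the text.

```python
def to_pair(word):
    """Turn a word into a (key, value) tuple with an initial count of 1."""
    return (word, 1)


def pair_words(words):
    """Transformation: map each word to a (word, 1) pair."""
    return words.map(to_pair)
```

In the shell, `filtered_word1 = pair_words(filtered_word)` is equivalent to `filtered_word.map(lambda w: (w, 1))`, and `filtered_word1.collect()` displays the list of pairs.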

Adding the number of occurrences for every key

In this step, reduceByKey creates a new RDD containing the count for every key received from the previous RDD, filtered_word1.
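A sketch of the aggregation, assuming the input is an RDD of (word, 1) pairs. The helper name `count_words` is illustrative.

```python
from operator import add


def count_words(pairs):
    """Transformation: merge the values of each key with addition.

    For (word, 1) pairs this yields one (word, total) pair per distinct word.
    """
    return pairs.reduceByKey(add)
```

In the shell, `count_words(filtered_word1).collect()` returns a list of (word, count) tuples. Unlike reduce, reduceByKey is a transformation and returns an RDD, so an action such as collect is still needed to see the result.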

The screenshot below shows the Python script to be run in the Spark shell to display the histogram representing the frequency of each word.
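Since the original plotting script is not reproduced here, the following is only a sketch of one way to draw such a chart with Matplotlib's bar function. It assumes the word counts have been collected to the driver as a list of (word, count) tuples; the helper names `top_counts` and `plot_word_counts` and the `top_n` limit are hypothetical additions.

```python
def top_counts(word_counts, n):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return sorted(word_counts, key=lambda pair: pair[1], reverse=True)[:n]


def plot_word_counts(word_counts, top_n=10):
    """Draw a bar chart of word frequencies for a list of (word, count) pairs."""
    # Imported here so the helper above stays usable without Matplotlib installed.
    import matplotlib.pyplot as plt

    top = top_counts(word_counts, top_n)
    words = [word for word, _ in top]
    counts = [count for _, count in top]
    positions = range(len(words))

    plt.bar(positions, counts)
    plt.xticks(positions, words, rotation=45)
    plt.xlabel("Word")
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.show()
```

The list passed in would typically come from calling collect on the RDD of word counts. Limiting the plot to the most frequent words keeps the axis labels readable when the input file contains many distinct words.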

The histogram representing the frequency of each word is shown below.

We hope this blog helped you get started with Spark development using Python. Keep visiting our website, Acadgild, for more updates on Big Data and other technologies.


One Comment

  1. In the above example, num_char is mentioned as an RDD. It's not an RDD; it's the result of an action (reduce). It's of Long type (I suppose).
    Correct me if wrong.
