
Twitter Sentiment Analysis Using Spark

Sentiment analysis is the process of identifying the opinion expressed in a piece of text about a person, a thing, or a topic. Sentiment analysis determines whether the author has a positive, negative, or neutral opinion about that topic.

In this blog, we will perform Twitter sentiment analysis using Spark. Previously, we performed sentiment analysis with Hadoop ecosystem tools, i.e., MapReduce, Hive, and Pig. You can refer to those blogs using the links below.

Sentiment analysis using MapReduce

Sentiment analysis using Hive

Sentiment analysis using Pig

What is Spark?

Apache Spark is a cluster computing framework that can run on top of Hadoop and handle many types of data. It is a one-stop solution to many problems. Spark has rich libraries for handling data and, most importantly, it can be 10-20x faster than Hadoop's MapReduce for many workloads. It achieves this speed through its in-memory primitives: data is cached in memory (RAM), and computations are performed in-memory wherever possible.

Spark's ecosystem covers most of the components of Hadoop. For example, Spark can perform both batch processing and real-time data processing without additional tools such as Kafka or Flume; it has its own streaming engine called Spark Streaming.

We can perform various functions with Spark:

  • SQL operations: It has its own SQL engine called Spark SQL, which covers the features of both SQL and Hive.

  • Machine learning: It has its own machine learning library, MLlib, so it can perform machine learning without the help of Mahout.

  • Graph processing: It performs graph processing using the GraphX component.

Let us now write a Spark program to calculate sentiments.

The sample twitter tweet is as follows:

{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":"FilmFan","truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":689085590822891521,"in_reply_to_user_id_str":"6048122","timestamp_ms":"1453125782100","in_reply_to_status_id":null,"created_at":"Mon Jan 18 14:03:02 +0000 2016","favorite_count":0,"place":null,"coordinates":null,"text":"@filmfan hey its time for you guys follow @acadgild  To #AchieveMore and participate in contest Win Rs.500 worth vouchers","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[],"hashtags":[{"text":"AchieveMore","indices":[56,68]}],"user_mentions":[{"id":6048122,"name":"Tanya","indices":[0,8],"screen_name":"FilmFan","id_str":"6048122"},{"id":2649945906,"name":"ACADGILD","indices":[42,51],"screen_name":"acadgild","id_str":"2649945906"}]},"is_quote_status":false,"source":"<a href=\"\" rel=\"nofollow\">TweetDeck<\/a>","favorited":false,"in_reply_to_user_id":6048122,"retweet_count":0,"id_str":"689085590822891521","user":{"location":"India ","default_profile":false,"profile_background_tile":false,"statuses_count":86548,"lang":"en","profile_link_color":"94D487","profile_banner_url":"","id":197865769,"following":null,"protected":false,"favourites_count":1002,"profile_text_color":"000000","verified":false,"description":"Proud Indian, Digital Marketing Consultant,Traveler, Foodie, Adventurer, Data Architect, Movie Lover, Namo Fan","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Bahubali","profile_background_color":"000000","created_at":"Sat Oct 02 17:41:02 +0000 
2010","default_profile_image":false,"followers_count":4467,"profile_image_url_https":"","geo_enabled":true,"profile_background_image_url":"","profile_background_image_url_https":"","follow_request_sent":null,"url":null,"utc_offset":19800,"time_zone":"Chennai","notifications":null,"profile_use_background_image":false,"friends_count":810,"profile_sidebar_fill_color":"000000","screen_name":"Ashok_Uppuluri","id_str":"197865769","profile_image_url":"","listed_count":50,"is_translator":false}}

We will now load the Twitter tweets into Spark using its sqlContext. Twitter tweets are in JSON format, so we use the jsonFile method of sqlContext to load them.

val tweets = sqlContext.jsonFile("/home/kiran/Documents/datasets/tweets")

In the below screenshot you can see that we have loaded the tweets successfully into Spark using its sqlContext.

Now we will create a temporary table for these tweets using the registerTempTable function. We have given the table the name tweet.

tweets.registerTempTable("tweet")
Now we will extract the id and the tweet_text from every tweet using the below SQL query.

val extracted_tweets = sqlContext.sql("select id, text from tweet").collect()

We now have the tweet_id and tweet_text in Spark SQL Row objects. To calculate the sentiments, we will use a dictionary called AFINN.

AFINN is a dictionary of about 2,500 words, each rated from -5 to +5 depending on its meaning.

We will load the contents of the AFINN dictionary into Spark. You can download the dictionary from the link below:

AFINN dictionary

We have loaded the AFINN dictionary into an RDD using the below line of code:

val AFINN = sc.textFile("/home/kiran/Documents/datasets/AFINN.txt").map(x => x.split("\t")).map(x => (x(0).toString, x(1).toInt))

We have created (word, rating) pairs so that we can use this RDD as a lookup RDD later on.
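As a quick check of the parsing step, here is a plain-Scala sketch (no Spark required) showing how one tab-separated AFINN line becomes a (word, rating) pair; the sample line follows the AFINN-111 file format.

```scala
// Parse one tab-separated AFINN line, e.g. "abandon\t-2", into (word, rating).
// This mirrors the two map() calls applied to the textFile RDD above.
def parseAfinnLine(line: String): (String, Int) = {
  val parts = line.split("\t")
  (parts(0), parts(1).toInt)
}

val pair = parseAfinnLine("abandon\t-2")
// pair is ("abandon", -2)
```

The same function could be passed directly to the RDD's map, since it is an ordinary String => (String, Int) function.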

Now we will calculate the sentiment of the tweet using the below function.

val tweetsSenti = extracted_tweets.map(tweetText => {
  val tweetWordsSentiment = tweetText(1).toString.split(" ").map(word => {
    var senti: Int = 0
    if (AFINN.lookup(word.toLowerCase()).length > 0) {
      senti = AFINN.lookup(word.toLowerCase())(0)
    }
    senti
  })
  val tweetSentiment = tweetWordsSentiment.sum
  (tweetSentiment, tweetText.toString)
})

The above function runs for every tweet. It takes the tweet text, splits it on spaces, and for every word performs a lookup in the AFINN RDD; whenever a word matches, its associated rating is returned and stored in the array tweetWordsSentiment. Finally, we sum the values in tweetWordsSentiment and store the result in the variable tweetSentiment.

After calculating the sentiment, we return a pair of the tweetSentiment and the tweet row, which contains the tweet_id and tweet_text.
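To see the scoring logic in isolation, here is a plain-Scala sketch that replaces the AFINN lookup RDD with an in-memory Map. The words and ratings in the Map are a tiny hand-picked subset, chosen only for illustration.

```scala
// In-memory stand-in for the AFINN lookup RDD (illustrative subset).
val afinn: Map[String, Int] = Map("win" -> 4, "worth" -> 2, "bad" -> -3)

// Same per-tweet logic as the Spark function: split on spaces,
// look up each word's rating (0 if absent), and sum the ratings.
def scoreTweet(text: String): Int =
  text.split(" ")
    .map(word => afinn.getOrElse(word.toLowerCase, 0))
    .sum

val s = scoreTweet("follow us and Win vouchers worth Rs.500")
// s is 6, from "win" (4) + "worth" (2)
```

Note that getOrElse with a default of 0 plays the role of the length check and the mutable senti variable in the Spark version.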

The above values are in an array, so we create an RDD from them using the parallelize function and sort it by sentiment in descending order, as shown below.

val tweetsSentiRDD: org.apache.spark.rdd.RDD[(Int, String)] = sc.parallelize(tweetsSenti.toList).sortBy(x => x._1, false)
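The effect of sortBy(x => x._1, false) can be seen in a plain-Scala sketch over a small list of (sentiment, text) pairs; the sample values here are made up for illustration.

```scala
// Sort (sentiment, text) pairs in descending order of sentiment,
// mirroring sortBy(x => x._1, false) on the RDD.
val scored = List((2, "tweet A"), (-3, "tweet B"), (5, "tweet C"))
val ranked = scored.sortBy(x => -x._1)
// ranked is List((5, "tweet C"), (2, "tweet A"), (-3, "tweet B"))
```

With descending order, the most positive tweets come first and the most negative come last, which is what the screenshot below displays.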

The entire flow of the Twitter sentiment analysis using Spark can be seen in the below screenshot.

In the above screenshot, you can see the tweetSentiment, tweetId, and the tweetText. So we have successfully calculated the sentiments using Apache Spark.

We hope this blog helped you understand how to perform Twitter sentiment analysis using Spark. Keep visiting our site for more updates on Big Data and other technologies.


