
Sentiment Analysis on Tweets with Apache Pig Using AFINN Dictionary

In this post, we will discuss how to perform sentiment analysis on Twitter data using Pig. To begin with, we will collect real-time tweets from Twitter using Flume.
You can refer to this blog to get a clear idea of how to collect tweets in real time using Apache Flume.
All the real-time tweets are stored in the HDFS location '/user/flume/tweets'. You can refer to the screenshot below for the same.

The data from Twitter is in JSON format, so a Pig JsonLoader is required to load it into Pig. You need to download the JARs required for the JsonLoader from the link below:

Required Jars for JsonLoader

Register the downloaded JARs in Pig using the commands below:

REGISTER '/home/kiran/Desktop/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/home/kiran/Desktop/elephant-bird-pig-4.1.jar';
REGISTER '/home/kiran/Desktop/json-simple-1.1.1.jar';

You can refer to the screenshot below for the same.

Note: You need to provide the path of the jar file accordingly.
After registering the required jars, we can now write a Pig script to perform Sentiment Analysis.
Below is a sample tweet collected for this purpose:

{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":"FilmFan","truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":689085590822891521,"in_reply_to_user_id_str":"6048122","timestamp_ms":"1453125782100","in_reply_to_status_id":null,"created_at":"Mon Jan 18 14:03:02 +0000 2016","favorite_count":0,"place":null,"coordinates":null,"text":"@filmfan hey its time for you guys follow @acadgild To #AchieveMore and participate in contest Win Rs.500 worth vouchers","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[],"hashtags":[{"text":"AchieveMore","indices":[56,68]}],"user_mentions":[{"id":6048122,"name":"Tanya","indices":[0,8],"screen_name":"FilmFan","id_str":"6048122"},{"id":2649945906,"name":"ACADGILD","indices":[42,51],"screen_name":"acadgild","id_str":"2649945906"}]},"is_quote_status":false,"source":"<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck<\/a>","favorited":false,"in_reply_to_user_id":6048122,"retweet_count":0,"id_str":"689085590822891521","user":{"location":"India ","default_profile":false,"profile_background_tile":false,"statuses_count":86548,"lang":"en","profile_link_color":"94D487","profile_banner_url":"https://pbs.twimg.com/profile_banners/197865769/1436198000","id":197865769,"following":null,"protected":false,"favourites_count":1002,"profile_text_color":"000000","verified":false,"description":"Proud Indian, Digital Marketing Consultant,Traveler, Foodie, Adventurer, Data Architect, Movie Lover, Namo Fan","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Bahubali","profile_background_color":"000000","created_at":"Sat Oct 02 17:41:02 +0000 
2010","default_profile_image":false,"followers_count":4467,"profile_image_url_https":"https://pbs.twimg.com/profile_images/664486535040000000/GOjDUiuK_normal.jpg","geo_enabled":true,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","follow_request_sent":null,"url":null,"utc_offset":19800,"time_zone":"Chennai","notifications":null,"profile_use_background_image":false,"friends_count":810,"profile_sidebar_fill_color":"000000","screen_name":"Ashok_Uppuluri","id_str":"197865769","profile_image_url":"http://pbs.twimg.com/profile_images/664486535040000000/GOjDUiuK_normal.jpg","listed_count":50,"is_translator":false}}

The tweets are in nested JSON format and contain map data types. We need to load the tweets using a JsonLoader that supports maps, so we use the Elephant Bird JsonLoader.
Below is the first Pig statement, which loads the tweets into Pig:

load_tweets = LOAD '/user/flume/tweets/' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;

When we dump the above relation, we can see that all the tweets were loaded successfully.

Now, we shall extract the id and the tweet text from the above tweets. The Pig statement necessary to perform this is shown below:

extract_details = FOREACH load_tweets GENERATE myMap#'id' AS id, myMap#'text' AS text;

We can see the extracted id and tweet text in the screenshot below.

We now have the tweet id and the tweet text in the relation named extract_details.

Now, we shall extract the words from the text using Pig's built-in TOKENIZE function.

tokens = FOREACH extract_details GENERATE id, text, FLATTEN(TOKENIZE(text)) AS word;

From the screenshot below, we can see that the text has been split into individual words.
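To make the tokenization step concrete, here is a plain-Python sketch of what TOKENIZE plus FLATTEN produce: one output row of (id, text, word) per word. The tweet values are made-up samples, and a simple whitespace split only approximates Pig's TOKENIZE (which also splits on a few other delimiters such as commas and parentheses).

```python
def tokenize(text):
    # Approximation of Pig's TOKENIZE: split the text into words.
    return text.split()

# Hypothetical sample tweet: (id, text)
tweet = (689085590822891521, "hey its time for you guys follow @acadgild")

# FLATTEN turns the bag of words into one row per word,
# carrying along the id and the full text.
tokens = [(tweet[0], tweet[1], word) for word in tokenize(tweet[1])]
for row in tokens:
    print(row)
```

Each row pairs the original tweet with one of its words, which is exactly the shape the join in the next step needs.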

Now, we have to analyze the sentiment of the tweets using the words in the text. We will rate each word according to its meaning, from -5 to +5, using the AFINN dictionary. AFINN is a dictionary of around 2,500 words, each rated between -5 (most negative) and +5 (most positive). You can download the dictionary from the following link:

AFINN dictionary

We will load the dictionary into Pig by using the below statement:

dictionary = LOAD '/AFINN.txt' USING PigStorage('\t') AS (word:chararray, rating:int);


We can see the contents of the AFINN dictionary in the screenshot below.

Now, let's perform a map-side join between the tokens relation and the dictionary using this command:

word_rating = JOIN tokens BY word LEFT OUTER, dictionary BY word USING 'replicated';

We can see the schema of the resulting relation by using the command below:

describe word_rating;

In the above screenshot, we can see that word_rating joins the tokens relation (id, tweet text, word) with the dictionary (word, rating).
Now we will extract the id, tweet text, and word rating (from the dictionary) using the statement below:

rating = FOREACH word_rating GENERATE tokens::id AS id, tokens::text AS text, dictionary::rating AS rate;

We can now see the schema of the relation rating by using the command describe rating.

In the above screenshot, we can see that our relation now consists of the id, the tweet text, and a rate for each word.
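The join and projection above can be sketched in plain Python as a dictionary lookup. The AFINN excerpt and token rows here are tiny made-up samples, not the real files. As in the Pig left outer join, words missing from the dictionary keep their row but get a null (None) rating.

```python
# Tiny sample in the style of AFINN entries (word -> rating); not the real file.
afinn = {"win": 4, "worth": 2}

# Sample (id, text, word) rows, as produced by the tokenize step.
tokens = [
    (689085590822891521, "win vouchers", "win"),
    (689085590822891521, "win vouchers", "vouchers"),
]

# Left outer join: keep every token row; dict.get() returns None
# for words absent from the dictionary, mirroring Pig's null rating.
rating = [(tid, text, afinn.get(word)) for tid, text, word in tokens]
```

The left outer join matters: without it, tweets made up entirely of neutral words would disappear from the result.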
Now, we will group the ratings of all the words in each tweet using the statement below:

word_group = GROUP rating BY (id, text);

Here we have grouped by two fields: the id and the tweet text.
Now, let's compute the average rating of the words in each tweet.

avg_rate = FOREACH word_group GENERATE group, AVG(rating.rate) AS tweet_rating;

Now we have calculated the average rating of each tweet from the ratings of its individual words. You can refer to the image below for the same.
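The grouping and averaging steps can be sketched in plain Python as follows. The rating rows are made-up samples; the sketch skips None ratings when averaging, which matches Pig's AVG ignoring null values.

```python
from collections import defaultdict

# Sample (id, text, rate) rows from the join step; None means the word
# was not found in the AFINN dictionary.
rating = [
    (1, "win worth vouchers", 4),
    (1, "win worth vouchers", 2),
    (1, "win worth vouchers", None),
]

# GROUP rating BY (id, text): collect all word ratings per tweet.
groups = defaultdict(list)
for tid, text, rate in rating:
    groups[(tid, text)].append(rate)

# AVG(rating.rate): average the non-null ratings of each tweet.
avg_rate = {}
for key, rates in groups.items():
    known = [r for r in rates if r is not None]
    avg_rate[key] = sum(known) / len(known) if known else None
```

Note that the average is taken only over words that appear in the dictionary, so a single strong word in an otherwise neutral tweet still dominates the score.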

From the above relation, we get all the tweets, i.e., both positive and negative.
We can classify a tweet as positive when its average rating lies between 0 and +5, and as negative when it lies between -5 and 0.

We have now successfully performed sentiment analysis on Twitter data using Pig. Since we have each tweet and its rating, let's perform an operation to filter out the positive tweets.
We will filter the positive tweets using the statement below:

positive_tweets = FILTER avg_rate BY tweet_rating >= 0;

We can see the positive tweets in the screenshot below, which shows the tweet id, the tweet text, and the rating of each.
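The final filter can be sketched in plain Python over made-up averaged scores: a tweet is kept as positive when its average rating is at least 0, matching the FILTER condition above.

```python
# Hypothetical averaged scores per (id, text) tweet.
avg_rate = {
    (1, "great contest win vouchers"): 2.5,
    (2, "what a terrible day"): -1.5,
}

# FILTER avg_rate BY tweet_rating >= 0
positive_tweets = {key: r for key, r in avg_rate.items() if r >= 0}
```

Swapping the condition to `r < 0` would yield the negative tweets in the same way.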
To perform sentiment analysis according to the tweet time zone, refer to this blog.
We hope you found this post helpful for performing sentiment analysis on Twitter data using Pig. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
