
Sentiment Analysis on Twitter – Time-Zone-Wise Analysis

In this post, we will discuss a Twitter use case: Sentiment Analysis is performed on the tweets, and the average sentiment is computed for each time zone of the users who tweeted, thereby giving the time-zone-wise views on a topic.

You can refer to this blog to get a clear idea of how to collect tweets in real-time.

All the real-time tweets are kept in the location ‘/user/flume/tweets’ in HDFS. You can refer to the below screenshot for the same.

The data from Twitter is in JSON format, so a Pig JsonLoader is required to load it into Pig. You need to download the required jars for the JsonLoader from the following link:

Required Jars for JsonLoader

Register the downloaded jars in Pig using the below commands. The same is shown in the screenshot below:

REGISTER '/home/kiran/Desktop/elephant-bird-hadoop-compat-4.1.jar';
 
REGISTER '/home/kiran/Desktop/elephant-bird-pig-4.1.jar';
 
REGISTER '/home/kiran/Desktop/json-simple-1.1.1.jar';


Note: You need to provide the path of the jar files accordingly.
After registering the required jars, we can now write a Pig script to perform Sentiment Analysis.
Below is a sample tweet collected for this purpose:

{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":"FilmFan","truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":689085590822891521,"in_reply_to_user_id_str":"6048122","timestamp_ms":"1453125782100","in_reply_to_status_id":null,"created_at":"Mon Jan 18 14:03:02 +0000 2016","favorite_count":0,"place":null,"coordinates":null,"text":"@filmfan hey its time for you guys follow @acadgild To #AchieveMore and participate in contest Win Rs.500 worth vouchers","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[],"hashtags":[{"text":"AchieveMore","indices":[56,68]}],"user_mentions":[{"id":6048122,"name":"Tanya","indices":[0,8],"screen_name":"FilmFan","id_str":"6048122"},{"id":2649945906,"name":"ACADGILD","indices":[42,51],"screen_name":"acadgild","id_str":"2649945906"}]},"is_quote_status":false,"source":"<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck<\/a>","favorited":false,"in_reply_to_user_id":6048122,"retweet_count":0,"id_str":"689085590822891521","user":{"location":"India ","default_profile":false,"profile_background_tile":false,"statuses_count":86548,"lang":"en","profile_link_color":"94D487","profile_banner_url":"https://pbs.twimg.com/profile_banners/197865769/1436198000","id":197865769,"following":null,"protected":false,"favourites_count":1002,"profile_text_color":"000000","verified":false,"description":"Proud Indian, Digital Marketing Consultant,Traveler, Foodie, Adventurer, Data Architect, Movie Lover, Namo Fan","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Bahubali","profile_background_color":"000000","created_at":"Sat Oct 02 17:41:02 +0000 
2010","default_profile_image":false,"followers_count":4467,"profile_image_url_https":"https://pbs.twimg.com/profile_images/664486535040000000/GOjDUiuK_normal.jpg","geo_enabled":true,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","follow_request_sent":null,"url":null,"utc_offset":19800,"time_zone":"Chennai","notifications":null,"profile_use_background_image":false,"friends_count":810,"profile_sidebar_fill_color":"000000","screen_name":"Ashok_Uppuluri","id_str":"197865769","profile_image_url":"http://pbs.twimg.com/profile_images/664486535040000000/GOjDUiuK_normal.jpg","listed_count":50,"is_translator":false}}
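To make the nested structure concrete, the same field extraction can be sketched in plain Python (illustrative only; the actual pipeline uses Pig). The record below is a trimmed-down, hypothetical version of the tweet above, keeping only the fields this walkthrough actually uses:

```python
import json

# A trimmed-down, hypothetical version of the tweet above, keeping only
# the fields this walkthrough actually uses
raw_line = ('{"id": 689085590822891521,'
            ' "text": "hey its time for you guys to #AchieveMore",'
            ' "user": {"time_zone": "Chennai", "screen_name": "Ashok_Uppuluri"}}')

tweet = json.loads(raw_line)
tweet_id = tweet["id"]
text = tweet["text"]
# time_zone sits one level down, inside the nested user map
time_zone = tweet["user"]["time_zone"]
print(time_zone, tweet_id)
```

Note how time_zone is not a top-level field but lives inside the user map; this nesting is exactly why a map-aware loader is needed on the Pig side.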

The tweets are in nested JSON format and contain map data types. We need to load the tweets using a JsonLoader that supports maps, so we are using the elephant-bird JsonLoader.

Below is the first Pig statement required to load the tweets into Pig:

load_tweets = LOAD '/user/flume/tweets/' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;


When we dump the above relation, we can see that all the tweets have been loaded successfully.

Now, let’s extract the user details, the id, and the tweet text from the above tweets. The Pig statement necessary to perform this is shown below:

extract_details = FOREACH load_tweets GENERATE myMap#'user' AS User, myMap#'id' AS id, myMap#'text' AS text;


The time_zone field is nested inside the user details, which is why we extract the whole user map first. We can view the extracted details by dumping the above relation.

In the above image, we can see that the time_zone lies inside the user details, while the id and the tweet text are extracted separately.
So, let’s extract the id, the text, and only the time_zone from the user details, using the below command:

tokens = FOREACH extract_details GENERATE User#'time_zone' AS tz, id, text;


Now, the relation tokens holds only the time_zone, id, and text of each tweet. We can view them by dumping the relation tokens.

In the above image, we can see that the time_zone, tweet id, and tweet text have been extracted as separate fields.
Next, we will perform a FLATTEN operation on the time_zone to remove the extra brackets and, at the same time, tokenize the text into words.


flat = FOREACH tokens GENERATE FLATTEN(tz) AS timezone, id, FLATTEN(TOKENIZE(text)) AS word;
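The combined effect of TOKENIZE plus FLATTEN can be mimicked in plain Python (an illustrative sketch with sample values, not part of the Pig job): each (time_zone, id, text) row becomes one (time_zone, id, word) row per word.

```python
# One (tz, id, text) row, as produced by the tokens relation (sample values)
row = ("Chennai", 689085590822891521, "time to win vouchers")

tz, tweet_id, text = row
# TOKENIZE splits the text into a bag of words on whitespace;
# FLATTEN then turns that bag into one output row per word
flat_rows = [(tz, tweet_id, word) for word in text.split()]
for r in flat_rows:
    print(r)
```

A four-word text thus yields four rows, all sharing the same time zone and tweet id.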


We can view the result after performing the FLATTEN operation by dumping the above relation.

In the above image, we can see that there are several entries for the same id, which means the text has been successfully tokenized.
Now, we have to analyze the sentiment of each tweet using the words in its text. We will rate each word from -5 to +5 according to its meaning, using the AFINN dictionary. You can download the dictionary from the below link:
AFINN dictionary
We will load the dictionary into Pig using the below statement:

dictionary = LOAD '/AFINN.txt' USING PigStorage('\t') AS (word:chararray, rating:int);

We can see the contents of the AFINN dictionary in the below screenshot.


Now, let’s perform a map-side join by joining the flat relation with the dictionary contents, using this command:

word_rating = JOIN flat BY word LEFT OUTER, dictionary BY word USING 'replicated';
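A replicated join simply keeps the small dictionary in memory and looks each word up. The left outer semantics can be sketched in plain Python (illustrative sample entries and rows; the real AFINN file has about 2,500 tab-separated word/rating pairs): unmatched words get None, just as the Pig join produces NULL for them.

```python
# A few AFINN-style word/rating pairs held in memory (sample entries)
afinn = {"win": 4, "worth": 2, "bad": -3}

# (timezone, id, word) rows from the flat relation (sample values)
flat_rows = [("Chennai", 1, "time"),
             ("Chennai", 1, "win"),
             ("Chennai", 1, "vouchers")]

# Left outer join: every row is kept; unmatched words get None,
# mirroring the NULLs of the Pig left outer join
word_rating = [(tz, tid, word, afinn.get(word)) for tz, tid, word in flat_rows]
print(word_rating)
```

This is why the join must be left outer: dropping unmatched words would silently discard most of the tweet.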

By using the describe statement, we can see the schema of the word_rating relation, which holds the joined content of the flat relation (time_zone, id, word) and the dictionary rating for each word.

In the above screenshot, we can see that the dictionary relation has been joined with the flat relation. When a word in the flat relation matches a word in the dictionary, the join yields that word’s rating; otherwise, it yields NULL.
Now we will extract the time_zone, id, and the rating of each word into a relation named rating.

rating = FOREACH word_rating GENERATE flat::timezone AS time_zone, flat::id AS id, dictionary::rating AS rate;


We can see the schema of the relation rating by describing it. Refer to the below screenshot for the same.

Now, in order to calculate the sentiment of a whole tweet from the ratings of its individual words, we need to perform a GROUP BY operation so that the word ratings are grouped by tweet id.

word_group = GROUP rating BY id;


We have now grouped the word ratings by tweet id, so that all the words of a tweet fall into the same group.
Next, let’s perform the average operation on the ratings, which gives an average rating between -5 and +5 for each tweet, using the below command:

avg_rate = FOREACH word_group GENERATE FLATTEN(rating.time_zone) AS place, AVG(rating.rate) AS tweet_rating;

Here we perform the FLATTEN operation on time_zone because, after grouping, the time_zone is repeated for every word of the tweet. By flattening, we un-nest the time_zone and get a new row for every entry.
The output received after dumping the relation can be seen in the below screenshot.

Now, let’s perform grouping on the time_zone so that all the tweets of a time zone are accumulated in one place.

grp = GROUP avg_rate BY place;

We now have the time zones and the ratings of the tweets in each time zone. So, we can perform the average operation on these ratings to get, for each time zone, the average rating of its people for the topic.

fin = FOREACH grp GENERATE group, AVG(avg_rate.tweet_rating);
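The two grouping steps together amount to a two-level average, which can be sketched in plain Python (illustrative sample values; Pig’s AVG skips NULL ratings, mirrored here by skipping None):

```python
from collections import defaultdict

# (time_zone, id, rate) rows from the rating relation (sample values;
# None stands for the NULLs produced by the left outer join)
rating = [("Chennai", 1, 4), ("Chennai", 1, None), ("Chennai", 2, 2),
          ("London", 3, -3), ("London", 3, None)]

# Step 1: group by tweet id and average the word ratings
# (AVG ignores NULLs, so None entries are skipped)
per_tweet = defaultdict(list)
tweet_tz = {}
for tz, tid, rate in rating:
    tweet_tz[tid] = tz
    if rate is not None:
        per_tweet[tid].append(rate)
avg_rate = [(tweet_tz[tid], sum(r) / len(r)) for tid, r in per_tweet.items()]

# Step 2: group by time zone and average the per-tweet ratings
per_zone = defaultdict(list)
for tz, tweet_rating in avg_rate:
    per_zone[tz].append(tweet_rating)
fin = {tz: sum(v) / len(v) for tz, v in per_zone.items()}
print(fin)
```

With the sample rows above, Chennai averages its two tweets (4 and 2) to 3.0, while London’s single tweet gives -3.0.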


We can view the final result by dumping the above relation.

In the above screenshot, we can see each time_zone and its average rating on the topic, based on Sentiment Analysis.
We hope this post helped you learn how to perform time-zone-wise Sentiment Analysis. Stay tuned to our blog for updates on Big Data and other technologies. Got a question for us? Please mention it in the comments section and we will address it at the earliest. Keep visiting our website Acadgild for more updates on Big Data and other technologies. Click here to learn Big Data Hadoop Development.

