
Sentiment Analysis on Tweets with Apache Hive Using AFINN Dictionary

July 15

In this post, we will discuss how to perform sentiment analysis on tweets from Twitter using Hive. In our previous post, we discussed how to perform sentiment analysis on tweets using Pig.

We have collected the tweets from Twitter using Flume; you can refer to this post to learn how to collect tweets from Twitter in real time.

As the tweets coming in from Twitter are in JSON format, we need to load them into Hive using a JSON input format. We will use the Cloudera Hive JSON SerDe for this purpose.

You can download the Cloudera JSON SerDe from the link below:

hive-serdes-1.0-SNAPSHOT.jar

After downloading the Cloudera JSON SerDe, we need to copy the jar file into the lib directory of our Hive installation. We then need to add the jar file to Hive as shown below:

Syntax:

ADD JAR <path of the jar file>;
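For example, if the jar was copied into Hive's lib directory, the command would look like the following (the path below is an assumption; adjust it to your own installation):

```sql
-- Hypothetical path; replace with the actual location of the jar on your machine
ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
```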


After successfully adding the Jar file, we need to create a Hive table to store the Twitter data.

For performing sentiment analysis, we need the tweet_id and tweet_text, so we will create a Hive table that extracts the id and text fields from the tweets using the Cloudera JSON SerDe.

Below is one of the tweets we have collected:

{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":"FilmFan","truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":689085590822891521,"in_reply_to_user_id_str":"6048122","timestamp_ms":"1453125782100","in_reply_to_status_id":null,"created_at":"Mon Jan 18 14:03:02 +0000 2016","favorite_count":0,"place":null,"coordinates":null,"text":"@filmfan hey its time for you guys follow @acadgild To #AchieveMore and participate in contest Win Rs.500 worth vouchers","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[],"hashtags":[{"text":"AchieveMore","indices":[56,68]}],"user_mentions":[{"id":6048122,"name":"Tanya","indices":[0,8],"screen_name":"FilmFan","id_str":"6048122"},{"id":2649945906,"name":"ACADGILD","indices":[42,51],"screen_name":"acadgild","id_str":"2649945906"}]},"is_quote_status":false,"source":"<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck<\/a>","favorited":false,"in_reply_to_user_id":6048122,"retweet_count":0,"id_str":"689085590822891521","user":{"location":"India ","default_profile":false,"profile_background_tile":false,"statuses_count":86548,"lang":"en","profile_link_color":"94D487","profile_banner_url":"https://pbs.twimg.com/profile_banners/197865769/1436198000","id":197865769,"following":null,"protected":false,"favourites_count":1002,"profile_text_color":"000000","verified":false,"description":"Proud Indian, Digital Marketing Consultant,Traveler, Foodie, Adventurer, Data Architect, Movie Lover, Namo Fan","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Bahubali","profile_background_color":"000000","created_at":"Sat Oct 02 17:41:02 +0000 
2010","default_profile_image":false,"followers_count":4467,"profile_image_url_https":"https://pbs.twimg.com/profile_images/664486535040000000/GOjDUiuK_normal.jpg","geo_enabled":true,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","follow_request_sent":null,"url":null,"utc_offset":19800,"time_zone":"Chennai","notifications":null,"profile_use_background_image":false,"friends_count":810,"profile_sidebar_fill_color":"000000","screen_name":"Ashok_Uppuluri","id_str":"197865769","profile_image_url":"http://pbs.twimg.com/profile_images/664486535040000000/GOjDUiuK_normal.jpg","listed_count":50,"is_translator":false}}

The tweet is in nested JSON format. From this tweet we will extract id, which is the tweet_id, and text, which is the tweet_text.

Our tweets are stored in the ‘/user/flume/tweets/‘ directory of HDFS.


Now, let's create an external table in Hive in the same directory where our tweets are present, i.e., '/user/flume/tweets/', so that the tweets in this location are automatically available to the Hive table.

The command for creating a Hive table to store id and text of the tweets is as follows:

create external table load_tweets(id BIGINT, text STRING) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/flume/tweets';

We can check the schema of the table using the below command:

describe load_tweets;


The describe output shows that the created Hive table has two columns, id and text.

We can view the tweet_id and tweet_text which are present in the table by using the below command:

select * from load_tweets;

The output shows that the tweet_id and tweet_text have been loaded successfully into the table.

Next, we will split the text into words using the split() UDF available in Hive. Since split() returns an array of values, we will create another Hive table that stores the tweet_id and the array of words.

create table split_words as select id as id,split(text,' ') as words from load_tweets;


We can see the schema of the table by using the ‘describe’ command.


Now, we can view the contents of the table by using the below command:

select * from split_words;


Next, let's split each word inside the array into a new row. For this, we need a UDTF (User Defined Table Generating Function). Hive has a built-in UDTF called explode, which extracts each element from an array and creates a new row for it.

Now, let’s create another table which can store id and word.

create table tweet_word as select id as id,word from split_words LATERAL VIEW explode(words) w as word;

Note: Syntax for LATERAL VIEW explode UDTF is as follows:

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*

In general, explode has a limitation: it cannot be combined with other columns in the same SELECT statement. So we use LATERAL VIEW in conjunction with explode, which lets us select other columns (here, the id) alongside the exploded values.
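As a small standalone illustration (using a hypothetical literal array, not our tweet tables), exploding an array alongside another column looks like this:

```sql
-- t is a one-row subquery with a literal array;
-- LATERAL VIEW explode() produces one output row per array element
SELECT t.id, w.word
FROM (SELECT 1 AS id, array('hive', 'is', 'great') AS words) t
LATERAL VIEW explode(t.words) w AS word;
```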

We can see the schema of the table by using the ‘describe’ command.


The describe output shows that each element of the array will now be stored as a string in the word column. We can see the contents of the table by using the following command:

select * from tweet_word;


The output shows that each word in the array has been placed in its own row.

Let's use a dictionary called AFINN to calculate the sentiments. AFINN is a dictionary of around 2,500 English words, each rated between -5 (most negative) and +5 (most positive) according to its meaning.

We will create a table to load the contents of AFINN dictionary. You can download the dictionary from the below link:

AFINN dictionary

create table dictionary(word string,rating int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
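The dictionary file is plain text with one word per line, the word and its rating separated by a tab, which is what the FIELDS TERMINATED BY '\t' clause above expects. A few illustrative entries (values as found in the AFINN-111 list):

```
abandon	-2
awesome	4
bad	-3
love	3
```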


Now, let’s load the AFINN dictionary into the table by using the following command:

LOAD DATA INPATH '/AFINN.txt' INTO TABLE dictionary;

We have placed the AFINN.txt file in the root directory of HDFS.


We can view the contents of the dictionary table by using this command:


select * from dictionary;


Now, we will join the tweet_word table with the dictionary table so that each word is paired with its rating.

create table word_join as select tweet_word.id, tweet_word.word, dictionary.rating from tweet_word LEFT OUTER JOIN dictionary ON (tweet_word.word = dictionary.word);


We can see the schema of the table by describing it.


The rating column has been added alongside the id and the word. Whenever a word in a tweet matches a word in the dictionary, that word receives the dictionary's rating; otherwise the rating is NULL.

Let’s view the contents of the table by using the below command:

select * from word_join;


Now we will perform a GROUP BY on the tweet_id so that all the words of one tweet are grouped together. Then, we will take the average of the ratings of each tweet's words to obtain the average rating of each tweet.

select id, AVG(rating) as rating from word_join GROUP BY word_join.id order by rating DESC;

In the above command, we calculate the average rating of each tweet from the ratings of its individual words and arrange the tweets in descending order of rating. Note that AVG() ignores NULL values, so only the words that were found in the dictionary contribute to a tweet's score.
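As an optional extension (not part of the original walkthrough), the average score could also be materialized and mapped to a coarse sentiment label; the table name tweet_sentiment below is a hypothetical choice:

```sql
-- Hypothetical extension: store the average score plus a sentiment label per tweet.
-- Tweets whose score is NULL (no dictionary matches) fall through to 'neutral'.
create table tweet_sentiment as
select id,
       AVG(rating) as rating,
       CASE WHEN AVG(rating) > 0 THEN 'positive'
            WHEN AVG(rating) < 0 THEN 'negative'
            ELSE 'neutral' END as sentiment
from word_join
GROUP BY id;
```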


The output lists each tweet_id along with its average rating.

Hope this post was helpful in calculating the sentiments of tweets using Hive. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
