
Hive Use case – Counting Hashtags


In this post, we will discuss the complex data types in Hive, such as struct and array. We will use Twitter data as an example to explain these complex types and to find the popular hashtags in the tweets.

As the tweets coming in from Twitter are in JSON format, we need to load them into Hive using a JSON input format. We will use the Cloudera Hive JSON SerDe for this purpose.

You can download the Cloudera JSON SerDe from the link below:

https://drive.google.com/open?id=0ByJLBTmJojjzMlhUWGxVMWtXdEU

After downloading the Cloudera JSON SerDe, copy the jar file into the lib directory of your Hive installation. We then need to add the jar file to Hive as shown below:

Syntax:

ADD JAR 'path of the jar file';
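
For example, assuming the jar was copied into Hive's lib directory (the path and jar file name below are only placeholders; use the actual location on your machine), the command would look like this:

ADD JAR /usr/lib/hive/lib/cloudera-json-serde.jar;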

After successfully adding the Jar file, let’s create a Hive table to store the Twitter data.

To count the hashtags, we need the tweet id and the hashtag text, so we will create a Hive table that extracts the id and the hashtag text from the tweets using the Cloudera JSON SerDe.

Below is one of the tweets we have collected:

{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":"FilmFan","truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":689085590822891521,"in_reply_to_user_id_str":"6048122","timestamp_ms":"1453125782100","in_reply_to_status_id":null,"created_at":"Mon Jan 18 14:03:02 +0000 2016","favorite_count":0,"place":null,"coordinates":null,"text":"@filmfan hey its time for you guys follow @acadgild To #AchieveMore and participate in contest Win Rs.500 worth vouchers","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[],"hashtags":[{"text":"AchieveMore","indices":[56,68]}],"user_mentions":[{"id":6048122,"name":"Tanya","indices":[0,8],"screen_name":"FilmFan","id_str":"6048122"},{"id":2649945906,"name":"ACADGILD","indices":[42,51],"screen_name":"acadgild","id_str":"2649945906"}]},"is_quote_status":false,"source":"<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck<\/a>","favorited":false,"in_reply_to_user_id":6048122,"retweet_count":0,"id_str":"689085590822891521","user":{"location":"India ","default_profile":false,"profile_background_tile":false,"statuses_count":86548,"lang":"en","profile_link_color":"94D487","profile_banner_url":"https://pbs.twimg.com/profile_banners/197865769/1436198000","id":197865769,"following":null,"protected":false,"favourites_count":1002,"profile_text_color":"000000","verified":false,"description":"Proud Indian, Digital Marketing Consultant,Traveler, Foodie, Adventurer, Data Architect, Movie Lover, Namo Fan","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Bahubali","profile_background_color":"000000","created_at":"Sat Oct 02 17:41:02 +0000 2010","default_profile_image":false,"followers_count":4467,"profile_image_url_https":"https://pbs.twimg.com/profile_images/664486535040000000/GOjDUiuK_normal.jpg","geo_enabled":true,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","follow_request_sent":null,"url":null,"utc_offset":19800,"time_zone":"Chennai","notifications":null,"profile_use_background_image":false,"friends_count":810,"profile_sidebar_fill_color":"000000","screen_name":"Ashok_Uppuluri","id_str":"197865769","profile_image_url":"http://pbs.twimg.com/profile_images/664486535040000000/GOjDUiuK_normal.jpg","listed_count":50,"is_translator":false}}

The tweet is in nested JSON format. From this tweet we will extract the id, which is the tweet id, and the hashtag text, which is the hashtag.

Our tweets are stored in the '/user/flume/tweets/' directory of HDFS, as shown below.
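
For reference, the files in this directory can be listed from within the Hive shell itself (the exact file names will depend on how Flume was configured):

dfs -ls /user/flume/tweets/;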

Now, let's create an external table in Hive on the same directory where our tweets are present, i.e., '/user/flume/tweets/', so that the tweets already present in this location are automatically available through the Hive table.

The command for creating a Hive table to store the id and hashtag text of the tweets is as follows:

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  entities STRUCT<hashtags:ARRAY<STRUCT<text:STRING>>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

Struct – a collection of named elements that may be of different types.

Array – an ordered collection of elements; all elements in the array must be of the same type.
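
As a quick illustration of how these complex types are declared in Hive (the table and column names below are made up for illustration and are not part of our Twitter schema):

CREATE TABLE complex_types_demo (
  user_info STRUCT<name:STRING, followers:INT>,  -- struct: named fields of different types
  tags ARRAY<STRING>                             -- array: ordered elements, all of the same type
);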

Let's take a closer look at the entities portion of the tweet:

"entities":{"symbols":[],"urls":[],"hashtags":[{"text":"AchieveMore","indices":[56,68]}]}

Here, entities is a struct in which hashtags is an array of structs, and the text field lives inside those inner structs.

In the JSON, we can identify structs by the opening and closing curly braces { } and arrays by the opening and closing square brackets [ ].

Let's check the schema of the table using the command describe tweets;
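
Based on the CREATE EXTERNAL TABLE statement above, the output of describe tweets; should look roughly like this:

id          bigint
entities    struct<hashtags:array<struct<text:string>>>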

Now, we can view the contents of the table using the command select * from tweets;


In the query output, we can see that the tweet_id and the hashtags of each tweet have been loaded successfully.
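
For the sample tweet shown earlier, the corresponding row would look roughly like this (one row per tweet, with the struct rendered in JSON-like form):

689085590822891521    {"hashtags":[{"text":"AchieveMore"}]}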

Now, from this struct, we need to extract only the text of the hashtags array. We can achieve this with the following command:

create table hashtags as select id as id,entities.hashtags.text as words from tweets;

Here, we are creating a table named 'hashtags', in which we store the tweet_id and the array of hashtag text values.

The hashtag text has now been extracted successfully. We can view the contents of the new table using the command below:

select * from hashtags;
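
Taking the sample tweet as a reference again, a row in this table would look roughly like the following (this particular tweet contains a single hashtag, but a tweet may contain several):

689085590822891521    ["AchieveMore"]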

Since a tweet can contain two or more hashtags, we need to extract each hashtag into its own row. To do that, we need a UDTF, which generates a new row for each value inside an array.

Next, let's emit each element of the array as a new row. For this we use a UDTF (User Defined Table-generating Function). Hive has a built-in UDTF called explode, which extracts each element from an array and creates a new row for it.
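
On its own, explode can be used like this to turn the array column into rows (a minimal illustration; note that it returns only the exploded column):

select explode(words) as hashtag from hashtags;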


Now, let's create another table that stores the id and the hashtag text, using the command below:

create table hashtag_word as select id as id,hashtag from hashtags LATERAL VIEW explode(words) w as hashtag;

Note: The syntax for LATERAL VIEW with a UDTF is as follows:

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*

In general, the explode UDTF has a limitation: it cannot be combined with other columns in the same SELECT statement. So we use LATERAL VIEW in conjunction with explode, which lets us select other columns alongside the exploded values.
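
To make the limitation concrete, a query like the one below, which mixes explode with another column in the same SELECT, is rejected by Hive, whereas the LATERAL VIEW form used above works:

select id, explode(words) from hashtags;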

We can see the schema of the table by using the 'describe' command.
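
Given how hashtag_word was created, the output of describe hashtag_word; should show roughly:

id         bigint
hashtag    string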

From the schema, we can see that the array of values has been converted into a plain string column. We can see the contents of the table by using the following command:

select * from hashtag_word;

In the query output, we can see that each hashtag from a tweet has been extracted as a separate row.

Now, let's run a query to calculate the number of times each hashtag appears.

select hashtag, count(hashtag) from hashtag_word group by hashtag;
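
If we only want to see the most popular hashtags first, the same query can be extended with an ORDER BY and a LIMIT (the limit of 10 here is arbitrary):

select hashtag, count(hashtag) as cnt from hashtag_word group by hashtag order by cnt desc limit 10;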


The output shows each hashtag and the number of times it appears in the Twitter data we have. So we have successfully counted the popular hashtags in Twitter data using Hive.

We hope this post was helpful in understanding the complex data types in Hive and in determining the popular hashtags in Twitter data.

Keep visiting our website Acadgild for more updates on Big Data and other technologies. Click here to learn Big Data Hadoop Development.
