All CategoriesBig Data Hadoop & Spark - Advanced

Determining Popular Hashtags in Twitter Using Pig

In our previous post we had seen how to perform Sentiment Analysis on Twitter data using Pig. In this post we will be discussing how to find out the most popular hashtags in the tweets.

You can refer to this post to know how to get tweets from Twitter using Flume.

We have collected and stored the tweets in HDFS. The tweets are located in the following location of the HDFS: /user/flume/tweets/

The data from Twitter is in ‘Json’ format, so a Pig JsonLoader is required to load the data into Pig. You need to download the required jars for the JsonLoader from the below link:

Pig JsonLoader Jar Files

Next, register the downloaded jars in Pig by using the following commands:

REGISTER '/home/kiran/Desktop/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/home/kiran/Desktop/elephant-bird-pig-4.1.jar';
REGISTER '/home/kiran/Desktop/json-simple-1.1.1.jar';

You can refer to the below screen shot for the same.

Note: You need to provide the path of the jar file accordingly.

After registering the required jars, we can now write a Pig script to perform Sentiment Analysis.

Below is a sample tweets collected for this purpose:

{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":"FilmFan","truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":689085590822891521,"in_reply_to_user_id_str":"6048122","timestamp_ms":"1453125782100","in_reply_to_status_id":null,"created_at":"Mon Jan 18 14:03:02 +0000 2016","favorite_count":0,"place":null,"coordinates":null,"text":"@filmfan hey its time for you guys follow @acadgild To #AchieveMore and participate in contest Win Rs.500 worth vouchers","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[],"hashtags":[{"text":"AchieveMore","indices":[56,68]}],"user_mentions":[{"id":6048122,"name":"Tanya","indices":[0,8],"screen_name":"FilmFan","id_str":"6048122"},{"id":2649945906,"name":"ACADGILD","indices":[42,51],"screen_name":"acadgild","id_str":"2649945906"}]},"is_quote_status":false,"source":"<a href=\"\" rel=\"nofollow\">TweetDeck<\/a>","favorited":false,"in_reply_to_user_id":6048122,"retweet_count":0,"id_str":"689085590822891521","user":{"location":"India ","default_profile":false,"profile_background_tile":false,"statuses_count":86548,"lang":"en","profile_link_color":"94D487","profile_banner_url":"","id":197865769,"following":null,"protected":false,"favourites_count":1002,"profile_text_color":"000000","verified":false,"description":"Proud Indian, Digital Marketing Consultant,Traveler, Foodie, Adventurer, Data Architect, Movie Lover, Namo Fan","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Bahubali","profile_background_color":"000000","created_at":"Sat Oct 02 17:41:02 +0000 2010","default_profile_image":false,"followers_count":4467,"profile_image_url_https":"","geo_enabled":true,"profile_background_image_url":"","profile_background_image_url_https":"","follow_request_sent":null,"url":null,"utc_offset":19800,"time_zone":"Chennai","notifications":null,"profile_use_background_image":false,"friends_count":810,"profile_sidebar_fill_color":"000000","screen_name":"Ashok_Uppuluri","id_str":"197865769","profile_image_url":"","listed_count":50,"is_translator":false}}

The tweets are in nested Json format and consists of map data types. We need to load the tweets using JsonLoader which supports maps, so we are using elephant bird JsonLoader to load the tweets.

Below is the first Pig statement that is required to load the tweets into Pig:

load_tweets = LOAD '/user/flume/tweets/' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;

When we dump the above relation, we can see that all the tweets have been loaded successfully.

Now, let’s extract the id and the hashtag from the above tweets and the Pig statement for doing this is as shown below

extract_details = FOREACH load_tweets GENERATE FLATTEN(myMap#'entities') as (m:map[]),FLATTEN(myMap#'id') as id;

In the tweet, the hashtag is present in the map object entities, which can be seen in the below image.

Since the hashtags are inside the map entities, we have extracted the entities as map[ ] data type. The schema of the relation extract_details can be viewed using the below command:

describe extract_details;

Now, from the entities, we have to extract the hashtags which is again a map. So we will extract the hashtags as map[ ] data type as well.


hash = foreach extract_details generate FLATTEN(m#'hashtags') as(tags:map[]),id as id;

The extracted hashtags can be viewed by dumping the above relation.

In the above image, we can see that the hashtags object has been extracted successfully. The extracted hashtags is also a map data type, which can see by describing the relation. You can refer to the below image for the same.

Now, from the extracted hashtags, we need to extract text which contains the actual hashtag. This can be done using the following command:

txt = foreach hash generate FLATTEN(tags#'text') as text ,id;

Here, we have extracted the text which starts with # and named it with an alias name text.

We can see the extracted hashtags text in the below screen shot:

In the above image, we can see the hashtag’s text and the tweet_id from which it has originated.

Now, we will group the relation by hashtag’s text by using the below relation:

grp = group txt by text;

We have successfully grouped by hashtag’s text. We can see the schema by describing the relation.

The next thing to do is, count the number of times the hashtag is repeated by the user. This can be achieved using the below relation:

cnt = foreach grp generate group as hashtag_text, COUNT(txt.text) as hashtag_cnt:int;

Now we have the hashtags and its count in a relation as shown in the below screen shot.

We have now successfully performed hashtag count on Twitter data using Pig!

Hope this post has help you learn how to count hashtags. Keep visiting our website Acadgild for more updates on Big Data and other technologies. Click here to learn Big Data Hadoop Development.



  1. Thanks for sharing nice post!!!
    I follow the above steps and i was able to fetch the data from twitter but when i am going to analysis the data using pig as mentioned in this post so i am not getting any data by using dump command. steps which i followed are:
    1) Fetched data from twitter stored in hdfs:
    [[email protected] conf]$ hadoop fs -ls flume
    Found 9 items
    -rw-r–r– 3 cloudera cloudera 179237 2016-08-02 05:36 flume/FlumeData.1470141370000
    -rw-r–r– 3 cloudera cloudera 66274 2016-08-02 05:36 flume/FlumeData.1470141370001
    -rw-r–r– 3 cloudera cloudera 66497 2016-08-02 05:36 flume/FlumeData.1470141370002
    -rw-r–r– 3 cloudera cloudera 83746 2016-08-02 05:36 flume/FlumeData.1470141370003
    -rw-r–r– 3 cloudera cloudera 65313 2016-08-02 05:36 flume/FlumeData.1470141370004
    -rw-r–r– 3 cloudera cloudera 84880 2016-08-02 05:36 flume/FlumeData.1470141370005
    -rw-r–r– 3 cloudera cloudera 71532 2016-08-02 05:36 flume/FlumeData.1470141370006
    -rw-r–r– 3 cloudera cloudera 68419 2016-08-02 05:36 flume/FlumeData.1470141370007
    -rw-r–r– 3 cloudera cloudera 64983 2016-08-02 05:36 flume/FlumeData.1470141370008
    2) in pig grunt shell executed below commands:
    register /home/cloudera/Desktop/Jars/elephant-bird-hadoop-compat-4.1.jar;
    register /home/cloudera/Desktop/Jars/elephant-bird-pig-4.1.jar;
    register /home/cloudera/Desktop/Jars/json-simple-1.1.1.jar;
    load_tweets = LOAD ‘/user/cloudera/flume’ USING com.twitter.elephantbird.pig.load.JsonLoader(‘-nestedLoad’) AS myMap;
    dump load_tweets;
    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    2.0.0-cdh4.7.0 0.11.0-cdh4.7.0 cloudera 2016-08-02 06:17:36 2016-08-02 06:18:23 UNKNOWN
    Job Stats (time in seconds):
    JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
    job_201608020416_0003 1 0 13 13 13 13 0 0 0 0 load_tweets MAP_ONLY hdfs://localhost.localdomain:8020/tmp/temp-128798977/tmp-1279927259,
    Successfully read 0 records (752029 bytes) from: “/user/cloudera/flume”
    Successfully stored 0 records in: “hdfs://localhost.localdomain:8020/tmp/temp-128798977/tmp-1279927259”
    Total records written : 0
    Total bytes written : 0
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0
    Job DAG:
    SO here dump command is executed successfully but its not showing any data…. i don’t know, what went wrong? …can you please help me out.

    1. Hi Pradeep,
      While checking the files in hdfs you have given the command hadoop fs -ls flume so according to this command the path is just flume. But while loading the tweets using pig script you have given the path as /user/cloudera/flume. Both the paths are different therefore you are not getting any output. So please change the input file path in the pig script as ‘flume’.

      1. Hi Satyam ,
        Thanks for the quick reply!!!
        I have tried with changing path as ‘flume’ but no luck. pl check below pig command.
        load_tweets = LOAD ‘flume’ USING com.twitter.elephantbird.pig.load.JsonLoader(‘-nestedLoad’) AS myMap;

        1. Hi Pradeep,
          I am facing the same problem, getting 0 records in output. Did you figure this out?
          Any help would be appreciated!!
          Divya Chopra

          1. Hi Divya,
            Please check the language of the tweets you have collected, they should be in the English language. You can include the below property in your flume.conf file to collect the tweets in the English language.
            TwitterAgent.sources.Twitter.language = en

  2. hi
    i am not able to see the output
    grunt> REGISTER ‘/home/akki/ADBMS/PIG_JAR/elephant-bird-hadoop-compat-4.1.jar’;
    grunt> REGISTER ‘/home/akki/ADBMS/PIG_JAR/elephant-bird-pig-4.1.jar’;
    grunt> REGISTER ‘/home/akki/ADBMS/PIG_JAR/json-simple-1.1.1.jar’;
    grunt> load_tweets = LOAD ‘/user/flume/tweets/’ USING com.twitter.elephantbird.pig.load.JsonLoader(‘-nestedLoad’) AS myMap;
    2017-04-12 11:40:36,262 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – is deprecated. Instead, use fs.defaultFS
    grunt> extract_details = FOREACH load_tweets GENERATE FLATTEN(myMap#’entities’) as (m:map[]),FLATTEN(myMap#’id’) as id;
    2017-04-12 11:41:22,375 [main] WARN org.apache.pig.newplan.BaseOperatorPlan – Encountered Warning IMPLICIT_CAST_TO_MAP 2 time(s).
    [email protected]:~$ hadoop dfs -ls /user/flume/tweets
    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.
    17/04/12 11:43:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    Found 3 items
    -rw-r–r– 1 hduser supergroup 482269 2017-04-10 22:51 /user/flume/tweets/FlumeData.1491844888001
    -rw-r–r– 1 hduser supergroup 442813 2017-04-10 22:54 /user/flume/tweets/FlumeData.1491845062991
    -rw-r–r– 1 hduser supergroup 475834 2017-04-10 23:00 /user/flume/tweets/FlumeData.1491845396980
    [email protected]:~$

    1. Hi Akshay,
      Please check whether you are able to see the output in load_tweets relation, if you are unable to see then please check whether the tweets you have streamed consists of any junk data or you have got tweets some other languages other than english. If the tweets are of other languages, you need to stream the tweets again using flume, this time include the following property in your flume.conf file TwitterAgent.sources.Twitter.language = en. This property will filter out the tweets of English language.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles