
Streaming Twitter Data Using Flume

We all know that Hadoop is a framework that helps in storing and processing huge datasets, and that the Sqoop component is used to transfer structured data between traditional RDBMS databases and HDFS.
But what if we want to load semi-structured or unstructured data into the HDFS cluster, or capture live streaming data generated from sources such as Twitter and weblogs? Which component of the Hadoop ecosystem is useful for this kind of job? The answer is Flume.
Learning Flume will help you collect a large amount of data from different sources and store it in the Hadoop cluster.
What is Apache Flume?
Apache Flume is a Hadoop ecosystem component used to collect, aggregate, and move large amounts of log data from many different sources to a centralized data store.
It is an open-source component designed to work reliably in a distributed environment, collecting data from the configured sources and delivering it to the specified destination.
Flume Architecture
Before looking at how the Flume tool works, it is important to understand the Flume architecture.

Flume is composed of the following components.
Flume Event: It is the basic unit of data transported inside Flume (typically a single log entry). It contains a byte-array payload that is transported from the source to the destination, and it can be accompanied by optional headers.
A Flume event will be in the following structure.

Header | Byte payload


Flume Agent: An independent Java virtual machine (JVM) daemon process that receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent).
Source: The component of a Flume agent that receives data from data generators such as Twitter, Facebook, or weblogs and transfers it to one or more channels in the form of Flume events.
The external source sends data to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can receive Avro data from Avro clients or from other Flume agents that send data through an Avro sink; similarly, a Thrift Flume source can receive data from a Thrift sink, from a Flume Thrift RPC client, or from Thrift clients written in any language generated from the Flume Thrift protocol.
Channel: Once the Flume source receives an event, it stores the event in one or more channels, which buffer it until it is consumed by a sink. A channel acts as a bridge between sources and sinks, and channels are designed to work with any number of sources and sinks.
Sink: It delivers the data to centralized stores such as HDFS and HBase.
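To make the source, channel, and sink wiring concrete, here is a minimal agent configuration sketch. It is not the Twitter configuration used later in this post: the agent name (a1), the netcat source, and the port number are assumptions chosen purely for illustration.

# A hypothetical minimal agent; all names (a1, r1, c1, k1) and the port are illustrative
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read newline-separated events from a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory until the sink consumes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log each event (in a real flow this would be HDFS or HBase)
a1.sinks.k1.type = logger

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1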
Streaming Twitter Data
To stream data from Twitter into our cluster, we need the following prerequisites.


  • Twitter account
  • Hadoop cluster

If both prerequisites are available, we can move on to the next step.
Step 1:
Log in to your Twitter account.

Step 2:
Go to the following link and click the ‘Create New App’ button.
https://apps.twitter.com/app

 
Step 3:
Enter the necessary details.

Step 4:
Accept the developer agreement and select the ‘create your Twitter application’ button.

Step 5:
Select the ‘Keys and Access Token’ tab.

Step 6:
Copy the consumer key and the consumer secret code.

Step 7:
Scroll down further and select the ‘create my access token’ button.

Now, you will receive a message stating that you have successfully generated your application access token.

Step 8:
Copy the Access Token and the Access Token Secret.
 

Follow Step 9 and Step 10 to install Apache Flume.
Step 9: Download the Flume tar file from the link below and extract it.
https://drive.google.com/drive/u/0/folders/0B1QaXx7tpw3SWkMwVFBkc3djNFk
Right-click the downloaded Flume tar file and select ‘Extract Here’ to untar the Flume directory, then add the path of the extracted Flume directory to the .bashrc file as shown below.
NOTE: keep the path the same as the location of the extracted directory.

After setting the path of the Flume directory, save and close the .bashrc file. Then, in the terminal, type the command below to reload the .bashrc file.
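The original screenshot is not reproduced here, so the following is a rough sketch of the .bashrc entries; the extraction path (~/apache-flume-1.6.0-bin) is an assumption and must be replaced with wherever you actually extracted Flume.

# Assumed extraction location; adjust to your own path
export FLUME_HOME=~/apache-flume-1.6.0-bin
export PATH=$PATH:$FLUME_HOME/bin

# Reload .bashrc in the current terminal
source ~/.bashrc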


Step 10:
Create a new file inside the conf directory of the extracted Flume directory.

Note: Make sure you have the below jars placed in your $FLUME_HOME/lib directory (they can be copied in as shown after this list):

  1. twitter4j-core-X.XX.jar
  2. twitter4j-stream-X.X.X.jar
  3. twitter4j-media-support-X.X.X.jar
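Assuming the downloaded jars are sitting in your Downloads folder (an assumption for illustration), they can be copied into place with a command such as:

# Copy the twitter4j jars into Flume's lib directory
cp ~/Downloads/twitter4j-*.jar $FLUME_HOME/lib/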

Step 11:
Copy the Flume configuration code from the link below and paste it into the newly created file.
https://drive.google.com/open?id=0B1QaXx7tpw3Sb3U4LW9SWlNidkk
Step 12:
Replace the Twitter API keys in the configuration file with the keys generated in Step 6 and Step 8, as sketched below.
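The linked configuration file is not reproduced here, but in typical TwitterSource examples the credential section looks roughly like the sketch below; the placeholder values are assumptions to be replaced with the keys you copied in Step 6 and Step 8.

# Twitter application credentials (replace the placeholders with your own keys)
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>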

Step 13:
We have to decide which keywords' tweet data should be collected from the Twitter application. You can change the keywords in the TwitterAgent.sources.Twitter.keywords property, as shown below.
In our example, we are fetching tweet data related to Hadoop, election, sports, cricket, and Big Data.
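For the example keywords above, the property would look something like this (the exact spelling of the keywords is up to you):

TwitterAgent.sources.Twitter.keywords = hadoop, election, sports, cricket, bigdata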
Step 14:
Open a new terminal and start all the Hadoop daemons before running the Flume command to fetch the Twitter data.
Use the ‘jps’ command to see the running Hadoop daemons.
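On a typical single-node installation, the daemons can be started with the standard Hadoop scripts (on older Hadoop 1.x setups, start-all.sh starts everything in one go):

start-dfs.sh     # starts NameNode, DataNode and SecondaryNameNode
start-yarn.sh    # starts ResourceManager and NodeManager
jps              # lists the running Java daemons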


Step 15:
Create a new directory inside the HDFS path where the Twitter tweet data should be stored.
hadoop dfs -mkdir -p /user/flume/tweets
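To confirm that the directory was created, you can list the parent path:

hadoop dfs -ls /user/flume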
 
Step 16:
To fetch data from Twitter, use the command below to stream the tweet data into the HDFS cluster path.
flume-ng agent -n TwitterAgent -f <location of created/edited conf file>

The above command starts fetching data from Twitter and streams it into the given HDFS path.
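If Flume reports that no configuration was found (see the comments below), it usually helps to pass the conf directory explicitly and enable console logging; the file name twitter.conf here is an assumption, so substitute the name of the file you created in Step 10.

flume-ng agent --conf $FLUME_HOME/conf -f $FLUME_HOME/conf/twitter.conf -n TwitterAgent -Dflume.root.logger=INFO,console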

Once the tweet data has started streaming into the given HDFS path, we can press ‘Ctrl+C’ to stop the streaming process.
Step 17:
To check the contents of the tweet data we can use the following command:
hadoop dfs -ls /user/flume/tweets

Step 18:
We can use the ‘cat’ command to display the tweet data inside the /user/flume/tweets/FlumeData.145* path.
hadoop dfs -cat /user/flume/tweets/<FlumeData file name>
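Since a single FlumeData file can be large, it may be more convenient to peek at just the first few events; the wildcard below assumes the default FlumeData file prefix, and the broken-pipe warning printed once head exits can be ignored.

hadoop dfs -cat /user/flume/tweets/FlumeData.* | head -n 5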


We can observe from the above output that we have successfully fetched Twitter data into our HDFS cluster directory. Once the tweets have been successfully stored in HDFS, you can manipulate the tweet data to fit the needs of your future projects by following the above steps.
Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies. Learn Big Data Hadoop from our expert mentors.


Prateek

An alumnus of the NIE Institute of Technology, Mysore, Prateek is an ardent data science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a subject-matter expert in the field of Big Data, the Hadoop ecosystem, and Spark.

28 Comments

  1. Don’t we need to configure the flume-env.sh file?
    I followed all the specified steps, but I received this error:
    No configuration found!!

  2. Yes,
    In flume-env.sh file, set JAVA_HOME according to your path.
    Example:
    JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.24

  3. Thanks for the step-by-step instructions, I appreciate your efforts. Is there an article which explains how to perform analysis after loading the data from Twitter? For example, how to understand the sentiment around these tweets, what to consider and what to ignore.

  4. Hi,
    I followed the steps given, but I am running into an authorization error:
    I have created the Twitter app and configured my flume.conf with the consumer key, access tokens, etc.
    Still, I am running into an authentication error.
    As per the error, I verified consumerKey, consumerSecret, accessToken, and accessTokenSecret and found them to be correct.
    ****************************
    16/09/03 04:48:33 ERROR twitter.TwitterSource: Exception while streaming tweets
    401:Authentication credentials (https://dev.twitter.com/pages/auth) were missing or incorrect. Ensure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.
    \n\n\nError 401 Unauthorized
    HTTP ERROR: 401
    Problem accessing ‘/1.1/statuses/sample.json?stall_warnings=true’. Reason:

        Unauthorized

    Relevant discussions can be found on the Internet at:
    http://www.google.co.jp/search?q=d0031b0b or
    http://www.google.co.jp/search?q=1db75513
    TwitterException{exceptionCode=[d0031b0b-1db75513], statusCode=401, message=null, code=-1, retryAfter=-1, rateLimitStatus=null, version=3.0.3}
    at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:177)
    at twitter4j.internal.http.HttpClientWrapper.request(HttpClientWrapper.java:61)
    ****************************
    Note: I am running same in AcadGild VM

    1. If this is the problem and your credentials are accurate (look at the spaces), then please ensure the machine clock time is accurate. Set the clock right and then start Flume; it will definitely work.
      Thank you

  5. The extracted data in hdfs folder /user/flume/tweets is showing Junk value. And I think that due to this in hive “select * from load_tweets” is throwing exception:
    Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character (‘O’ (code 79)): expected a valid value (number, String, array, object, ‘true’, ‘false’ or ‘null’)
    at [Source: [email protected]; line: 1, column: 2]

    1. HDFS Sink writes events to HDFS.
      Files will be stored at “hdfs.path” described in flume-config file.
      Ex:
      # Describe/Configure Sink
      TwitterAgent.sinks.HDFS.type=hdfs
      TwitterAgent.sinks.HDFS.hdfs.path=/user/cloudera/flume/tweetsinput

  6. Receiving this error after using the Twitter agent to extract data from Twitter.
    Any help?
    Relevant discussions can be found on the Internet at:
    http://www.google.co.jp/search?q=d0031b0b or
    http://www.google.co.jp/search?q=1db75513
    TwitterException{exceptionCode=[d0031b0b-1db75513], statusCode=401, message=null, code=-1, retryAfter=-1, rateLimitStatus=null, version=3.0.3}
    at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:177)
    at twitter4j.internal.http.HttpClientWrapper.request(HttpClientWrapper.java:61)
    at twitter4j.internal.http.HttpClientWrapper.get(HttpClientWrapper.java:89)
    at twitter4j.TwitterStreamImpl.getSampleStream(TwitterStreamImpl.java:176)
    at twitter4j.TwitterStreamImpl$4.getStream(TwitterStreamImpl.java:164)
    at twitter4j.TwitterStreamImpl$TwitterStreamConsumer.run(TwitterStreamImpl.java:462)
    17/05/18 10:17:41 INFO twitter4j.TwitterStreamImpl: Waiting for 240000 milliseconds
    17/05/18 10:21:41 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    17/05/18 10:21:42 INFO twitter4j.TwitterStreamImpl: 401:Authentication credentials (https://dev.twitter.com/pages/auth) were missing or incorrect. Ensure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.
    \n\n\nError 401 Unauthorized
    HTTP ERROR: 401
    Problem accessing ‘/1.1/statuses/sample.json?stall_warnings=true’. Reason:

        Unauthorized

    17/05/18 10:21:42 ERROR twitter.TwitterSource: Exception while streaming tweets
    401:Authentication credentials (https://dev.twitter.com/pages/auth) were missing or incorrect. Ensure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.
    \n\n\nError 401 Unauthorized
    HTTP ERROR: 401
    Problem accessing ‘/1.1/statuses/sample.json?stall_warnings=true’. Reason:

        Unauthorized

    Relevant discussions can be found on the Internet at:
    http://www.google.co.jp/search?q=d0031b0b or
    http://www.google.co.jp/search?q=1db75513
    TwitterException{exceptionCode=[d0031b0b-1db75513], statusCode=401, message=null, code=-1, retryAfter=-1, rateLimitStatus=null, version=3.0.3}
    at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:177)
    at twitter4j.internal.http.HttpClientWrapper.request(HttpClientWrapper.java:61)

    1. If this is the problem and your credentials are accurate (look at the spaces), then please ensure the machine clock time is accurate. Set the clock right and then start Flume; it will definitely work.
      Thank you

  7. Hi, thanks for this amazing tutorial.
    I followed what is described in this tutorial, but when I want to look at the flume files; I.e. when I execute the command ‘ hadoop dfs –cat /user/flume/tweets/ ‘ what I see is Chinese and question marks and illegible characters.
    Do you have any idea what it could be?

  8. Hi. Thanks for nice article. I could get data into HDFS. But seems like data is not in Text or JSON Format. Is this data stored in Avro format (which is binary and compressed). I am not able to view the contents of FlumeData using cat or vim. Please help.

  9. How do we know which time period the data being offered to us is from?
    If I want to access data from only the past 1 year, how do I specify that?

  10. Thanks for this beautiful article. But can you help me with the next steps, like the visualization process? How can I use these tweets, or what should I do to present them visually according to my needs?
    Please help me with this problem.

  11. twitter4j-core-X.XX.jar
    twitter4j-stream-X.X.X.jar
    twitter4j-media-support-X.X.X.jar
    From where will I find these jar files as mentioned in the blog?

    1. These jars are already present inside AcadgildVM for students of Acadgild.
      Others can also get the jars over the internet but need to check Hadoop ecosystem version compatibility.
