In this blog post, we discuss an industry scenario that employs Spark to process data in real time. The source of the data is Apache Flume. Flume is a distributed service that can move large amounts of data and can handle many forms of it. Industries use Flume to process real-time log data.
Features of Apache Flume
Apache Flume is robust and reliable thanks to its tunable reliability and recovery mechanisms. It also uses a simple, extensible data model that allows for online analytic applications. In this scenario, Flume sends the data to the Spark ecosystem, where it is received and processed.
The Architecture of Flume:
For more on Apache Flume, visit https://flume.apache.org/
Spark Streaming is a component of the Spark ecosystem that enables scalable, high-throughput, fault-tolerant stream processing of live data. The source of this data can be any of the following: Kafka, Flume, TCP sockets, etc. The data can be processed using complex algorithms expressed with high-level functions like map, filter, flatMap, reduce, join, window, etc.
The Architecture of Spark Streaming:
For more on Spark Streaming, visit https://spark.apache.org/docs/latest/streaming-programming-guide.html
Integration of Flume with Spark Streaming
There are two ways to integrate Flume with Spark Streaming:
Push-based approach: Flume is designed to push data between Flume agents. In this method, Spark Streaming sets up a receiver that acts as an Avro agent for Flume, and Flume pushes the data to it.
Pull-based approach: In this method, instead of pushing data directly to Spark Streaming, Flume runs a custom sink. Flume sends the data to this sink, where it stays buffered until Spark Streaming uses a reliable Flume receiver to pull it from there. A transaction completes only after the data has been received and replicated by Spark Streaming.
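On the Spark side, the two approaches map to two entry points in FlumeUtils. The following is a minimal sketch, assuming the spark-streaming-flume module of Spark 2.x is on the classpath; the object and method names FlumeApproaches, pushStream and pullStream are hypothetical and only wrap the library calls.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

object FlumeApproaches {
  // Push-based: Spark starts an Avro receiver on host:port,
  // and Flume's Avro sink pushes events to it.
  def pushStream(ssc: StreamingContext, host: String, port: Int): ReceiverInputDStream[SparkFlumeEvent] =
    FlumeUtils.createStream(ssc, host, port)

  // Pull-based: the Flume agent runs the custom Spark sink on host:port,
  // and Spark polls that sink for buffered events.
  def pullStream(ssc: StreamingContext, host: String, port: Int): ReceiverInputDStream[SparkFlumeEvent] =
    FlumeUtils.createPollingStream(ssc, host, port)
}
```

For the pull-based variant, the Flume agent would also need the sink type org.apache.spark.streaming.flume.sink.SparkSink in its configuration. The rest of this post uses only the push-based call.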
Real-Time Case Study
Now let's see the integration of Spark Streaming with Flume using the push-based approach. Consider use case 1:
Get live tweets from Twitter using Flume, then transfer them to Spark Streaming for processing and analysis.
To learn how to stream Twitter data using Flume, visit:
Overview Of Configuration File
In the configuration file, the name of the agent is MyAgent, the name of the source is Twitter, and the name of the sink is Avrosink. The Twitter source needs its type, consumerKey, consumerSecret, accessToken, and accessTokenSecret. You can get these credentials from Twitter. To learn how, visit:
The configuration of the sink can be seen below:
- Type: avro
- Hostname: localhost
- Port number: 1177
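Put together, the agent configuration might look like the sketch below. The agent, source, and sink names and the sink settings come from this post. The channel name MemChannel, its capacities, and the use of Flume's bundled org.apache.flume.source.twitter.TwitterSource are assumptions, and the credential values are placeholders to replace with your own.

```properties
# Components of the agent (MemChannel is an assumed channel name)
MyAgent.sources = Twitter
MyAgent.channels = MemChannel
MyAgent.sinks = Avrosink

# Twitter source: replace the placeholders with your own credentials
MyAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
MyAgent.sources.Twitter.consumerKey = <your consumer key>
MyAgent.sources.Twitter.consumerSecret = <your consumer secret>
MyAgent.sources.Twitter.accessToken = <your access token>
MyAgent.sources.Twitter.accessTokenSecret = <your access token secret>
MyAgent.sources.Twitter.channels = MemChannel

# Avro sink: pushes events to the Spark Streaming receiver
MyAgent.sinks.Avrosink.type = avro
MyAgent.sinks.Avrosink.hostname = localhost
MyAgent.sinks.Avrosink.port = 1177
MyAgent.sinks.Avrosink.channel = MemChannel

# In-memory channel between source and sink (sizes are assumptions)
MyAgent.channels.MemChannel.type = memory
MyAgent.channels.MemChannel.capacity = 10000
MyAgent.channels.MemChannel.transactionCapacity = 100
```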
Spark Streaming receives the data using this hostname and port number. Flume has now been configured according to the use case. Use the following commands to run this test:
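A typical way to start a Flume agent from the Flume home directory is sketched here. The file name conf/twitter.conf is hypothetical; use the path where you saved your configuration.

```shell
# Start the agent named MyAgent and log to the console.
# conf/twitter.conf is a hypothetical file name.
flume-ng agent --conf conf --conf-file conf/twitter.conf --name MyAgent -Dflume.root.logger=INFO,console
```

With the push-based approach, the Avro sink can deliver events only once the Spark Streaming receiver is listening on the configured port.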
Shifting focus to Spark Streaming, let us begin by creating an sbt project in the IntelliJ IDE, with the build.sbt file configured as follows:
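A build.sbt along these lines should work as a starting point. The project name and the Scala and Spark versions are assumptions; match them to your own cluster. Note that the spark-streaming-flume module exists only for Spark 2.x and earlier, since Flume support was removed in Spark 3.0.

```scala
// build.sbt: project name and versions below are assumptions
name := "FlumeSparkStreaming"
version := "0.1"
scalaVersion := "2.11.8"

val sparkVersion = "2.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-flume" % sparkVersion
)
```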
We have to provide the Spark Streaming and Flume dependencies. Spark Core and Spark SQL should also be included, so that we can convert the DStream into RDDs or DataFrames. We can then create a Spark session as follows:
After creating the session, we can get the Spark context from the session instance, as shown here:
Next, we create a StreamingContext instance called ssc, which accepts the SparkContext instance. The batch interval is five seconds. In other words, tweets are gathered for five seconds into a batch before the batch is sent for processing. A filter keeps only the tweets that match keywords such as Hadoop, Big Data, etc. To create the stream that transfers data from Flume to Spark, we use the FlumeUtils class, as it provides multiple APIs. The API that creates the stream is "createStream".
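A minimal sketch of the session setup, assuming local mode for testing. The application name and the object name SessionSketch are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SessionSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: a receiver occupies one thread, so at least two are
    // needed to leave one free for processing (assumption: local testing).
    val spark = SparkSession.builder()
      .appName("FlumeSparkStreaming") // hypothetical application name
      .master("local[2]")
      .getOrCreate()

    println(s"Running Spark ${spark.version}")
    spark.stop()
  }
}
```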
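The context is exposed as a field on the session. A short sketch, with the same assumed application name and local master as before:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object ContextSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FlumeSparkStreaming") // hypothetical application name
      .master("local[2]")
      .getOrCreate()

    // The SparkContext that backs this session
    val sc: SparkContext = spark.sparkContext
    println(sc.appName)

    spark.stop()
  }
}
```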
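These steps can be sketched end to end as below. The hostname and port come from the sink settings in this post. The object name FlumeTweets, the keyword list, and the helper matchesKeywords are hypothetical, and the sketch assumes each Flume event body is UTF-8 text (for example, a JSON tweet), which depends on the Twitter source you use.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeTweets {
  // Keywords of interest (assumed list)
  val keywords: Seq[String] = Seq("hadoop", "big data")

  // True when the text mentions any keyword, ignoring case
  def matchesKeywords(text: String): Boolean = {
    val lower = text.toLowerCase
    keywords.exists(k => lower.contains(k))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FlumeTweets")
      .master("local[2]")
      .getOrCreate()

    // Batch interval of five seconds
    val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

    // Push-based receiver: must match the Avro sink's hostname and port
    val flumeStream = FlumeUtils.createStream(ssc, "localhost", 1177)

    // Decode each event body (assumed to be UTF-8 text)
    val tweets = flumeStream.map(e => new String(e.event.getBody.array(), "UTF-8"))

    // Keep only tweets that mention a keyword, and print a sample per batch
    tweets.filter(t => matchesKeywords(t)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```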
The hostname and port number here are the same as those in the Flume configuration file. If we print the tweets, we can expect the following output:
If we get the above output, it means the configurations are working fine, and that Flume is sending data that Spark Streaming is receiving correctly.
Once we have the tweets, we can count them over a window of, say, ten or sixty seconds. We need to decide, however, how to analyze the tweets. Because the streams are of the DStream type, we can work with them as both RDDs and DataFrames.
We can get all the tweets that begin with a # (hashtag) in the following manner:
Let us convert the DStream into a DataFrame using Spark SQL. We use Spark SQL because it is a widely used processing tool in the Spark ecosystem.
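One way to do this is sketched below. It reads the step as extracting the words inside each tweet that start with #, which is an assumption about the intent. The object name Hashtags and its methods are hypothetical, and the input is assumed to be a DStream of tweet text.

```scala
import org.apache.spark.streaming.dstream.DStream

object Hashtags {
  // Split on whitespace and keep the words that start with '#'
  def extract(text: String): List[String] =
    text.split("\\s+").filter(_.startsWith("#")).toList

  // Apply the same extraction to every tweet in the stream
  def fromStream(tweets: DStream[String]): DStream[String] =
    tweets.flatMap(t => extract(t))
}
```

The extraction logic lives in a plain function so that it can be checked without a running stream.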
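A common pattern for this conversion is to turn each batch RDD into a DataFrame inside foreachRDD. The sketch below assumes a DStream of tweet text. The object name TweetsToDataFrame, the column name text, the view name tweets, and the sample query are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

object TweetsToDataFrame {
  def register(tweets: DStream[String]): Unit =
    tweets.foreachRDD { rdd =>
      // Reuse (or lazily create) the session for this application
      val spark = SparkSession.builder()
        .config(rdd.sparkContext.getConf)
        .getOrCreate()
      import spark.implicits._

      // One column named "text", one row per tweet in this batch
      val df = rdd.toDF("text")
      df.createOrReplaceTempView("tweets")

      // Example query: tweets in this batch that contain a hashtag
      spark.sql("SELECT text FROM tweets WHERE text LIKE '%#%'").show(false)
    }
}
```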
We can count the number of tweets in the last sixty seconds by using the window function as follows:
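A minimal sketch of such a count, assuming a DStream of tweet text and the five-second batch interval used earlier. The object name WindowedCount and the ten-second slide interval are assumptions.

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

object WindowedCount {
  // Number of tweets seen in the last 60 seconds, recomputed every 10 seconds.
  // Both durations must be multiples of the batch interval.
  def lastMinute(tweets: DStream[String]): DStream[Long] =
    tweets.window(Seconds(60), Seconds(10)).count()
}
```

Calling WindowedCount.lastMinute(tweets).print() before ssc.start() would print one count per slide interval. This sketch uses window followed by count rather than countByWindow, because countByWindow requires a checkpoint directory to be set on the StreamingContext.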
The result is as follows:
We hope this blog post helped you understand Flume with Spark. Stay tuned for more Big Data notes. Enroll for the Big Data and Hadoop training with Acadgild and become a successful Hadoop Developer.