Big Data Hadoop & Spark

Spark Streaming using TCP Socket

Learn Spark Streaming concepts through a hands-on demonstration with a TCP socket.
We also recommend users go through this link to run Spark in Eclipse.
Spark Streaming is an extension of the core Spark API that enables processing of live data streams. In layman’s terms, Spark Streaming provides a way to consume a continuous stream of data; some of its features are listed below.

  • Enables scalable, high-throughput, and fault-tolerant data processing.
  • Supports many input sources like TCP sockets, Kafka, Flume, HDFS/S3, etc.
  • Uses a micro-batch architecture.

Now, let us see the internal working of Spark Streaming.

Figure: Spark Streaming data flow (source: Apache Spark documentation)

  • Spark Streaming continuously receives live input data streams and divides the data into multiple batches.
  • These new batches are created at regular time intervals, called batch intervals. The application developer can set batch intervals according to their requirement.
  • Any data that arrives during an interval gets added to the batch.
  • At the end of each batch interval, the Spark engine processes the completed batch.
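The micro-batching described above can be sketched in plain Scala, without Spark. The event timestamps, the 10-second interval, and all names below are illustrative, not part of the Spark API:

```scala
object BatchingSketch {
  // Group timestamped events into batches of `intervalMs` milliseconds,
  // the way Spark Streaming slices a live stream into micro-batches.
  def toBatches(events: Seq[(Long, String)], intervalMs: Long): Map[Long, Seq[String]] =
    events.groupBy { case (ts, _) => ts / intervalMs }        // batch index per event
          .map { case (idx, evs) => idx -> evs.map(_._2) }    // keep only the payloads

  def main(args: Array[String]): Unit = {
    // Events arriving at 1s, 4s, and 12s, with a 10-second batch interval:
    val events = Seq((1000L, "a"), (4000L, "b"), (12000L, "c"))
    // The first two events fall into batch 0, the third into batch 1.
    println(toBatches(events, 10000L))
  }
}
```

Every event that arrives within the same interval lands in the same batch, which is why a longer batch interval trades latency for larger, fewer batches.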

Spark Streaming is built on an abstraction called Discretized Stream, or DStream, which represents a sequence of data arriving over time. Internally, each DStream is represented as a sequence of RDDs. A DStream is created from a StreamingContext.

Restriction

We can only have one StreamingContext per JVM.
Once a DStream is created, it supports two kinds of operations: transformations and output operations.
Now, let us see a demo of Spark Streaming from a TCP socket. In this demo, we will count the words in text data received from a data server listening on a TCP socket.

import org.apache.spark._
import org.apache.spark.streaming._

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // Create a local StreamingContext with a batch interval of 10 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    /* Create a DStream that connects to a hostname and port, e.g. localhost:9999.
       As stated earlier, the DStream is created from the StreamingContext,
       which in turn is created from the SparkConf. */
    val lines = ssc.socketTextStream("localhost", 9999)
    // On this DStream (lines) we apply transformations and an output operation.
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}
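The flatMap/map/reduceByKey pipeline above can be tried on a plain Scala collection. This is a local sketch with groupBy standing in for reduceByKey; no Spark is involved:

```scala
object LocalWordCount {
  // Same shape as the DStream pipeline: split lines into words,
  // pair each word with 1, then sum the counts per word.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))
         .map(w => (w, 1))
         .groupBy(_._1)                                    // stand-in for reduceByKey
         .map { case (w, pairs) => w -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // counts: acadgild -> 2, hi -> 1
    println(countWords(Seq("acadgild acadgild hi")))
  }
}
```

In Spark, reduceByKey performs the same per-key aggregation, but distributed across the RDDs that back each batch of the DStream.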

In parallel, open another terminal and type nc -lk 9999 to run netcat as a data server.

This terminal acts as a server where we will continuously feed the words, and our Spark Streaming code will count the number of occurrences (in a batch interval of 10 sec).
We are running the code from the IDE.
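If you build the demo outside the IDE, the project needs the spark-core and spark-streaming libraries on the classpath. A minimal build.sbt sketch follows; the Scala and Spark version numbers are assumptions, so match them to your installation:

```scala
// build.sbt -- minimal sketch; adjust versions to your Spark installation
name := "NetworkWordCount"
scalaVersion := "2.12.15"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "3.3.0",
  "org.apache.spark" %% "spark-streaming" % "3.3.0"
)
```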

Since the batch interval is set to 10 seconds, the first batch captured acadgild, acadgild, hi, and the count was given as:
(acadgild,2)
(hi,1)
Hope this post has been helpful in understanding Spark Streaming. In case of any queries, feel free to comment below, and we will get back to you at the earliest. And keep visiting www.acadgild.com for more updates on the courses.
