
Spark Streaming using TCP Socket

July 19

Learn Spark Streaming concepts by walking through a demonstration with a TCP socket.
We also recommend going through this link to learn how to run Spark in Eclipse.
Spark Streaming is an extension of the core Spark API that enables processing of live data streams. In layman’s terms, Spark Streaming provides a way to consume a continuous stream of data, and some of its features are listed below.

  • Enables scalable, high-throughput, and fault-tolerant data processing.
  • Supports many input sources like TCP sockets, Kafka, Flume, HDFS/S3, etc.
  • Uses a micro-batch architecture.

Now, let us see the internal working of Spark Streaming.

[Figure: Spark Streaming data flow, from the Apache Spark documentation]

  • Spark Streaming continuously receives live input data streams and divides the data into multiple batches.
  • These new batches are created at regular time intervals, called batch intervals. The application developer can set batch intervals according to their requirement.
  • Any data that arrives during an interval gets added to the batch.
  • At the end of each batch interval, the Spark engine processes the completed batch, as sketched below.
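
To make this concrete, here is a minimal sketch of how the batch interval is fixed when the StreamingContext is constructed (the app name BatchIntervalDemo is illustrative, and the snippet assumes it runs inside a main method or the Spark shell):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The batch interval is chosen once, at construction time; here, 10 seconds.
val conf = new SparkConf().setAppName("BatchIntervalDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
// Records received between time t and t + 10s form one batch,
// which the Spark engine processes at the end of the interval.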

Spark Streaming is built on an abstraction called Discretized Stream, or DStream, which represents a continuous sequence of data arriving over time. Internally, each DStream is represented as a sequence of RDDs. A DStream is created from a StreamingContext.
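
Because each DStream is internally a sequence of RDDs, we can also drop down to the RDD level when needed. Below is a minimal sketch (assuming lines is a DStream, like the one created in the demo that follows) using foreachRDD, which hands us the RDD behind each batch:

// foreachRDD is an output operation that exposes the RDD of every batch.
lines.foreachRDD { rdd =>
  println(s"This batch contains ${rdd.count()} records")
}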

Restriction

Only one StreamingContext can be active per JVM at a time.
Once a DStream is created, it allows two kinds of operations: transformations, which produce a new DStream, and output operations, which write data out to an external system.
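
For instance, here is a minimal sketch of both kinds of operations (assuming lines is a DStream of strings, like the one in the demo below; the names errorLines and lineLengths are illustrative). filter and map are transformations that lazily define new DStreams, while print and saveAsTextFiles are output operations that trigger the actual processing:

// Transformations: each lazily produces a new DStream.
val errorLines = lines.filter(_.contains("error"))
val lineLengths = lines.map(_.length)
// Output operations: these force computation for every batch.
errorLines.print()                       // print the first elements of each batch
lineLengths.saveAsTextFiles("lengths")   // write each batch out as text files
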
Now, let us see a demo of Spark Streaming from a TCP socket: we will count the words in text data received from a data server listening on a TCP socket.

import org.apache.spark._
import org.apache.spark.streaming._

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Note the lowercase name: naming this val SparkConf would shadow the class.
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // Create a local StreamingContext with a batch interval of 10 seconds.
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    /* Create a DStream that connects to a hostname and port, here localhost:9999.
       As stated earlier, the DStream is created from the StreamingContext. */
    val lines = ssc.socketTextStream("localhost", 9999)
    // Transformations: split each line into words and count each word per batch.
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    // Output operation: print the first elements of every batch to the console.
    wordCounts.print()
    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}

In parallel, in another terminal, run nc -lk 9999 to start Netcat as a data server. This terminal acts as the server where we continuously feed words, and our Spark Streaming code counts the number of occurrences (within each 10-second batch interval).
We are running the code from the IDE; check the screenshot displayed below. As the interval has been set to 10 seconds, in the first batch interval "acadgild, acadgild, hi" was captured, and the count was given as:
(acadgild,2)
(hi,1)
Hope this post has been helpful in understanding Spark Streaming. In case of any queries, feel free to comment below, and we will get back to you at the earliest. Keep visiting www.acadgild.com for more updates on our courses.
