Big Data Hadoop & Spark - Advanced

Apache Kafka and Spark Streaming Integration

This blog describes the integration between Kafka and Spark.
Apache Kafka is a pub-sub solution; where producer publishes data to a topic and a consumer subscribes to that topic to receive the data.  It is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, and very fast.
Apache Spark is an ecosystem that provides many components such as Spark core, Spark streaming, Spark SQL, Spark Mlib, etc. and these components are used to compute the results by making Resilient Distributed Datasets (RDDs) and putting them into RAM; hence their performance is very fast.
The basic integration between Kafka and Spark is omnipresent in the digital universe. But this blog shows the integration where Kafka producer can be customized to work as a producer and feed the results to spark streaming working as a consumer.
This Kafka and Spark integration will be used in multiple use cases in the upcoming blog series. Data used in this blog is dummy data, which is used to depict the integration between Kafka and Spark. Let us see some of the sample datasets used in this blog.
Schema of data

  • Name – Name of employee
  • Age – Age of employee
  • Place – Working place of employee
  • Salary – Salary of employee
  • Department – The department in which he/she works

Sample rows/records of data look like:
(“Curious”,40,”Bangalore”,90000,”Data science”)
(“Dear”,38,”Mumbai”,100000,”Business consultant”)
Now let us see the step by step procedure for Kafka and Spark Integration.
Step 1 – Create a Kafka topic.
kafka/bin/ –zookeeper <IP:port> –create –topic <topic_name> –partitions <numeric value> –replication-factor <numeric value>

Step 2 – Verify whether created topic exists or not.
kafka/bin/ -zookeeper <IP:port>  —list

Step 3 Start Kafka server using the below command.
/kafka/bin/ /kafka/config/
Step 4 – We will write a Kafka customized producer class, that will accept the data and become the producer. Spark can also fetch data from this producer.

Step 5 – We will write a sample Java program to feed these data into a Kafka topic. As a matter of fact, we can write the logic in any program to make data available to Kafka. Our focus is to get the data from Kafka into Spark. After this step, create a jar file. If you are using eclipse/maven, use command “mvn package”.

import test.kafka.KafkaProducerSrvc;
public class FileParserSrvc {
	/*Reference to the kafka producer class, which is imported*/
KafkaProducerSrvc kafkaProducer = null ;
                          kafkaProducer = new KafkaProducerService("myTopic");
                          }catch(Exception e)
       	try {
			File file = new File("test.txt");
			FileReader fileReader = new FileReader(file);
			BufferedReader bufferedReader = new BufferedReader(fileReader);
			String line;
			while ((line = bufferedReader.readLine()) != null) {
			System.out.println("Contents of file:");
		} catch (IOException e) {

Step 6 – Since we have already seen that the Kafka producer sends the input line by line to the Spark Streaming (consumer). Spark Streaming accepts the input in batch intervals (for example, batch interval of 10 seconds) and make the batches of input for this interval. And then the Spark engine works on this batch of input data and sends the output data to further pipeline for processing.
The Spark Streaming consumer code can be written in any language such as Scala, Java, Python, etc., but we prefer Scala.

Step 7 – To test, create a jar file for the complete project.
If you are using sbt, then perform the following steps:

  1. Go to the project root directory, and run the command “sbt assembly”, which will create the jar file.
  2. Transfer the jar file to the cluster/server where Kafka server, Zookeeper server, Spark server are running. Then, run the following commands:
  • To run Kafka producer (Java program) à java –jar KafkaProducerSrvc-1.0.jar
  • To run Spark Streaming program à/spar/bin/spark-submit – -class test.FromKafkaToSparkStreaming  –master <node_address> FromKafkaToSparkStreaming-1.0.jar

After executing step 7, we can see that each input line is connected to Spark Streaming end and getting printed line by line.

Hope this blog section helped you understand the step by step process to integrate Kafka and spark. stay tuned for more technology related blogs and Enroll for the Big Data and Hadoop Training to become a successful big data developer.


One Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles