
Beginner's Guide for Apache Flink

In this post, we discuss Apache Flink, how to install it on a single-node cluster, and why it is a contender among today's Big Data frameworks.

Let’s begin with the basics.

What is Apache Flink?

Apache Flink is an open-source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

Flink was initially developed at a technical university in Berlin and was later added to Apache’s incubator. It is often described as an alternative to Hadoop and Spark that handles both batch and stream processing. Rather than being limited to Hadoop’s map and reduce stages, it processes data in memory, which can yield a substantial performance gain.

How is Flink a contender for Hadoop and Spark?

We have heard the term ‘in-memory’ before. Yes, in Spark! Spark is famous for its in-memory processing model, its various APIs, and its unified engine for different processing tools. Flink follows the same approach: it is also building its own unified engine by combining different types of tools (we will look at Flink’s technology stack later on). Flink and Spark have many similarities in their technology stacks; almost all the features of Spark are present in Flink.

However, Flink owes its popularity to its core feature: it is built primarily for stream processing, and it has gained importance over Spark for this very reason. Spark has its own streaming engine, but it is not truly real-time (it is usually described as near real-time). It relies on micro-batch processing, which collects records into small batches before processing them and therefore has higher latency. To get low latency where every millisecond is important, banking organizations use Storm (a distributed stream-processing framework).

Both Flink and Storm can deliver low latency, so why prefer Flink to Storm? With Flink, we get a technology stack comparable to Spark’s together with low-latency streaming and a variety of APIs. So Flink is a complete package.

So, how does Flink achieve high processing speed with in-memory processing? Flink implements its own memory management inside the JVM. As a result, applications can scale to data sizes beyond main memory and experience less garbage collection overhead.

Let’s take a look at the technology stack of Flink now.

The image below shows the complete technology stack of Apache Flink.

Flink includes several APIs for creating applications that use the Flink engine:

  1. DataStream API for unbounded streams embedded in Java and Scala.
  2. DataSet API for static data embedded in Java, Scala, and Python.
  3. Table API with an SQL-like expression language embedded in Java and Scala.

Flink also bundles libraries for domain-specific use cases:

  1. CEP, a complex event-processing library.
  2. Machine Learning library.
  3. Gelly, a graph processing API and library.

You can integrate Flink easily with other well-known open-source systems, both for data input and output, as well as deployment.

Now, let’s see how Apache Flink is installed in a single node cluster.

Apache Flink Installation

You can download Flink from here, based on your Hadoop version.

After downloading, open your terminal and extract the archive using the following command, as shown in the screenshot below.

tar -xvzf flink-1.0.3-bin-hadoop27-scala_2.10.tgz

You can now see that a folder named flink-1.0.3 has been created.
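If you want to confirm the folder name before (or after) extracting, the archive can be listed without unpacking it. The snippet below is a minimal sketch, assuming a standard tar with gzip support; top_level_dir is a hypothetical helper name introduced here for illustration, not something shipped with Flink.

```shell
# top_level_dir is a hypothetical helper (not part of Flink): it prints the
# first path component of the first entry in a gzipped tar archive.
top_level_dir() {
  tar -tzf "$1" | head -n 1 | cut -d/ -f1
}

# Archive name taken from the tar command above; adjust it to your download.
ARCHIVE="flink-1.0.3-bin-hadoop27-scala_2.10.tgz"

if [ -f "$ARCHIVE" ]; then
  echo "Archive unpacks into: $(top_level_dir "$ARCHIVE")"
else
  echo "Archive $ARCHIVE not found in the current directory"
fi
```

Run it from the directory that holds the downloaded file; if the archive is elsewhere, set ARCHIVE to its full path.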

Next, let’s add Flink to the PATH in the .bashrc file. Open the file in your home directory using the command gedit ~/.bashrc.

Now, add the lines below, which contain the path to flink-1.0.3, to the .bashrc file.

# set FLINK_HOME (replace /path_to with the directory where you extracted Flink)
export FLINK_HOME=/path_to/flink-1.0.3
export PATH=$PATH:$FLINK_HOME/bin

After adding the above lines, save and close the file. In the terminal, run the command source ~/.bashrc so that the current shell picks up the changes.
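To check that the change took effect, you can test whether Flink's bin directory is actually on the PATH. This is a minimal sketch: on_path is a hypothetical helper written for this post, and /path_to/flink-1.0.3 is the same placeholder used above, so substitute your real location.

```shell
# on_path is a hypothetical helper (not part of Flink): it succeeds if the
# given directory appears as an exact entry in the PATH variable.
on_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Fall back to the placeholder path if FLINK_HOME is not set in this shell.
FLINK_HOME="${FLINK_HOME:-/path_to/flink-1.0.3}"

if on_path "$FLINK_HOME/bin"; then
  echo "Flink bin directory is on PATH"
else
  echo "Flink bin directory is NOT on PATH; re-check the lines added to .bashrc"
fi
```

Wrapping PATH in colons on both sides lets the pattern match the first and last entries the same way as the ones in the middle.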

Because Flink's bin directory is now on the PATH, we can start the Flink daemons from any directory in the terminal using the command below.

start-local.sh

The Flink dashboard is available on port 8081 (http://localhost:8081 on a local installation), as shown in the screenshot below. Using this web UI, you can track all the jobs scheduled in Flink.
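If you are working on a machine without a browser, the same check can be made from the terminal. The snippet below is a minimal sketch, assuming curl is installed and that the daemons run on the same machine; dashboard_up is a hypothetical helper name, not a Flink command.

```shell
# dashboard_up is a hypothetical helper (not part of Flink): it succeeds only
# if an HTTP request to the given URL returns a successful response.
dashboard_up() {
  curl -sf -o /dev/null --max-time 3 "$1"
}

# Port 8081 comes from the text above; localhost assumes you run this on the
# same machine as the Flink daemons.
URL="http://localhost:8081"

if dashboard_up "$URL"; then
  echo "Flink dashboard is reachable at $URL"
else
  echo "No response from $URL; check that start-local.sh ran without errors"
fi
```

When you are done, the same bin directory provides stop-local.sh to shut the local daemons down.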

We hope this post has been helpful in understanding Flink and how it is installed on a single-node cluster. If you have any queries, feel free to comment below and we will get back to you as soon as possible.

Stay tuned for our next post, where we will look at how to write and run a word count program in Flink.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
