Spark With Elasticsearch

In this blog post, let’s take a detailed look at integrating Spark with Elasticsearch.

What is Apache Spark?

Apache Spark is a cluster computing technology designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computation, such as interactive queries and stream processing.
Spark’s key feature is in-memory cluster computing, which increases the processing speed of an application. Spark covers a wide range of workloads, including batch applications, iterative algorithms, interactive queries, and streaming. Besides supporting all of these workloads in one system, it reduces the management burden of maintaining separate tools.

What is Elasticsearch?

Elasticsearch is a real-time, open-source, full-text search and analytics engine that is accessible through a RESTful web service interface.
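
To give a feel for the RESTful interface, here is a minimal Python sketch that builds (but does not send) a full-text search request. The host and port `localhost:9200`, the index name `blog_posts`, and the field name `title` are assumptions made for illustration only.

```python
import json
import urllib.request

# Assumptions for illustration: Elasticsearch listens on localhost:9200
# and "blog_posts" is a hypothetical index name.
ES_URL = "http://localhost:9200"
INDEX = "blog_posts"


def build_search_request(index, field, text):
    """Build (but do not send) a full-text search request for one field."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(INDEX, "title", "spark")
print(request.get_method(), request.full_url)
# Sending it needs a running cluster: urllib.request.urlopen(request)
```

The request body is plain JSON, which is why any language with an HTTP client can talk to Elasticsearch.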

Elasticsearch – Characteristics

Here are some of the characteristics of Elasticsearch:

  • Uses schema-less JavaScript Object Notation (JSON) documents to store data.
  • Built in the Java programming language, so it is platform-independent and runs on different platforms.
  • Enables users to explore very large amounts of data at very high speed.

Elasticsearch – Key Concepts

Here are some of the key concepts of Elasticsearch. Let’s look at each one briefly:

  • Node − A single running instance of Elasticsearch. A single physical or virtual server can accommodate multiple nodes, depending on its resources such as RAM, storage, and processing power.
  • Cluster − A collection of one or more nodes. It provides collective indexing and search capabilities across all the nodes for the entire data set.
  • Index − A collection of different types of documents and document properties. An index uses shards to improve performance.
  • Type/Mapping − A collection of documents that share a set of common fields and live in the same index. For example, if an index contains the data of a social networking application, there can be one type for user profile data, another for messaging data, and another for comments data.
  • Document − A collection of fields defined in JSON format. Every document belongs to a type, resides inside an index, and is associated with a unique identifier (UID).
  • Shard − A horizontal subdivision of an index. Each shard contains all the properties of a document but holds fewer JSON objects than the index as a whole. This horizontal separation makes a shard an independent unit that can be stored on any node.
  • Replicas − Elasticsearch allows a user to create replicas of their indexes and shards. Replication not only increases data availability in case of failure but also improves search performance, because searches can run in parallel on the replicas.
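
To make the shard concept more concrete, the sketch below shows how documents could be spread across the primary shards of an index by hashing each document’s ID. This is only an illustration: `zlib.crc32` is a stand-in hash (Elasticsearch itself uses a Murmur3 hash of the routing value), and the document IDs and shard count are made up.

```python
import zlib


def pick_shard(doc_id, num_primary_shards):
    """Map a document ID to a primary shard number.

    zlib.crc32 is only a stand-in hash for illustration; Elasticsearch
    itself uses a Murmur3 hash of the routing value.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard each one would land on."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards


doc_ids = [f"user-{i}" for i in range(10)]  # hypothetical document IDs
layout = distribute(doc_ids, num_primary_shards=3)
for shard, ids in layout.items():
    print(shard, ids)
```

Because the same ID always hashes to the same shard, a lookup by ID only needs to consult one shard, while a search can fan out to all shards (or their replicas) in parallel.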

Use Cases

Now let’s work through a use case of Spark with Elasticsearch. We will mainly use Spark Core and Spark SQL to perform computations on data stored in Elasticsearch (ES).
First, make sure the following are in place:

  1. The dependencies required to run the sample are gathered.
  2. The sample is defined in the build.sbt file.


Now, create a SparkContext and connect it to an Elasticsearch instance that is running locally.

When client-only mode is enabled, elasticsearch-hadoop routes all its requests (after node discovery, if enabled) through the client nodes within the cluster.
Note that this significantly reduces node parallelism, so it is disabled by default.
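
As a rough sketch of what such a connection setup involves, the Python helper below collects the relevant elasticsearch-hadoop settings as key/value pairs. The helper name `es_settings` and the default host and port are assumptions for a local instance; in PySpark, each pair would typically be passed to `SparkConf.set(key, value)` before the context is created.

```python
def es_settings(host="localhost", port=9200, client_only=False):
    """Return elasticsearch-hadoop settings as string key/value pairs."""
    return {
        # Where the (assumed local) Elasticsearch instance is listening.
        "es.nodes": host,
        "es.port": str(port),
        # Route requests through client nodes only; off by default
        # because it reduces node parallelism.
        "es.nodes.client.only": str(client_only).lower(),
    }


settings = es_settings()
for key, value in settings.items():
    print(f"{key}={value}")
```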
We are going to process a text file that contains a few details about the United Nations, with words separated by ‘ ‘ and lines separated by ‘\n’. The screenshot below shows what the input file looks like.

Let’s use this data as the input to our Spark application and count the occurrences of each distinct word in the file. The following code snippet contains the logic that processes the file, computes the word counts, and finally saves the result to Elasticsearch.
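
The original Spark snippet is not included in this text. As a stand-in, here is a plain-Python sketch of the same counting logic, without Spark or a live cluster, so the steps are easy to follow. The sample text and the document field names `word` and `count` are assumptions for illustration.

```python
from collections import Counter


def word_count(text):
    """Count each distinct word in space-separated, newline-delimited text."""
    counts = Counter()
    for line in text.split("\n"):      # lines are separated by '\n'
        for word in line.split(" "):   # words are separated by ' '
            if word:                   # skip empties from repeated spaces
                counts[word] += 1
    return counts


def to_documents(counts):
    """Shape each (word, count) pair as a JSON-like document for indexing."""
    return [{"word": w, "count": n} for w, n in sorted(counts.items())]


sample = "the United Nations\nthe Nations"  # made-up sample input
docs = to_documents(word_count(sample))
print(docs)
```

In a Spark job, the same steps would usually map onto `flatMap` (split lines into words), `map` (pair each word with 1), and `reduceByKey` (sum the counts), after which the elasticsearch-hadoop connector writes one document per word to the target index.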

The screenshot below shows how the data appears in Elasticsearch under the defined index.

We hope this article on Spark with Elasticsearch helps. We will be back with a new blog post shortly. For more details on Spark with Elasticsearch and an in-depth look at Spark, enroll in the big data and Hadoop training with Acadgild and become a successful Hadoop Developer.

prateek

An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a Subject-matter expert in the field of Big Data, Hadoop ecosystem, and Spark.
