Here you go. Master these 9 simple steps and you are good to go!
Why Spark & why should you go for it?
Apache Spark is one of the most active Apache projects, with more than 1,000 contributors working to improve its efficiency and stability.
Spark makes it easy for developers to build applications. It offers APIs in several languages: Java, Scala, Python, and R.
Apache Spark is a unified framework! There is no need to turn to an external tool for processing data, because Spark can handle multiple kinds of workloads on the same engine.
In the above picture, you can see the complete technology stack of workloads that Spark can handle. It can run SQL queries, power streaming applications, train models with its built-in machine learning library, and perform graph computation, all on the same engine.
Spark is faster! You do not need to wait long for jobs to complete. Because it processes data in memory, Apache Spark can be 10-100x faster than disk-based big data frameworks such as Hadoop MapReduce, depending on the workload.
Step 1: Understanding Apache Spark Architecture
Spark is an open-source distributed framework with a very simple architecture built from two kinds of nodes: a master node and worker nodes. Here is the architecture of Spark.
The driver program, shown on the master side, creates the SparkContext, and the worker nodes host the executors that run the tasks. As Spark is a distributed framework, data is partitioned across the worker nodes.
The actual execution happens in tasks on the worker nodes. In distributed computing, a job is split into stages, and each stage is split into tasks, the smallest units of work. In Spark, each worker node hosts one or more executors, which are JVM processes that carry out these tasks. Spark adds something extra here: a cache, which is where the concept of in-memory computing comes in. Each executor can keep data in RAM, so Spark can read intermediate data from memory rather than from disk. This is the feature behind the 10-100x speedup mentioned earlier.
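The division of labor described above can be sketched with a toy model in plain Python. This is not Spark's API: the function names are made up for illustration, a thread pool stands in for the executors, and the "driver" is just the function that splits the data, hands out one task per partition, and combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous partitions of nearly equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_job(data, task, combine, num_executors=3):
    """Toy 'driver': run one task per partition on a pool of toy 'executors'."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(task, partitions))
    # Only the small partial results travel back to the driver.
    return combine(partial_results)


total = run_job(list(range(1, 101)), task=sum, combine=sum)
print(total)  # 5050
```

Each partition is processed independently, which is what lets a real cluster spread the tasks over many machines.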
In the middle sits the cluster manager, which manages the nodes in the cluster. It allocates resources on the nodes and launches the executors on which Spark schedules its jobs. Spark is flexible here as well: it can run on three types of cluster managers. It can run on YARN (Hadoop’s native cluster manager), on Apache Mesos, or on its own standalone cluster manager. Spark can also run in local mode.
Step 2: Get hold of the Programming Language to develop Spark applications
As explained earlier, Spark offers APIs in Java, Scala, Python, and R, so programmers can choose the language they prefer for developing Spark applications. But here is something interesting for you!
Spark itself is written primarily in Scala (a language that combines object-oriented and functional programming), so most of Spark’s API functions look syntactically similar to Scala’s own. If you opt for Scala to develop your Spark applications, it will be easier for you.
Spark applications are somewhat more difficult to develop in Java than in the other supported languages.
Python is also very good for developing Spark applications, although it has often been considered less suited than Scala to production-level workloads.
Even SQL developers can work with Spark by running SQL queries through Spark SQL.
Step 3: Understanding Apache Spark’s key terms
Spark’s architectural terms are the key terms you need to know.
A cluster is a collection of machines connected to each other. Spark can also be installed in the cloud. Among these interconnected machines, one acts as the Spark master (which also serves as the cluster manager in a standalone cluster) and one runs the Spark driver.
The Spark master is the main node that schedules and monitors the jobs assigned to the workers. In a standalone cluster, the Spark master also acts as the cluster manager. Depending on the cluster mode, the Spark master acts as a resource manager that decides where tasks are executed inside the executors.
Spark workers receive commands from the Spark master and execute tasks according to its instructions. Workers contain the executors that execute the tasks. In general, a worker’s job is to launch its executors.
An executor is the component inside a worker that executes the tasks. It holds the resources required to execute a task, and can be thought of as a JVM process with some allocated cores and memory.
The Spark driver acts as the coordinator as soon as it receives information from the Spark master. It distributes the tasks to the executors and receives information back from the workers.
SparkContext can be described as the master of your Spark application. It allows the Spark driver to access the cluster through a resource manager, which can be any of the cluster managers: YARN, Mesos, or Spark’s own cluster manager. SparkContext provides many functions, such as getting the current configuration of the cluster for running or deploying the application, setting a new configuration, creating objects, scheduling jobs, canceling jobs, and many more.
SQLContext & HiveContext
The Spark SQL engine offers SQLContext to execute SQL queries. HiveContext is a superset of SQLContext with which you can run both Hive queries and SQL queries. Since Spark 2.0, both are wrapped by SparkSession, which serves as the single entry point.
Spark applications can be deployed in many ways, which are as follows:
Local: The Spark driver, worker, and executors run in the same JVM.
Standalone: The Spark driver can run on any node of the cluster, and the workers and executors each have their own JVM space to execute the tasks.
YARN client: The Spark driver runs on a separate client outside the YARN cluster. The workers are the NodeManagers and the executors are the NodeManagers’ containers.
YARN cluster: The Spark driver runs inside a YARN ApplicationMaster. The workers are the NodeManagers and the executors are the NodeManagers’ containers.
Mesos client: The Spark driver runs on a separate client outside the Mesos cluster. The workers are the slaves in the Mesos cluster and the executors are Mesos containers.
Mesos cluster: The Spark driver runs on one of the master nodes of the Mesos cluster. The workers are the slaves in the Mesos cluster and the executors are Mesos containers.
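On the command line, these deployment options come down to two spark-submit settings: --master, which names the cluster manager, and --deploy-mode, which decides whether the driver runs on the submitting client or inside the cluster. The helper below is only a sketch that assembles such a command. The function name, host names, ports, and application file are placeholders, not values from this post.

```python
# Master URL per cluster manager. Host names and ports are placeholders.
MASTER_URLS = {
    "local": "local[*]",                       # everything in one JVM
    "standalone": "spark://master-host:7077",  # Spark's own cluster manager
    "yarn": "yarn",                            # cluster found via Hadoop config
    "mesos": "mesos://mesos-host:5050",
}


def spark_submit_args(manager, deploy_mode, app_path):
    """Build a spark-submit argument list for a manager and deploy mode."""
    if manager not in MASTER_URLS:
        raise ValueError(f"unknown cluster manager: {manager}")
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")
    args = ["spark-submit", "--master", MASTER_URLS[manager]]
    if manager != "local":
        # "client": driver runs on the submitting machine.
        # "cluster": driver runs on a node inside the cluster.
        args += ["--deploy-mode", deploy_mode]
    return args + [app_path]


print(" ".join(spark_submit_args("yarn", "cluster", "my_app.py")))
# spark-submit --master yarn --deploy-mode cluster my_app.py
```

Not every combination is available for every language and Spark version, so check the documentation of the release you run before relying on a particular pairing.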
A Spark application in the cluster looks as follows:
Job Scheduling process
Here is the scheduling process and stages of a Spark application inside a cluster.
Step 4: Mastering the Storage systems used for Spark
Spark does not have its own storage system, so it depends on external storage systems such as HDFS (Hadoop Distributed File System), MongoDB, and Cassandra. Spark can also be integrated with many other file systems and databases.
Spark can also use S3 as its file system when the S3 authentication details are provided in its configuration.
It can also be integrated with many databases, such as HBase, MySQL, and MongoDB.
So you should also have solid knowledge of the file system or database you plan to use, particularly of how that storage system connects to Spark.
Step 5: Learning Apache Spark core in-depth
The core of Apache Spark is its RDDs; all the major features of Spark build on them. RDD stands for Resilient Distributed Dataset.
Resilient Distributed Datasets
A Resilient Distributed Dataset (RDD) is a simple, immutable, distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. In Spark core, all operations are performed on RDDs.
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
Let’s now look at the features of Resilient Distributed Datasets:
In Hadoop, we store the data as blocks on different data nodes. In Spark, we instead partition the RDDs and store the partitions on the worker nodes (data nodes), where they are computed in parallel across all the nodes.
In Hadoop, we need to replicate the data for fault recovery. In Spark, replication is not required for this purpose, because RDDs can rebuild lost data themselves.
RDDs load the data for us and are resilient, which means they can be recomputed.
RDDs support two types of operations: transformations, which create a new dataset from an existing RDD, and actions, which return a value to the driver program after performing a computation on the dataset.
RDDs keep track of the transformations applied to them (their lineage) and can be checkpointed periodically. If a node fails, the lost RDD partitions can be rebuilt on other nodes, in parallel.
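The recovery idea in the last point can be illustrated with a toy model in plain Python. It is not Spark code: ToyPartition is a made-up class, and it assumes the source data can always be read again from storage. Each partition remembers the ordered list of transformations that produced it, so a lost in-memory result can be rebuilt by replaying that list.

```python
class ToyPartition:
    """One partition of a toy RDD: source data plus its recorded lineage."""

    def __init__(self, source, lineage):
        self.source = source    # input slice, assumed re-readable from storage
        self.lineage = lineage  # ordered list of transformations
        self.cached = None      # computed rows held in memory; may be lost

    def compute(self):
        """Return the rows, replaying the lineage if the cached copy is gone."""
        if self.cached is None:
            rows = self.source
            for transform in self.lineage:
                rows = [transform(row) for row in rows]
            self.cached = rows
        return self.cached


lineage = [lambda x: x * 2, lambda x: x + 1]
partitions = [ToyPartition([1, 2], lineage), ToyPartition([3, 4], lineage)]
before = [p.compute() for p in partitions]

partitions[1].cached = None  # simulate a node losing its in-memory data
after = [p.compute() for p in partitions]
print(before == after)  # True
```

Only the lost partition is recomputed; the other one is served from memory, which is why this approach can be cheaper than keeping several replicas of every block.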
RDDs can be created in two different ways:
By referencing an external dataset in an external storage system, such as a shared file system, HDFS, HBase, MySQL, or any other data source.
By parallelizing a collection of objects (a list or a set) in the driver program.
Life cycle of a Spark program:
Some input RDDs are created from external data or by parallelizing the collection of objects in the driver program.
These RDDs are lazily transformed into new RDDs using transformations like filter() or map().
Any intermediate RDDs that will need to be reused are cached, which you request by calling cache() or persist().
Actions such as count() and collect() are launched to kick off a parallel computation, which Spark then optimizes and executes.
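The life cycle above can be mimicked on a single machine with a small class. ToyRDD below is a made-up stand-in, not Spark's RDD. It borrows the method names parallelize, map, filter, cache, count, and collect to show that transformations only record work, that actions trigger it, and that a cached result is computed once. The evaluations counter is an extra added for illustration.

```python
class ToyRDD:
    """Single-machine toy model of lazy transformations and eager actions."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing the rows
        self._persist = False
        self._cached = None
        self.evaluations = 0     # how often the rows were actually computed

    @classmethod
    def parallelize(cls, items):
        items = list(items)
        return cls(lambda: items)

    # Transformations: return a new ToyRDD at once, without running anything.
    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self._rows()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._rows() if pred(x)])

    def cache(self):
        self._persist = True
        return self

    def _rows(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        rows = self._compute()
        if self._persist:
            self._cached = rows
        return rows

    # Actions: force the computation and return a value to the caller.
    def count(self):
        return len(self._rows())

    def collect(self):
        return list(self._rows())


numbers = ToyRDD.parallelize(range(10))
evens = numbers.filter(lambda x: x % 2 == 0).cache()  # nothing runs yet
squares = evens.map(lambda x: x * x)                  # still nothing runs
lazy_evaluations = evens.evaluations                  # 0 so far

print(squares.collect())               # [0, 4, 16, 36, 64]
print(evens.count(), evens.evaluations)  # 5 1
```

The second action reuses the cached rows of evens, so the filter runs only once even though two actions depend on it.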
Step 6: Working with real-time data using Spark streaming
Using Spark, you can develop streaming applications easily. Spark provides its own streaming engine to process live data.
The Spark Streaming engine framework is as follows:
Spark Streaming needs an input source that continuously provides data to the streaming engine. The input sources are as shown in the above image, i.e., Kafka, Flume, Kinesis, HDFS/S3, Twitter, or any other data source.
Spark processes data in micro-batches: at every batch interval, Spark’s streaming engine receives the data that has arrived and processes it as one small batch. The batch interval can be as low as a fraction of a second.
After processing the data, Spark can store its results in any file system, database, or dashboard.
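Micro-batching itself is easy to picture without any streaming engine. The sketch below is plain Python, not Spark Streaming: the function name and the event list are invented for illustration. It assigns each timestamped event to the fixed-width interval it arrived in, so each interval can then be processed as one small batch.

```python
from collections import defaultdict


def micro_batches(events, batch_interval_ms):
    """Group (timestamp_ms, value) events into fixed-width time windows.

    Returns a list of (batch_start_ms, values) pairs in time order.
    """
    batches = defaultdict(list)
    for timestamp_ms, value in events:
        batch_start = (timestamp_ms // batch_interval_ms) * batch_interval_ms
        batches[batch_start].append(value)
    return sorted(batches.items())


# Hypothetical stream: (arrival time in milliseconds, payload).
events = [(200, "a"), (900, "b"), (1100, "c"), (2500, "d"), (2700, "e")]
for start, values in micro_batches(events, batch_interval_ms=1000):
    print(start, values)
# 0 ['a', 'b']
# 1000 ['c']
# 2000 ['d', 'e']
```

A shorter interval lowers latency but produces more, smaller batches to schedule, which is the main trade-off when choosing it.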
Step 7: Learn Spark SQL
Spark has its own SQL engine to run SQL queries. Spark SQL stores data in data frames. A data frame can be thought of as a structured RDD: a dataset that has a structure (a schema) is called a data frame. Data frames can be created from any of the supported languages, such as Scala, Java, and Python.
The Spark SQL engine is as follows:
After querying the data using Spark SQL, the result can be converted back into a Spark RDD.
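The round trip from plain rows to a structured, queryable table and back can be tried out without a cluster. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for the Spark SQL engine. The table name, column names, and rows are made up for illustration.

```python
import sqlite3

# Plain rows, as they might sit in an RDD, plus a schema that names the columns.
rows = [("alice", 34), ("bob", 45), ("carol", 29)]
schema = ("name TEXT", "age INTEGER")

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE people ({', '.join(schema)})")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# Query the structured data with SQL ...
cursor = conn.execute("SELECT name, age FROM people WHERE age > 30 ORDER BY age")
# ... and turn the result back into a plain collection of rows.
result = cursor.fetchall()
conn.close()

print(result)  # [('alice', 34), ('bob', 45)]
```

The schema is what makes the SQL query possible: without column names and types, the rows are just opaque tuples.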
Step 8: Learn Machine learning using MLlib
Spark has a built-in machine learning framework, MLlib, with which you can develop machine learning applications. MLlib contains many built-in algorithms for applying machine learning to your data, and RDDs can be passed directly to these algorithms.
Applications such as recommendation engines can be built on Spark with relatively little effort.
Using Spark’s MLlib, you can perform basic statistics such as correlations, sampling, hypothesis testing, and random data generation, and you can run algorithms such as classification and regression, collaborative filtering, and k-means.
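To make one of these basic statistics concrete, here is the Pearson correlation written out in plain Python. This is a single-machine sketch of the formula, not MLlib's implementation, and the function name is chosen for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)


print(round(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
print(round(pearson_correlation([1, 2, 3], [3, 2, 1]), 6))        # -1.0
```

On a cluster, the sums in this formula are what get computed per partition and then combined, which is why such statistics parallelize well.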
GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
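The Graph abstraction described here, a directed multigraph with properties on vertices and edges, can be pictured with ordinary Python collections. This is a toy data structure rather than the GraphX API. The vertex names, edge labels, and the out_degrees helper are invented for illustration.

```python
from collections import Counter

# Vertices: id -> property. Edges: (source id, destination id, property).
vertices = {1: "alice", 2: "bob", 3: "carol"}
edges = [
    (1, 2, "follows"),
    (1, 2, "messages"),  # parallel edge: allowed in a multigraph
    (2, 3, "follows"),
    (3, 1, "follows"),
]


def out_degrees(vertices, edges):
    """Count outgoing edges per vertex, including vertices with none."""
    counts = Counter(src for src, _dst, _prop in edges)
    return {vertex_id: counts.get(vertex_id, 0) for vertex_id in vertices}


print(out_degrees(vertices, edges))  # {1: 2, 2: 1, 3: 1}
```

Keeping vertices and edges as two separate collections mirrors the idea of building a graph on top of distributed datasets, where each collection can be partitioned independently.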
We hope this blog helped you understand the steps to master Apache Spark. Keep visiting our site www.acadgild.com for more details on Big Data and other technologies.