In this blog, we will discuss how to start Big Data development using Spark. To set the stage, let us look at what Big Data is, what other Big Data frameworks are available, and why Spark is a good choice.
What is Big data?
Big Data is the term used to denote very large volumes of data, and it is commonly characterized by 3 V's. A 4th V has since been added. The V's are as follows:
Volume is the primary characteristic of Big Data: huge datasets ranging from petabytes to zettabytes.
Variety stands for the types of data, whether structured, semi-structured, or unstructured.
Velocity stands for the speed or frequency at which incoming data arrives and must be processed.
Veracity stands for how trustworthy the collected data is. A lot of junk data is generated that is of no use to anyone.
So that was a brief look at what Big Data is. Now let us see who is generating this data and what it is used for.
What are the sources of Big Data?
Data is everywhere! These days everything is online, so data is generated everywhere. Every action leaves a record, and useful information can be extracted from those records.
Here are a few of the biggest sources of Big Data.
Social networking sites are among the largest data-generating sources. Other major sources include activity tracking of individuals and companies, public-sector data such as banks' day-to-day transaction records, data warehouse applications, and much more.
Those are just a few sources of Big Data. But what do we get from this data?
Nothing in this world is a waste! Everything has its use in some way. Big Data is useful for performing analytics, which derives important insights that can guide further decisions.
These insights can take forms such as the following.
Consider social networking sites: they are a powerful medium, and a company's online presence, and how it is tracked, now influences the value of the company.
Tracking a customer's activity in banks can help detect and prevent fraud.
Processing data in a data warehouse becomes easier with the introduction of Big Data frameworks.
Big Data is also useful in healthcare for the treatment of many diseases, including cancer.
These are just a few uses of Big Data.
Now let us see which frameworks can be used to process Big Data.
The most important Big Data framework available is Hadoop.
What is Hadoop?
Hadoop is a cluster computing framework developed to solve the problem of Big Data. It is an open-source, Java-based platform for handling Big Data, and it is a combination of two components: HDFS and MapReduce. HDFS stores the data, while MapReduce processes it.
HDFS is Hadoop's distributed file system. It presents a file-system interface much like a Linux file system and can store any type of data, whether structured, semi-structured, or unstructured, spreading it across the cluster in a distributed fashion. MapReduce then processes the data collected in HDFS as key-value pairs.
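To make the key-value model concrete, here is a toy, single-machine sketch of MapReduce-style word counting in plain Python. Real MapReduce distributes these map, shuffle, and reduce phases across a cluster; the function names here are just for illustration.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) key-value pair for every word
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by their key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big frameworks", "data frameworks"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'frameworks': 2}
```

On a real cluster, the map tasks run where the HDFS blocks live and the shuffle moves data over the network, but the logical flow is exactly this pipeline.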
But Hadoop can only do batch processing, which means it can only process data that was collected previously.
So, Hadoop handles the first two V's, Volume and Variety, efficiently. Now the question is: what about the 3rd V, Velocity?
Hadoop cannot process streaming data; stream processing is simply not part of it. To perform stream processing, another framework called Apache Storm was developed specifically for streaming data.
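To see how stream processing differs from batch processing, here is a toy sketch in plain Python (not Storm's actual API): counts are updated incrementally as each event arrives, instead of waiting for the whole dataset to be collected first.

```python
from collections import Counter

def process_stream(events):
    # Update state per event as it arrives (stream processing),
    # rather than after the full dataset is stored (batch processing)
    running = Counter()
    snapshots = []
    for event in events:
        running[event] += 1
        snapshots.append(dict(running))  # state after each event
    return snapshots

stream = ["click", "view", "click"]
print(process_stream(stream)[-1])  # {'click': 2, 'view': 1}
```

The key difference is that results are available after every event, which is what makes low-latency (high-Velocity) use cases possible.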
To add this power to Hadoop, such external frameworks were brought into the Hadoop family through the introduction of a new cluster manager called YARN.
There are two major versions of Hadoop, popularly known as Hadoop 1.x and Hadoop 2.x.
Hadoop 1.x is the very first version of Hadoop, built to handle Big Data. In the Hadoop 1.x architecture, HDFS and MapReduce are the two steps involved in processing the data. It processes data in batches, which is known as batch processing.
Hadoop 2.x was introduced to enhance the features of Hadoop 1.x and to overcome the problems of the previous version. It introduced a new component called YARN (Yet Another Resource Negotiator). With HDFS for storage and YARN for resource management, data can be processed by many different engines. Thanks to YARN, Hadoop can now include additional processing tools such as MapReduce, Hive, Pig, Storm, Spark, etc.
Hadoop is not restricted to those who know Java and can develop applications using MapReduce; it offers ease of use to everyone. A person who works with SQL can use Hadoop through an external tool called Hive, and a person who works on ETL can use Hadoop through an external tool called Pig. In the same way, Hadoop has made it easy to plug in many other external tools.
Here is a stack of alternatives to Hadoop's MapReduce.
• Apache Spark – An open-source cluster computing system that aims to make data analytics fast, both fast to run and fast to write.
• GraphLab – A redesigned fully distributed API, HDFS integration and a wide range of new machine learning toolkits.
• HPCC Systems – (High-Performance Computing Cluster) is a massive parallel-processing computing platform that solves Big Data problems.
• Dryad – A research project investigating programming models for writing parallel and distributed programs that scale from a small cluster to a large data center.
• Apache Flink – Open source distributed data processing platform. Distributed programs are represented as a DAG of operators (such as join, map, group, ..)
• Storm – A free and open-source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
• R3 – A MapReduce engine written in Python using a Redis backend.
• Disco – It is a lightweight, open-source framework for distributed computing based on the MapReduce paradigm.
• Phoenix – It is a shared-memory implementation of Google’s MapReduce model for data-intensive processing tasks.
• Plasma – PlasmaFS is a distributed filesystem for large files, implemented in user space. Plasma Map/Reduce runs the famous algorithm scheme for mapping and rearranging large files, and Plasma KV is a key/value database on top of PlasmaFS.
• Peregrine – A MapReduce framework designed for running iterative jobs across partitions of data.
• HTTP MR – A scalable data processing framework for people with web clusters.
• Sector/Sphere – Sector is a high-performance, scalable, and secure distributed file system. Sphere is a high-performance parallel data processing engine that can process Sector data files on the storage nodes with very simple programming interfaces.
• Filemap – It is a lightweight system for applying Unix-style file processing tools to large amounts of data stored in files.
• misco – A distributed computing framework designed for mobile devices.
• MR-MPI – An open-source library implementation of MapReduce, written for distributed-memory parallel machines on top of standard MPI message passing.
• GridGain – An in-memory computing platform.
But Hadoop carries another overhead: as datasets grow very large, Hadoop takes more and more time to process them; processing time keeps increasing with the size of the data. To overcome this, in-memory processing came into the picture.
Spark is the first Big Data framework introduced to process data in-memory, and this is how Spark came into the picture. It can process data 10-100x faster than Hadoop.
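A rough plain-Python analogy (an illustration, not Spark code) of why keeping data in memory speeds up repeated passes: a disk-based engine rereads and reparses its input on every pass, while an in-memory engine parses once and reuses the cached result.

```python
import os
import tempfile
from functools import lru_cache

# A small dataset on disk, standing in for a file in HDFS
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("\n".join(str(i) for i in range(100_000)))

def disk_pass():
    # Disk-based engine: reread and reparse the file on every pass
    with open(path) as f:
        return sum(int(line) for line in f)

@lru_cache(maxsize=1)
def load_in_memory():
    # In-memory engine: read and parse the file only once
    with open(path) as f:
        return tuple(int(line) for line in f)

def memory_pass():
    # Later passes reuse the cached, already-parsed data
    return sum(load_in_memory())

disk_results = [disk_pass() for _ in range(3)]
mem_results = [memory_pass() for _ in range(3)]
print(disk_results == mem_results)  # True: same answers, less repeated I/O
```

Iterative workloads such as machine learning make many passes over the same dataset, which is exactly where avoiding repeated disk I/O pays off most.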
Beyond in-memory processing, Spark offers a lot more. It can efficiently process streaming data through its own streaming engine, Spark Streaming. For SQL developers, it has its own SQL engine called Spark SQL; for machine learning, it has its own library called MLlib; and for graphs, it has its own graph engine called GraphX.
Spark is thus a complete, unified framework for handling Big Data that fulfills the 3 V's: it can process large volumes of data in far less time than Hadoop, it can handle all varieties of data, and it can handle data arriving at high velocity.
So you can start learning Big Data using Spark, which raises the question: how?
Spark applications can be developed in different languages: Java, Scala, Python, and R. As Spark itself is written in Scala, most of Spark's APIs closely mirror Scala's, so developing Spark applications is easiest if you opt for Scala.
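As a first taste of the programming model, here is a word count written in the chained-transformation style that Spark encourages, sketched with only Python's standard library. A real PySpark job would run this on an RDD, e.g. `sc.textFile(...).flatMap(...).map(...).reduceByKey(...)`; this toy stands in for it so it runs without a cluster.

```python
from collections import Counter
from itertools import chain

lines = ["spark makes big data easy", "big data with spark"]

# flatMap analogue: split every line into a flat stream of words
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey analogue: count occurrences per word
counts = Counter(words)

print(counts["spark"], counts["big"])  # 2 2
```

The point of the exercise is the shape of the computation, a pipeline of transformations over a distributed collection, which carries over directly once you switch the list for an RDD or DataFrame.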
After selecting a programming language, the next choice is the storage system. Spark doesn't have its own storage system; it connects to external storage systems such as HDFS, MongoDB, Cassandra, or S3, so you can learn about any of these.
As for the cluster manager, Spark can run on YARN, on Mesos, or on its own standalone cluster manager; again, you can learn about any of these.
Along the way, if you want to use SQL inside your application, you can. If you want to apply machine learning, you can do that as well. If you want to run graph computations on your data, you can do that too, all within Spark.
That’s how you can start learning Big data by using Spark.
We hope this blog helped you understand how to start learning Big Data using Spark. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.