Introduction to Big Data
This blog is about Big Data, what it means, and the applications currently prevalent in the industry.
It’s an accepted fact that Big Data has taken the world by storm and has become one of the most popular buzzwords that people keep throwing around these days.
Eric Schmidt, then CEO of Google, said in 2010, “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.”
This shows how enormous Big Data has become. The interesting part is that Big Data can be applied in almost every industry sector, including banking, logistics, retail, e-commerce and social media. Organizations in all of these sectors have adopted Big Data practices with considerable success.
With that in mind, let us look at some of the success stories from industries and sectors using Big Data technologies:
Banks across the world have used Big Data technology to ensure customer loyalty.
When issuing credit cards, banks can analyze large volumes of information to reduce the chance of credit card fraud.
Facebook, a game changer in today’s world, has been able to predict the political opinions, intelligence and emotional stability of its users based on their activity on the platform.
Facebook has been able to show us the news and ads most relevant to us, based on our past behavior on the site, and this has been possible only because of Big Data analytics.
Cricket websites like ESPN and Cricbuzz have been able to predict how a bowler will bowl or what kind of shot a batsman will play. This has been made possible by Big Data analytics.
Retail and e-commerce industries have used Big Data to predict product demand, analyze consumer behavior patterns and more.
Companies have been able to understand the buying patterns of customers and plan their production based on the results. Business decisions are now driven by Big Data analytics covering product demand, consumer behavior and supply chain mechanics. For example, analyzing customer behavior and buying patterns helps retailers sell exactly what the customer needs!
Characteristics of Big Data
Let us understand the characteristics of Big Data. “Any data which has the four Vs, i.e. Volume, Variety, Veracity and Velocity, can be termed Big Data”.
Below is a description of each of the four Vs:
Volume: This represents the amount of data and is one of the main characteristics that makes data “big”. It refers to the sheer quantity of data that organizations are trying to harness to improve decision making across the enterprise.
Velocity: This characteristic represents the speed at which data moves. It has changed the old mindset that data from yesterday, or from the past hour or minute, counts as recent. Data now moves in near real time, and the update window has shrunk to a fraction of a second. Because of this real-time nature of data creation, enterprises have invested heavily in Big Data solutions that can incorporate streaming data into business processes and decision making.
Variety: This covers the different types of data and data sources. The world has moved beyond traditional structured data such as a bank statement, which holds fields like date, amount and time. New categories have been added to the list of data types.
Unstructured data, i.e. data that does not follow a well-defined set of rules (for example Twitter feeds, audio files, MRI images, web pages and web logs), has contributed immensely to the rise of Big Data.
Veracity: This is the trustworthiness of the data, i.e. a measure of the noise, bias and abnormality in it. We may also define veracity as the level of reliability associated with certain types of data.
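As a rough illustration of veracity, the sketch below scores a single column by counting missing values and values outside a believable range. The function name, the sample readings and the range limits are all hypothetical and chosen only for this example; real data quality checks are usually far richer.

```python
def veracity_report(values, low, high):
    """Summarise how much of a column looks trustworthy.

    `low` and `high` bound the values we are willing to believe. They are
    assumptions supplied by the analyst, not something derived from the data.
    """
    total = len(values)
    missing = sum(1 for v in values if v is None)
    out_of_range = sum(
        1 for v in values if v is not None and not low <= v <= high
    )
    usable = total - missing - out_of_range
    return {
        "missing_ratio": missing / total,
        "out_of_range_ratio": out_of_range / total,
        "usable_ratio": usable / total,
    }


# Hypothetical temperature readings in degrees Celsius.
readings = [21.5, None, 22.0, 400.0, 21.8, None, -50.0, 22.3]
report = veracity_report(readings, low=-40.0, high=60.0)
print(report)
```

A low usable ratio is a hint that the source needs cleaning or cross-checking before its data drives any decision.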
Processing of Big Data
Let us now understand how Big Data is processed. The following are the steps involved:
Identification of a suitable storage for Big Data
Data cleaning and processing (Exploratory data analysis)
Visualization of the data
Application of machine learning algorithms (if required)
Identification of a Suitable Storage for Big Data
The first step of Big Data analysis is identifying appropriate storage for the data. In the Big Data world, HDFS is one of the most preferred file systems for storing Big Data.
Hadoop Distributed File System (HDFS)
It is a distributed file system that provides high-throughput access to application data. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster.
HDFS has a Master-Slave architecture: a single Master controls all the Slaves. The Master is called the NameNode and the Slaves are called DataNodes.
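To make the block and NameNode idea concrete, here is a minimal single-process sketch in Python. It is not the HDFS API: the function names, the DataNode names and the round-robin placement are assumptions made for illustration, and the 128 MB block size and three copies per block are used only as typical values.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_blocks(num_blocks, datanodes, copies=3):
    """Toy NameNode metadata: map each block id to the DataNodes holding it.

    Round-robin placement is an assumption for illustration only; real HDFS
    applies its own placement policy.
    """
    if copies > len(datanodes):
        raise ValueError("need at least as many DataNodes as copies")
    return {
        block_id: [
            datanodes[(block_id + offset) % len(datanodes)]
            for offset in range(copies)
        ]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(300)  # hypothetical 300 MB file
block_map = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(block_map)
```

The dictionary plays the role of the Master's bookkeeping: the Slaves hold the bytes, while the Master only remembers which Slave holds which block.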
Listed below are the reasons why organizations prefer HDFS as the underlying storage for Big Data:
HDFS runs on commodity hardware, which makes it cost effective.
HDFS is a fault-tolerant file system that stores multiple copies of the same data (replication). So, even if one copy is unavailable, the data can be retrieved from another location in HDFS.
HDFS can be easily used by many processing frameworks, such as the MapReduce, Pig, Hive and Impala tools discussed later in this blog.
Data ingestion refers to taking data from the source and placing it in a location where it can be processed. Since we are using Hadoop HDFS as our underlying storage framework and the related ecosystem for processing, we will look into the available data ingestion options. The following are the data ingestion options:
Batch load from RDBMS using Sqoop
Data loading from files
Real-time data ingestion
Let us now discuss the above methods for data ingestion in detail:
Batch Load from RDBMS using Sqoop
Enterprises that use Hadoop are finding it necessary to transfer some of their data from traditional Relational Database Management Systems (RDBMS) to the Hadoop ecosystem.
Sqoop, a tool in the Hadoop ecosystem, can perform this transfer in an automated fashion. Moreover, the data imported into Hadoop can be transformed with MapReduce before being exported back to the RDBMS. Sqoop can also generate Java classes for programmatically interacting with imported data.
Sqoop uses a connector-based architecture that allows it to use plugins for connecting to external databases.
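As a hedged illustration, the Python sketch below only assembles the argument list for a typical `sqoop import` run and prints it; nothing is executed. The helper name, JDBC URL, user, table and target directory are hypothetical, and credential handling is deliberately left out.

```python
import shlex


def build_sqoop_import(jdbc_url, username, table, target_dir, num_mappers=4):
    """Build the argument list for a `sqoop import` run (nothing is executed)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,
        "--table", table,                   # table to copy into Hadoop
        "--target-dir", target_dir,         # destination directory in HDFS
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


cmd = build_sqoop_import(
    "jdbc:mysql://db.example.com/sales",  # hypothetical database
    "etl_user",
    "orders",
    "/data/raw/orders",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list keeps each value separate, so it can later be passed to a process launcher without shell quoting problems.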
Data loading from files
Use File Transfer Protocol (FTP) to transfer the data to client nodes and then load it using an ETL tool. ETL tools such as Informatica and Talend can be integrated for this purpose.
Real-time data ingestion
Below is a list of some of the tools that enable real-time ingestion into HDFS:
Flume is a service for streaming logs into Hadoop. Apache Flume is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of streamed data into the Hadoop Distributed File System (HDFS).
Storm is a distributed real-time computation system for processing large volumes of high-velocity data.
Storm is extremely fast: it has been benchmarked at over a million records processed per second per node on a cluster of modest size. Enterprises harness this speed and combine it with other data access applications in Hadoop to prevent undesirable events or to optimize positive outcomes.
Apache Kafka supports a wide range of use cases as a general-purpose messaging system for scenarios where high throughput, reliable delivery, and horizontal scalability are important.
Apache Storm and Apache HBase both work very well in combination with Kafka.
Data Cleaning and Processing (Exploratory Data Analysis)
After getting the data into HDFS, we should clean the data and bring it into a format which can be processed.
A common traditional approach is to work on a sample of the large dataset that fits in memory. With the arrival of Big Data processing tools like Hadoop, many exploratory data analysis tasks can now be run on full datasets, without sampling. Just write a MapReduce job or a Pig or Hive script, launch it directly on Hadoop over the full dataset, and get the results right back on your laptop.
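To show what "cleaning" can mean in practice, here is a small Python sketch that runs on a handful of rows in memory. The column names, sample rows and cleaning rules are hypothetical; on a real cluster the same rules would be expressed in a MapReduce job or a Pig or Hive script.

```python
import csv
import io

# Hypothetical raw export with stray spaces, missing values and a bad id.
RAW = """user_id,amount,country
101, 25.50 ,IN
102,,US
abc,10.00,in
103,7.25,
104,12.00,us
"""


def clean_rows(text):
    """Yield cleaned records, skipping rows that cannot be repaired."""
    for row in csv.DictReader(io.StringIO(text)):
        user_id = (row["user_id"] or "").strip()
        amount = (row["amount"] or "").strip()
        country = (row["country"] or "").strip().upper()
        if not user_id.isdigit() or not amount or not country:
            continue  # malformed id or missing value
        try:
            value = float(amount)
        except ValueError:
            continue  # amount is not a number
        yield {"user_id": int(user_id), "amount": value, "country": country}


cleaned = list(clean_rows(RAW))
print(cleaned)
```

Only the rows for users 101 and 104 survive, with whitespace trimmed and country codes normalised to upper case.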
Let’s have a detailed discussion on various processing and cleaning methodologies provided by Hadoop.
Java MapReduce is the native way of writing MapReduce jobs, in Java. We write the map and reduce functions in Java; this approach suits data that is unstructured or semi-structured.
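The map and reduce idea itself is independent of Java. As a minimal sketch, the word count below mimics the map, shuffle and reduce phases inside a single Python process. It does not use the Hadoop API, and the function names and input lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values emitted for one key."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

On a real cluster the mappers and reducers run on different machines, and the shuffle step moves data between them over the network.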
Pig provides a data flow language, Pig Latin, which allows users to write complex MapReduce operations in a simple scripting language. Pig then transforms those scripts into MapReduce jobs.
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Cloudera Impala provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop file formats. The fast response to the queries enables interactive exploration and fine-tuning of analytic queries, rather than long batch jobs traditionally associated with SQL-on-Hadoop technologies (You will often see the term “interactive” applied to these kinds of fast queries with human-scale response times).
Visualization of the Data
Data visualization is the presentation of processed data in a pictorial or graphical format. It enables decision makers to see analytics presented visually so that they can grasp difficult concepts or identify new patterns. With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.
There are multiple tools for visualizing the processed data:
Tableau is one of the most popular visualization tools and supports a wide variety of charts, graphs, maps and other graphics. Tableau itself is a commercial product, but its free edition, Tableau Public, lets you embed the charts you create into any web page. It also has a gallery which displays visualizations created with Tableau.
QlikView is a data discovery tool that makes it easy to navigate through a sea of data in an intuitive and clear way, allowing one to move from facts to Key Performance Indicators (KPIs) and back. QlikView can be used both as an advanced reporting tool and as a Business Intelligence KPI tool, becoming the base for continuous process improvement.
Application of the Machine Learning Algorithms
Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strict static program instructions.
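A tiny example helps make "building a model from example inputs" concrete. The sketch below fits a straight line to a few made-up points using ordinary least squares and then predicts an unseen value. It is a toy chosen for illustration, not a method this blog prescribes, and the function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from example (x, y) pairs by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training examples that happen to follow y = 2x + 1.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(model, predict(model, 5))
```

Nothing in the code states the rule "multiply by two and add one"; the model recovers it from the examples, which is the sense in which the algorithm learns from data.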
Modern processing and visualization of Big Data give machine learning algorithms a strong platform, helping companies achieve better results with techniques such as clustering, classification, outlier detection and product recommendation.
Historically, large datasets were unavailable or too expensive to acquire and store, so machine learning practitioners had to find innovative ways to improve models with rather limited datasets. With Hadoop as a platform that provides linearly scalable storage and processing power, you can now store all of the data in raw format and use the full dataset to build better, more accurate models.
This sums up the steps involved in the processing of Big Data.
We hope this blog helped you get a grip on Big Data and the steps required to process it.