Big Data Hadoop & Spark

Machine Learning with Spark: Determining Credibility of a Customer – Part 2

DataFrame:

The DataFrame API was introduced in Spark 1.3.0.

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. The DataFrame API is available in Scala, Java, Python, and R. It allows you to process structured and semi-structured data of any kind, and it enables developers to impose a structure or schema onto a distributed collection of data.

Salient Features:

  • Optimized querying

  • Tabular representation

  • Support for a variety of data formats like Avro, JSON, Parquet, etc.

  • Support for multiple data sources like HDFS, Hive, RDBMS, etc.

  • Support for SQL queries

  • Shuffling/sorting without deserializing

DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. DataFrame supports reading data from the most popular formats, including JSON files, Parquet files, Hive tables. It can read from local file systems, distributed file systems (HDFS), cloud storage (S3), and external relational database systems via JDBC. In addition, through Spark SQL’s external data sources API, DataFrames can be extended to support any third-party data formats or sources. Existing third-party extensions already include Avro, CSV, ElasticSearch, and Cassandra.

Why DataFrame when we already have RDD?

DataFrames enable optimizations that plain RDDs cannot. Unlike RDDs, DataFrames keep track of a schema and support various relational operations, which leads to more optimized execution. On top of that, a DataFrame retains all the advantages that an RDD provides.

Similar to RDDs, DataFrames are evaluated lazily. That is to say, computation only happens when an action (e.g. display result, save output) is required. All DataFrame operations are also automatically parallelized and distributed on clusters.

RDDs are a lower-level API, where you optimize the execution plan yourself as per your needs, whereas DataFrames are a higher-level abstraction in which much of the optimization is done internally.

Also, DataFrames offer schema inference: Spark can deduce the schema by sampling the data, which saves much of the effort that goes into defining the schema by hand.

When to use RDDs?

Consider these scenarios or common use cases for using RDDs:

  • When you want low-level transformations, actions, and control over your dataset.

  • When your data is unstructured, such as media streams or streams of text.

  • When you want to manipulate your data with functional programming constructs rather than domain-specific expressions.

  • When you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column.

  • When you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

You can switch seamlessly between DataFrames, Datasets, and RDDs with simple API calls, since DataFrames and Datasets are built on top of RDDs.

A Few Common Operations on DataFrames:

  • cache() : Persists the DataFrame in memory, like the cache operation on an RDD.

  • collect() : Returns all rows to the driver as a list, like the collect operation on an RDD.

  • columns : Returns list of column names or headers.

  • freqItems(cols) : Returns a DataFrame holding the most frequent items for the specified columns.

  • count() : Counts the number of records in a data frame.

  • corr(col1, col2, method=None) : Calculates the correlation of two columns of a DataFrame as a double value. Currently, only supports the Pearson Correlation Coefficient.

  • cov(col1, col2) : Calculates the sample covariance for the given columns, specified by their names, as a double value.

  • describe(*cols) : Computes descriptive statistics for the numerical columns in a DataFrame.

  • explain(extended=False) : Prints the logical and physical plans to the console for debugging. If extended is False (the default), only the physical plan is printed.

  • distinct() : Returns a new DataFrame containing only the distinct rows.

  • drop(col) : Drops a column from the existing data frame and returns a new data frame without the column.

  • dropDuplicates() : Returns a new DataFrame with duplicate rows removed.

  • dropna(how='any', thresh=None, subset=None) : Returns a new DataFrame omitting rows with null values.

  • dtypes : Returns all column names and their data types as a list.

  • filter(condition) : Returns a new DataFrame containing only the rows that satisfy the specified condition.

  • where(condition) : An alias of filter().

  • first() : Returns the first row of the DataFrame, like the first operation on an RDD.

  • groupBy(*cols) : Groups the DataFrame using the specified columns, so that aggregations such as max, min, and mean can be run on them.

  • limit(num) : Similar to limit operation in SQL, this operation returns the specified number of records from the data frame.

  • orderBy(*cols, **kwargs) : Returns a new DataFrame sorted by the specified columns.

  • sort(*cols, **kwargs) : An alias of orderBy().

  • printSchema() : Displays the schema for the data frame.

  • randomSplit(weights, seed=None) : Randomly splits the DataFrame with the provided weights, like the randomSplit operation on an RDD.

  • registerTempTable(name) : Registers the DataFrame as a temporary table so that it can be queried with regular SQL. (Superseded by createOrReplaceTempView in Spark 2.0+.)

  • schema : Returns the schema of the DataFrame as a StructType.

  • select(*cols) : Selects columns from the data frame.

  • show(n=20) : Prints the first n rows of the DataFrame in a tabular format.

  • toJSON(use_unicode=True) : Converts each row of the DataFrame into a JSON string, returned as an RDD of strings.

  • withColumn(colName, col) : Returns a new DataFrame with a column added (or replaced, if the name already exists); col is an expression, typically built from existing columns.

These were just a few of the most frequently used DataFrame APIs for data processing. In our next post, we will use the RDD and DataFrame APIs to solve a classification problem.

Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.

