Big Data Hadoop & Spark

Integrating SparkSQL with MySQL

In this post, we will be learning how to connect to a JDBC data-source using SparkSQL data frames. In case you are not familiar with SparkSQL, you can refer to this post for a comprehensive Introduction to SparkSQL and the post on Analyzing Crime Data using SparkSQL.
We know that DataFrame API was introduced in Spark 1.3 and that Spark RDDs are a unit of computation in Spark and doesn’t contain any schema information about the structure of the data. Dataframes combine RDDs with Schema and this small addition makes them very powerful.
For demonstration purpose, we will create a table with historical stock data in MySQL.
You can click here to download the dataset. 

Creating a Table in MySQL:

creating table
 

Loading the Data into MySQL using Load data in File Command:

load data
 

Checking the Records:

checking records
Add MySQL connector/driver to the pyspark path shell using the driver-class-path argument.
To download MySQL connector Click here
Hadoop
 

Logging into pyspark Shell:

logging to pyspark shell

Creating the URL String with MySQL Login Details:

Note: The URL should contain three parts – 1. database URL 2. user name 3. password (in this case, I don’t have any root user password so I’m not mentioning it.)
checking url

Checking the Type of stock_data Variable:

checking stock data
We can access the data from the created dataframe in two ways:

  1. Using DataFrame methods.
  2. We can register the Dataframe as a Table and then query it.

Let’s look at each of them in detail.

Using DataFrame Methods:

To check the schema, we can use the PrintSchema method.
using Dataframe method
Select the symbols from stock_data.
select symbols for stock dataOutput:
symbols for stock data

Registering DataFrame as a Table:

First, register the DataFrame as a table so that we can query the table normally.
registering dataframe as a table
Query 1: Select maximum volumes for stock symbol A.
query 1
Output:
output 1
Query 2: Select maximum value of volumes for each stock symbol.
query 2
Output:
output 2

We hope this blog helped you in understanding the integration of Mysql and Spark Sql and performing the basic operations on them.

Keep visiting our website acadgild.com more blogs on big data and other technologies.
Spark

Leave a Reply

Your email address will not be published. Required fields are marked *

Related Articles

Close