In this post, we will learn how to connect to a JDBC data source using Spark SQL DataFrames. If you are not familiar with Spark SQL, you can refer to this post for a comprehensive Introduction to Spark SQL, as well as the post on Analyzing Crime Data using Spark SQL.
The DataFrame API was introduced in Spark 1.3. RDDs, Spark's basic unit of computation, do not carry any schema information about the structure of the data. DataFrames combine RDDs with a schema, and this small addition makes them very powerful.
For demonstration purposes, we will create a table with historical stock data in MySQL.
You can click here to download the dataset.
Creating a Table in MySQL:
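A minimal sketch of the MySQL statements; the database name, table name, and column names/types here are assumptions based on a typical historical stock dataset, so adjust them to match the downloaded file:

```sql
-- Database and column definitions below are assumptions; adapt to your dataset.
CREATE DATABASE IF NOT EXISTS stocks_db;
USE stocks_db;

CREATE TABLE stock_data (
    trade_date  DATE,
    open        DECIMAL(10, 2),
    high        DECIMAL(10, 2),
    low         DECIMAL(10, 2),
    close       DECIMAL(10, 2),
    volume      BIGINT,
    symbol      VARCHAR(10)
);
```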
Loading the Data into MySQL using Load data in File Command:
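Assuming the dataset was saved locally as a comma-separated file with a header row (the file path and delimiters below are illustrative, not from the original post):

```sql
-- LOCAL INFILE must be enabled on both the MySQL server and the client.
LOAD DATA LOCAL INFILE '/home/user/stock_data.csv'
INTO TABLE stock_data
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```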
Checking the Records:
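For example, to spot-check that the load succeeded:

```sql
SELECT COUNT(*) FROM stock_data;
SELECT * FROM stock_data LIMIT 10;
```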
Add the MySQL connector/driver JAR to the pyspark shell's classpath using the --driver-class-path argument.
To download the MySQL connector, click here.
Logging into pyspark Shell:
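For example (the connector JAR path and version are assumptions; use the path of the JAR you downloaded):

```shell
# Launch the pyspark shell with the MySQL JDBC driver on the classpath.
pyspark --driver-class-path /path/to/mysql-connector-java-5.1.38-bin.jar
```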
Creating the URL String with MySQL Login Details:
Note: The URL should contain three parts: 1. the database URL, 2. the user name, and 3. the password (in this case, the root user has no password, so I am not including one).
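A sketch of the connection, assuming the stocks_db database and stock_data table names used earlier, a root user with an empty password, and the sqlContext object that the pyspark shell provides (DataFrameReader API, Spark 1.4+):

```python
# Database name, table name, and credentials below are assumptions.
url = "jdbc:mysql://localhost:3306/stocks_db?user=root&password="

# sqlContext is created automatically by the pyspark shell.
stock_data = sqlContext.read.format("jdbc").options(
    url=url,
    dbtable="stock_data",
    driver="com.mysql.jdbc.Driver"
).load()
```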
Checking the Type of stock_data Variable:
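Continuing in the same pyspark session:

```python
type(stock_data)
# <class 'pyspark.sql.dataframe.DataFrame'>
```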
We can access the data from the created DataFrame in two ways:
- Using DataFrame methods.
- Registering the DataFrame as a table and then querying it.
Let’s look at each of them in detail.
Using DataFrame Methods:
To check the schema, we can use the printSchema method.
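Continuing with the stock_data DataFrame loaded above (the columns printed depend on your table definition):

```python
stock_data.printSchema()
```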
Select the symbols from stock_data.
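Assuming the symbol column name from the sketch above:

```python
# show() prints the first 20 rows by default; add distinct() for unique symbols.
stock_data.select("symbol").show()
stock_data.select("symbol").distinct().show()
```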
Registering DataFrame as a Table:
First, register the DataFrame as a table so that we can query the table normally.
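For example, registering it under the name stocks (the table name is an assumption; registerTempTable is the Spark 1.x API, replaced by createOrReplaceTempView in Spark 2.x):

```python
stock_data.registerTempTable("stocks")
```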
Query 1: Select maximum volumes for stock symbol A.
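Assuming the registered table name stocks and a volume column:

```python
sqlContext.sql(
    "SELECT MAX(volume) AS max_volume FROM stocks WHERE symbol = 'A'"
).show()
```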
Query 2: Select maximum value of volumes for each stock symbol.
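With the same assumed names, a GROUP BY gives the per-symbol maximum:

```python
sqlContext.sql(
    "SELECT symbol, MAX(volume) AS max_volume FROM stocks GROUP BY symbol"
).show()
```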
We hope this blog helped you understand the integration of MySQL and Spark SQL and how to perform basic operations on the data.
Keep visiting our website, acadgild.com, for more blogs on Big Data and other technologies.