This blog illustrates, how to work on data in MySQL using Spark.
Spark provides data source APIs to connect to a database. Using the data source APIs, we can load data from a database and consequently work on Spark.
Here, we will be using the JDBC data source API to fetch data from MySQL into Spark using DataFrames.As data in a database will be in the form of tables, we will need to use Spark DataFrames to work on the data.
Let us create a database and table in MySQL shell.
Login to your MySQL shell.
Here we have used the root user, and the password to enter into the MySQL shell is as follows:
mysql -u root -p ******
- Create a database by using the following command:
create database Acadgild;
To work in the created database, use the following command:
- After we get access to the AcadGild database, a table for “employee” is created using the following command:
create table employee(emp_id INT NOT NULL AUTO_INCREMENT,emp_name VARCHAR(100),emp_sal INT,PRIMARY KEY(emp_id));
- A table has been created with the name employee and columns emp_id, emp_name, and emp_sal. The scheme of this table can be checked using the following command:
- Let us try inserting some sample data into the table created by using the following command:
INSERT INTO employee(emp_id,emp_name,emp_sal) VALUES (1,'kk',50000), (4,'onk',6000),(7,'mj',70000);
- We can check for the inserted data using the following command:
select * from employee;
We have successfully created a table in MySQL. Now let us check out this data in Spark.
- For this, use mysql-connector jar in the Spark classpath.
- For that download the MySQL Connector JAR file from the following link:
- Download the JAR file and copy it into your Spark’s lib folder.
- Now start the Spark shell using the command:
Constructing JDBC URL
- You need to provide the ip_address and then the port number where MySQL is running. In our case, it is just the localhost. Otherwise, it needs to be provided in the following format:
ip_addrs:port_num or localhost:port_num
- Now, you need to provide the MySQL connection properties to spark i.e., the username and password.
val mysql_props = new java.util.Properties mysql_props.setProperty("user","root") mysql_props.setProperty("password","kiran0541")
- We have now successfully given the JDBC its properties. Now we need to load the DataFrame with the JDBC data source.
- The JDBC data source will be in the following format:
(connection_url_variable, table name, properties_variable) val employee = sqlContext.read.jdbc(connection,"employee",mysql_props)
- Now we have successfully loaded the DataFrame employee with the data in the table employee which is in the MySQL database.
- You can see the contents of the employee table using the DataFrame employee by using the following command:
- On this DataFrame, you can perform all the operations that can be performed on Spark. You can register the DataFrame as a table by using the following command:
A table will be created with the name, “employee_table.”
Now you can perform all the SQL queries on this table.
val check_data = sqlContext.sql("select * from employee_table")
- With the above command, a DataFrame will be created and you can use the show command to display the table data.
The whole stack trace is shown in the screenshot below, where you can see that the data in MySQL has been loaded into the Spark DataFrame successfully. Any kinds of operations can be performed on this data.
We hope this blog helped you in understanding the concept in-depth.
Keep visiting our site www.acadgild.com for more updates on big data and other technologies.