So far we have seen running Spark SQL queries on RDDs. But you can also run Hive queries using Spark SQL. This instructional blog post explores how it can be done.
For executing the steps mentioned in this post, you will need the following configurations and installations:
Hadoop cluster configured in your system
Hive installed and configured with Hadoop
Spark installed on the top of Hadoop eco-system
Please follow the following links for the Hadoop, Hive, and Spark setup.
For the Hadoop cluster setup, follow the instructions mentioned in the document given below:
For Hive installation, follow the instructions mentioned in document in the document given below:
For Spark installation, follow the instructions mentioned in the link below:
If you want to configure the MySQL Metastore for Hive, follow the post mentioned below:
*Note: In this tutorial, we have configured the Hive Metastore as MySQL.
Some more configurations need to be done after the successful configuration of Hadoop, Hive, and Spark.
Open your spark-env.sh file which is present in the $SPARK_HOME/conf directory and open the spark-env.sh file. Here, add the HIVE_HOME as shown below.
Now copy the hive-site.xml file present in the $HIVE_HOME/conf directory to the $SPARK_HOME/conf directory.
Now, we will check for the tables inside Hive first. Let’s start the Hive shell first.
*Note: Make sure that the Hive Metastore service is running. Next, start it using the command hive –service metastore
In the above screenshot, you can see the list of tables present in our Hive. Here we have loaded the Olympics data. We have created queries in the first 5 rows of the Olympics table using the following command.
select * from olympics limit 5;
Now let us check for the same by querying from the Spark Shell.
import org.apache.spark.sql.hive.HiveContext val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) import hiveContext._
You can see the same in the following screen shot.
Now let us see the list of tables present in Hive from the Spark Shell. Use the following command to view the list.
val hive_tables=hiveContext.sql("show tables").foreach(println)
Now let us query on the olympics table from the Spark Shell.
val first_5_rows = hiveContext.sql("select * from olympics limit 5").foreach(println)
In the above screen shot, you can see the first 5 rows of the dataset. Like this, we can execute any kind of queries on Hive data using the Spark-SQL engine.
Let us now create one table in spark shell and let us see whether will it get reflected in the hive or not.
Following is the way you can create a table in Hive through Spark Shell.
val create_table = hiveContext.sql("create table spark_olympic(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as textfile")
We have created a table with the name spark_olympic but we haven’t loaded any data into this table.
Let us see whether it will get reflected in the Hive shell or not.
In the above screenshot, you can see the table spark_olympic is in the Hive Shell and there is no data in it.
Now, let us load the data from the Spark Shell into this table and query the data from Hive.
This is how we can load data into a Hive table from a Spark Shell.
val load_data = hiveContext.sql("load data local inpath '/home/kiran/Desktop/olympix_data.csv' into table spark_olympic")
Now let us query the data in the Hive Shell from Spark. In the above screen shot, you can see that we have queries that recently loaded data.
We can run all the Hive queries on the Hive tables using the Spark SQL engine. We can also execute hive UDF’s, UDAF’s, and UDTF’s also by using the Spark SQL engine.
We hope this blog helped you understand running queries on Hive data using Spark SQL engine. Keep visiting our site www.acadgild.com for more updates.