All CategoriesBig Data Hadoop & Spark - Advanced

Running Hive Queries Using Spark SQL

So far we have seen running Spark SQL queries on RDDs. But you can also run Hive queries using Spark SQL. This instructional blog post explores how it can be done.

For executing the steps mentioned in this post, you will need the following configurations and installations:

100% Free Course On Big Data Essentials

Subscribe to our blog and get access to this course ABSOLUTELY FREE.

  1. Hadoop cluster configured in your system

  2. Hive installed and configured with Hadoop

  3. Spark installed on the top of Hadoop eco-system

Please follow the following links for the Hadoop, Hive, and Spark setup.

  • For the Hadoop cluster setup, follow the instructions mentioned in the document given below:

https://acadgild.com/blog/hadoop-2-x-installation-guide/

  • For Hive installation, follow the instructions mentioned in document in the document given below:

https://drive.google.com/open?id=0B1QaXx7tpw3SMDktdk5RbHk1MVE

  • For Spark installation, follow the instructions mentioned in the link below:

https://acadgild.com/blog/beginners-guide-for-spark/

  • If you want to configure the MySQL Metastore for Hive, follow the post mentioned below:

https://acadgild.com/blog/mysql-metastore-integration-with-hive/

*Note: In this tutorial, we have configured the Hive Metastore as MySQL.

Some more configurations need to be done after the successful configuration of Hadoop, Hive, and Spark.

Open your spark-env.sh file which is present in the $SPARK_HOME/conf directory and open the spark-env.sh file. Here, add the HIVE_HOME as shown below.

export HIVE_HOME=/path_to_hive_installed_directory

Now copy the hive-site.xml file present in the $HIVE_HOME/conf directory to the $SPARK_HOME/conf directory.

Now, we will check for the tables inside Hive first. Let’s start the Hive shell first.

*Note: Make sure that the Hive Metastore service is running. Next, start it using the command hive –service metastore

In the above screenshot, you can see the list of tables present in our Hive. Here we have loaded the Olympics data. We have created queries in the first 5 rows of the Olympics table using the following command.

select * from olympics limit 5;

Now let us check for the same by querying from the Spark Shell.

HadoopNow, start the Spark Shell and import the Hive context. Create one variable for the HiveContext as follows:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

You can see the same in the following screen shot.

Now let us see the list of tables present in Hive from the Spark Shell. Use the following command to view the list.

val hive_tables=hiveContext.sql("show tables").foreach(println)

Now let us query on the olympics table from the Spark Shell.

val first_5_rows = hiveContext.sql("select * from olympics limit 5").foreach(println)

In the above screen shot, you can see the first 5 rows of the dataset. Like this, we can execute any kind of queries on Hive data using the Spark-SQL engine.

Let us now create one table in spark shell and let us see whether will it get reflected in the hive or not.

Following is the way you can create a table in Hive through Spark Shell.

val create_table = hiveContext.sql("create table spark_olympic(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as textfile")

We have created a table with the name spark_olympic but we haven’t loaded any data into this table.

Let us see whether it will get reflected in the Hive shell or not.

In the above screenshot, you can see the table spark_olympic is in the Hive Shell and there is no data in it.

Now, let us load the data from the Spark Shell into this table and query the data from Hive.

This is how we can load data into a Hive table from a Spark Shell.

val load_data = hiveContext.sql("load data local inpath '/home/kiran/Desktop/olympix_data.csv' into table spark_olympic")

Now let us query the data in the Hive Shell from Spark. In the above screen shot, you can see that we have queries that recently loaded data.

We can run all the Hive queries on the Hive tables using the Spark SQL engine. We can also execute hive UDF’s, UDAF’s, and UDTF’s also by using the Spark SQL engine.

We hope this blog helped you understand running queries on Hive data using Spark SQL engine. Keep visiting our site www.acadgild.com for more updates.

Suggested Reading

Spark SQL Hive Example

Related Popular Courses:

BIG DATA COURSES

ANDROID DEVELOPER CERTIFICATION BY GOOGLE

KAFKA TUTORIAL FOR BEGINNERS

BEST DATA SCIENCE ONLINE COURSES

COURSES FOR DATA ANALYTICS

Spark

3 Comments

  1. My guess would be to read the contents of the “sql file” into a variable and pass the variable to hiveContext.sql(). Please correct me Krishna if I am wrong.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close