This blog post illustrates an industry scenario there a collaborative involvement of Spark SQL with HDFS, Hive, and other components of the Hadoop ecosystem.
Spark is perhaps is in practice extensively, in comparison with Hive in the industry these days. Nevertheless, Hive still has a strong foothold, and those who work with Spark SQL and structured data, still use Hive tables to a large extent.
On the contrary, Hive has certain drawbacks. Initially, due to MapReduce jobs underneath, this process is slow. Secondly, it is only suitable for batch processing, and not for interactive queries or iterative jobs. Spark SQL, on the other hand, addresses these issues remarkably well. We can directly access Hive tables on Spark SQL and use SQLContext queries or DataFrame APIs to work on those tables. The process is fast and highly efficient compared to Hive.
Why do we access Hive tables on Spark SQL and convert them into DataFrames?
The answer is simple. For large scale projects working with petabytes of data, It is possible to reduce the time taken to process the aggregate function, if we can execute complex queries in an efficient way. Hence, we use Spark SQL, which has an in-built catalyst optimizer that processes all types of queries at a faster pace. It even allows the uage of external DataFrames with Hive tables for purposes such as join, cogroup, etc.
In this blog post, we can understand see: How we can access Hive tables on Spark SQL; How to perform collaborative operations on Hive tables and external DataFrames, and some other aggregate functions.
Establishing Connection Between Hive and Spark SQL
Let us begin by connecting Hive to Spark SQL. We can execute this by following the steps below:
Step 1: Move hive-site.xml from $HIVE_HOME/conf/hive-site.xml to $SPARK_HOME/conf/hive-site.xml. Make an entry regarding hive metastore uris in this file. The entry will look like this:
Step 2: Extract all the dependencies for required Spark components (in this case Spark SQL and Hive) in the build.sbt file.
Step 3: Start all Hadoop processes in the cluster. Verify the following:
Step 4: Start MySQL because Hive needs it to connect to the metastore and because Spark SQL will also need it when it connects to Hive
Step 5: Run the Hive metastore process so that when Spark SQL runs, it can connect to metastore uris and take from it the hive-site.xml file mentioned in the first step.
The steps above are to configure Hive and Spark SQL so that they can work together. Now we shall discuss Spark SQL code to see how it connects to Hive.
Spark SQL Coding
Let us start by creating a Spark session. In Spark 2+ this includes SparkContext and SQLContext.
Here, we provide the path to hive.warehouse.dir, which is /user/hive/warehouse on HDFS, as the path to spark.sql.warehouse.dir.
Then, since Spark SQL connects to Hive metastore using thrift, we need to provide the thrift server uri while creating the Spark session. This information is present in hive-site.xml (from step 1), which will be sent to $SAPRK_HOME/conf.
Now, we are ready to access Hive’s databases (default and other), and the cascaded tables stored in them.
Before that, let us list all the databases starting with “d” using the following code:
When we deploy the code to the cluster, it generates the following output:
In order to check the connection between Spark SQL and Hive metastore, the verification of the list of Hive databases and tables using Hive prompt could be done.
Thus, there is successful establishement of connection between Spark SQL and Hive.
To create a Hive table using Spark SQL, we can use the following code:
When the jar submission is done and we execute the above query, there shall be a creation of a table by name “spark_employee” in Hive. Once again, we can use Hive prompt to verify this.
Now that we have a database with the project name and a table with titles as“enriched_data” in it, we can extract values from it using the following code:
The Output of Spark Console
Output validated on Hive prompt: –
Let us now create a local DataFrame and combine it with this table to see what is the output. We can create a local DataFrame with dummy data using the following code:
We can join the firstSourceDF with the secondSourceDF in the following way:
The output (First printSchema) is as follows:
As we can see in the image, the id and name are from the first DataFrame, while the rest of the fields are from the second. It is proof that there exists a combination between two DataFrames. We can see the result below:
The code snippets and Hive queries in this blog post portray that Spark SQL can connect to Hive tables and carry out all kinds of analyses. The last example showcase that Spark SQL is even capable of joining Hive tables to locally create DataFrames.
Finally, to run the program, we need to follow these steps:
- Save the program as SparkPlusHive.scala after writing it.
- Go to the location of build.sbt file and execute the “sbt compile” and “sbt package” commands.
The command will create a jar in /home/acadgild/IdeaProjects/SparkAndHive/project/target/scala-2.11:
Execute spark-submit command to run the jar. The command is as follows:
Hope this blog helped you getting some insights about Hive Tables and Spark SQL, Stay tuned for more blogs and enroll for Big Data and Hadoop training with Acadgild.