n this blog, we will be discussing the ways by which Apache Presto can be integrated with Hive and how SQL queries can be performed on Hive using Presto.
What is Presto SQL?
Presto is an open source distributed SQL query engine for big data processing. It is used by many organizations, like Facebook, because of its scalability and interactive analytical approaches.
What is Presto Used For?
Presto can be used as an SQL query engine on top of Hive, Cassandra, relational databases, and much more. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
Installing Presto & Integrating with Hive:
Now let us see how to install Presto, and how it can be integrated with Hive.
- Download Presto from the following link:
https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.152/presto-server-0.152.tar.gz
- Untar using the following command:
tar -xvzf presto-server-0.152.tar.gz
- Create a directory to store Presto’s data using the following command:
mkdir presto_data
- presto_data directory will be created in the $HOME directory.
Configurations for installing Presto
- Create an etc directory inside the installation directory. This will hold the following configuration:
- Node Properties: Environmental configuration specific to each node
- JVM Config: Command line options for the Java Virtual Machine
- Config Properties: Configuration for the Presto server
- Catalog Properties: Configuration for Connectors (data sources)
- Inside the etc folder, create a file with name “node.properties” wherein you need to give the following information:
- node.environment=production
- node.id=presto1
- node.data-dir=/home/kiran/presto_data (Here you need to mention the directory that you have created for storing Presto’s data.)
-
- Inside the etc folder itself, create a file with name “jvm.config” to hold the jvm configurations wherein you need to give the below information:
- server
-
-
- Xmx16G
- XX:+UseG1GC
-
- XX:G1HeapRegionSize=32M
- XX:+UseGCOverheadLimit
- XX:+ExplicitGCInvokesConcurrent
- XX:+HeapDumpOnOutOfMemoryError
- XX:OnOutOfMemoryError=kill -9 %p
-
- Inside the etc folder itself create one more file with the name “config.properties” where you need to provide the configuration for a single node cluster.
- coordinator=true
- node-scheduler.include-coordinator=true
- http-server.http.port=8080
- query.max-memory=5GB
- query.max-memory-per-node=1GB
- discovery-server.enabled=true
- discovery.uri=http://localhost:8080
-
- For connectors, you need to create one more directory inside the etc folder with the name “catalog”. Inside the catalog if you are connecting Presto with Hive, then you need to create a file with name “hive.properties” where you need to add the following information.
- connector.name=hive-hadoop2
- hive.metastore.uri=thrift://localhost:9083 (Here you need to provide the hive metastore uri)
- hive.config.resources=/home/kiran/hadoop-2.7.1/etc/hadoop/hdfs-site.xml
Note: To set hive metastore uri, add the following property in your hive-site.xml file
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
<description> URI for client to contact metastore server </description>
</property>
You can give the port a number of your choice.
- In the catalog directory itself, create one more file with the name “jmx.properties” and give the following property
- connector.name=jmx
Now the whole directory structure will look as following:
- Set Presto home in bashrc
- #set Presto_Home
- export PRESTO_HOME=/home/kiran/presto-server-0.152
- export PATH=$PATH:$PRESTO_HOME/bin
- After setting the properties, save and close the file and update the bashrc file using the command “source .bashrc”.
- Now from your terminal home, use either the command of “launcher start” or “launcher run” to start Presto.
CLI
- Download presto-cli jar from the below link
- https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.152/presto-cli-0.152-executable.jar
- Rename it as presto and make it an executable jar file by using the command “chmod +x path_to_presto”
- Now run the following command to enter into Presto CLI.
presto –server localhost:8080 –catalog hive –schema default
- To query Hive tables using presto, you need to start the Hive metastore service in one terminal which should keep on running.
- Start Hive metastore service by using the following command:
- hive –service metastore -p 9083
In the above screenshot, we have started the Hive metastore service, which is on the left-hand side, and on the right-hand side we have started the Presto server and the Presto CLI.
With the presto CLI we have also performed one query to check the list of tables in Hive.
- Now we will check in Hive for the same tables.
-
- In the above screenshot, the same list of tables has been displayed on the Hive console also.
- Let us perform one simple query on the Olympic table present in Hive. Now we will find out the average age of athletes from the Olympic table. You can see the result in the following screenshot.
We hope this blog post helped you in understanding Presto better, and how to integrate it with Hive and perform SQL kind of queries on Hive using Presto.
Keep visiting our website, www.acadgild.com, for more updates on big data and other technologies. Click here to learn Big Data online.