Are you fed up of using HIVE with lot of incomplete jobs? OR have lost output? OR Waiting too long to run another job ?
Well then the answer to all these questions and many more is this Blog which will help you in setting up MYSQL database connectivity with Hive and give you the flexibility required to use multiple Hive jobs running at the same time.
100% Free Course On Big Data Essentials
Subscribe to our blog and get access to this course ABSOLUTELY FREE.
Before I go in-depth let us first understand what Hive does.
Hive stores the metadata related to tables and databases into the external RDBMS like Apache Derby or MYSQL and metadata.
Now let us understand what the terms – meta store and database mean?
- The megastore service provides the interface to the Hive.
- The database stores the data definitions and mappings to the data.
The megastore (which consists of services and database) can be configured in different ways. Embedded Apache Derby is used as the default Hive megastore in the Hive configuration. This configuration is called embedded meta store and is good for the sake of development and unit testing, but won’t scale to a production environment as only a single user can connect to the derby database at any instant of time. Starting second instance of the Hive driver will throw an error message.
So what is Apache Derby?
Apache Derby, an Apache DB subproject, is an open source relational database implemented entirely in Java. Some key features include:
- Derby is based on the Java, JDBC, and SQL standards.
- Derby provides an embedded JDBC driver that lets you embed Derby in any Java-based solution.
- Derby also supports the more familiar client/server mode with the Derby Network Client JDBC driver and Derby Network Server.
- Derby is easy to install, deploy, and use.
Most importantly Derby is single instance database, which means only one user can access the derby instance at one time and this had been a motivational factor to include Mysql as the default meta store.
Advantages Of using Mysql as a megastore in Hive-
- It is Stable
- It keeps a track of metadata.
- It can support multiple instances of Hive.
Prerequisite:- Hive should be installed with Hadoop daemons running on Centos operating system.
In order to change the default meta store from Derby to Mysql we need to change the property in Hive-site.xml.
Since Hive-0.10, we get only hive-default.xml. We need to explicitly create Hive-site.xml to override the default property containing the configuration of Apache Derby.
- Use the below link to download Hive-site.xml and MYSQL-connector-java-5.1.2.jar.
- Inside $HIVE_HOME/conf directory you can directly download Hive-site.xml.
- MYSQL driver to connect with Hive, should be placed inside $HIVE_HOME/lib directory.
- Once the Hadoop daemons are installed and running give the following command
sudo service mysqld start
- Log-in to MYSQL with root user and create an account.
mysql -u root
NOTE: no password for root//by default it is null
This will give privileges for root user to be safe(security purpose).
create user [email protected] identified by ‘password’;
- Giving permissions to user using command.
grant all on *.* to [email protected] identified by ‘password’ with grant option;
- Followed by
This command simply reloads the grant tables in your MySQL database by enabling the changes to take effect without stopping and restarting MySQL.
- Log-in by the user created just now.
mysql -u acadgilduser -p
Give password: password
- On another terminal open Hive, and create database by command.
Create database AcadGild1;
You can check whether the Hive is in synchronization with MYSQL.
10. Log into root of MYSQL and go to metastore in databases.
11. Inside metastore database, list all tables.
See all contents inside DBS. This is where you can find metadata created by Hive.
Now your Hive and MYSQL is completely synchronized.
And You can use multiple Hive instance in system.