In our previous blog posts, we gave a brief introduction to Apache Hive and its DDL commands, so you already know how data is defined and how it resides in a database.
If you are working on Hive projects, you should also know Hive's architecture, its components, how it interacts with Hadoop internally, and its other important characteristics.
In this blog, you will learn the following:
What is Hive?
Important Characteristics Of Apache Hive
How To Process Data With Apache Hive?
So let's get started.
What is Hive?
Hive is a data warehousing tool built on top of the Hadoop Distributed File System (HDFS).
Hive makes it easy to perform operations such as:
- Data Encapsulation.
- Ad-hoc Queries.
- Analysis of large datasets.
Important Characteristics Of Apache Hive
- In Hive, data is loaded after the databases and tables have been created. Because Hive is a data warehouse engine, the user first creates the tables, then loads the data, and only then submits queries on top of those tables.
- Hive supports query optimization, that is, ways of executing queries more efficiently, through techniques such as the choice of file format, partitioning, and bucketing.
- Hive's SQL-like query language spares the user the complexity of writing MapReduce programs by hand.
- Hive borrows its concepts from the relational database world, such as rows, columns, and tables, which makes it easy to learn and understand.
- An important component of Hive is the Metastore, which is used for storing schema information. The Hive Metastore keeps the schema information of the tables and relations in the Hive “Meta storage database”.
- Hive queries can be run from the Hive command-line interface or from Hive Query Language (.hql) scripts, and can be extended with user-defined functions written in Java.
- Hive supports various file formats such as TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar File).
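The create, then load, then query order described above can be sketched with a small Python helper that assembles the text of a HiveQL (.hql) script. This is only an illustration: the table name `employees`, its columns, and the HDFS path `/data/employees.csv` are hypothetical, and the helper only builds the statements as text; it does not run them against Hive.

```python
def build_hql_script(table, columns, hdfs_path):
    """Return HiveQL text that creates a table, loads data, then queries it.

    `columns` is a list of (name, hive_type) pairs. All names passed in are
    illustrative; nothing here talks to a real Hive installation.
    """
    column_defs = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    statements = [
        # Step 1: define the table first.
        f"CREATE TABLE IF NOT EXISTS {table} ({column_defs}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"STORED AS TEXTFILE",
        # Step 2: load the data. LOAD DATA moves files as they are, so the
        # file must already match the table's storage format.
        f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}",
        # Step 3: only now can queries run on top of the table.
        f"SELECT * FROM {table} LIMIT 10",
    ]
    return ";\n".join(statements) + ";\n"


script = build_hql_script(
    "employees",
    [("id", "INT"), ("name", "STRING"), ("salary", "DOUBLE")],
    "/data/employees.csv",
)
print(script)
```

Saved to a file such as `load_employees.hql` (a made-up name), the generated text could then be run from the command line with `hive -f load_employees.hql`.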
Apache Hive Architecture
The diagram above shows the Hive architecture and its components.
Hive uses the concept of MapReduce internally for job execution.
The major components of Apache Hive are:
- Hive Client
- Hive Service
- Processing Framework And Resource Management
- Distributed Storage
These are the main components of Apache Hive, and we will discuss them in detail in the following sections.
Hive Client
Users can write Hive client applications in the language of their choice. Hive supports applications written in languages such as Java and Python through the Thrift, JDBC, and ODBC drivers.
These clients fall into three types:
- Thrift clients: The Apache Hive server is based on Thrift, so it can serve requests from any language that supports Thrift.
- JDBC clients: Apache Hive allows Java applications to connect to it using a JDBC driver.
- ODBC clients: The ODBC driver allows applications that support the ODBC protocol to connect to Hive.
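As a rough illustration of a client talking to the Hive server, here is a sketch in Python. It assumes the third-party PyHive package, which this post does not cover, and a Hive server listening on `localhost:10000`; the host, port, and user name are placeholder values. The query helper relies only on the standard Python DB-API cursor interface, and the PyHive import is kept inside `connect_to_hive`, so nothing tries to connect until that function is called.

```python
def connect_to_hive(host="localhost", port=10000, username="hive_user"):
    """Open a connection through the Thrift-based Hive server.

    Assumes the third-party PyHive package is installed; host, port, and
    username are placeholder values, not settings taken from this post.
    """
    from pyhive import hive  # imported lazily so the sketch loads without it
    return hive.Connection(host=host, port=port, username=username)


def fetch_rows(connection, query):
    """Run a query on any DB-API style connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        cursor.close()
```

With a Hive server running, a call such as `fetch_rows(connect_to_hive(), "SELECT * FROM employees LIMIT 5")` would return the rows as a list, where `employees` is a hypothetical table.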
Hive Service
Hive provides several services, such as a web user interface and a command-line interface (CLI), for running queries on data.
- CLI (Command Line Interface): Hive provides a default shell where you can execute your queries, commands, and Hive jobs.
- Web Interface (WebUI): Hive also provides a web user interface for executing queries and commands.
- Hive Server: The Hive server is also known as the Thrift server. It allows different clients to submit requests to Hive and retrieve the results.
- Hive Driver: Queries that clients submit through the Thrift server, JDBC, ODBC, or the CLI are received by the driver.
- Metastore: The metastore is the central repository in the Hive architecture. It holds the metadata of the databases and tables, such as each table's schema and location.
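To make the metastore's role concrete, the following toy model in Python treats it as a lookup from a table name to that table's schema and storage location. This is not Hive's actual implementation or API; the class names, the `employees` table, and its path are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    database: str
    columns: tuple   # (column_name, hive_type) pairs
    location: str    # where the table's files live in distributed storage


class ToyMetastore:
    """In-memory stand-in for a central metadata repository."""

    def __init__(self):
        self._tables = {}

    def register(self, name, metadata):
        self._tables[name] = metadata

    def lookup(self, name):
        # This toy version only checks a dictionary held in memory.
        if name not in self._tables:
            raise KeyError(f"Table not found: {name}")
        return self._tables[name]


metastore = ToyMetastore()
metastore.register(
    "employees",
    TableMetadata(
        database="default",
        columns=(("id", "INT"), ("name", "STRING")),
        location="/user/hive/warehouse/employees",
    ),
)
print(metastore.lookup("employees").location)
```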
How Does Apache Hive Process The Data?
Now we will discuss how a query executes in Hive.
- The user interface or CLI submits the query, which is received by the driver.
- The driver processes the request and sends it to the compiler to generate an execution plan.
- The compiler needs the metadata, so it sends a request to the metastore and receives the metadata in response.
- The compiler uses the metadata to check the expressions in the query and then generates a DAG plan made up of stages. A stage can be a map stage or a reduce stage.
- The execution engine submits these stages to the appropriate components.
- Each stage contains multiple tasks. Once the appropriate component finishes a task, it generates the output and writes it to a temporary HDFS file through the serializer. For DML operations, the final temporary file is then moved to the table's location.
- For queries, the execution engine reads the contents of the temporary files directly from HDFS as part of a fetch call from the driver.
So, this is how a Hive query is executed.
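The flow above can be mimicked with a toy Python sketch. It is a simplified model under stated assumptions, not Hive's real code: the function names, the dictionary used as a metastore, the rule that only a query with GROUP BY gets a reduce stage, the temporary paths, and the `employees` table are all invented for illustration.

```python
# Toy metadata store: table name -> column names (hypothetical contents).
METASTORE = {"employees": ["id", "name", "dept"]}


def compile_query(query, metastore):
    """Compiler: look up metadata, check the table, and emit a list of stages."""
    words = query.replace(",", " ").split()
    upper = [w.upper() for w in words]
    table = words[upper.index("FROM") + 1]
    if table not in metastore:
        raise ValueError(f"Unknown table: {table}")
    # Simplifying assumption: every query gets a map stage, and a reduce
    # stage is added only when the query aggregates with GROUP BY.
    stages = ["map"]
    if "GROUP" in upper:
        stages.append("reduce")
    return {"table": table, "stages": stages}


def execute_plan(plan):
    """Execution engine: pretend to run each stage and return the path of
    the temporary file that stage would write (paths are made up)."""
    return [f"/tmp/hive/{plan['table']}/{stage}-output" for stage in plan["stages"]]


def driver(query, metastore=METASTORE):
    """Driver: receive the query, have it compiled and executed, then fetch."""
    plan = compile_query(query, metastore)
    temp_files = execute_plan(plan)
    # Fetch call: results are read from the last stage's temporary file.
    return plan["stages"], temp_files[-1]


print(driver("SELECT dept, COUNT(*) FROM employees GROUP BY dept"))
print(driver("SELECT name FROM employees"))
```

The aggregate query produces a map stage followed by a reduce stage, while the plain SELECT needs only a map stage, which mirrors the idea that the compiler decides the stages before the execution engine runs them.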
Hive is an ETL and data warehousing tool built on top of Hadoop for analyzing and processing large amounts of data. It provides a simple query language, HQL, for querying and processing the data. Now that you have learned the Apache Hive architecture and its components, let's learn How To Install The Hive On Ubuntu to get hands-on.
If you have any further questions, please share them in the comments. Happy Learning 🙂