Big Data Hadoop & Spark

Beginner’s Guide for Hive

Why was Hive introduced?

A few years ago, Hadoop came into existence to solve queries on huge structured, semi-structured, and unstructured datasets. Hadoop is considered one of the best solutions for storing and processing such datasets because of its advantages: scalability, low infrastructure cost (commodity hardware), fault tolerance (replication factor), and MapReduce, a powerful programming model.

In the beginning, big organizations like Facebook, which received huge volumes of raw data on a daily basis, wrote Python scripts to ingest the data into Oracle databases. As the amount of data kept increasing, it was no longer practical to keep writing lengthy and complex programs. To handle and process these huge datasets, Facebook developed Hive, which offers a SQL-like language, so programmers no longer needed to write complex programs.

Hive was later open-sourced under the Apache Software Foundation and is currently one of the most widely used Hadoop tools in many organizations.

What is Hive?

Hive is a data warehouse solution built on top of Hadoop for managing huge amounts of data stored in HDFS. It provides a SQL-like query language called HiveQL. Hive is designed to allow easy data summarization, ad-hoc querying, and analysis of structured Big Data in a Hadoop cluster.

Hive uses HDFS for storage, but keeps all metadata about databases and tables in a relational metastore, typically a Derby or MySQL database.


Hive vs. SQL:

  • Hive is a SQL-like scripting language built on MapReduce; according to ANSI, SQL is the standard language for RDBMSs, used to communicate with databases.
  • Hive is used for analytics; SQL is used for transactional processing (OLTP) as well as analytics.
  • A Hive query typically processes data in petabytes; a SQL query typically processes data in gigabytes.
  • Hive offers faster execution when performing analytics on huge datasets; SQL offers slower execution at that scale.
  • Hive requires no normalization; SQL supports normalization.


How to install Hive?

To install Hive, please follow the procedure in the link below:

Hive Shell

Hive provides a default interface that allows users to run Hive commands. The CLI (Command Line Interface) is the default Hive shell service, through which users can run Hive programs.

The following are the commands used to start the Hive shell:
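A minimal sketch of starting the shell, assuming Hive is installed and its bin directory is on your PATH:

```shell
$ hive      # launches the Hive CLI
hive>       # the hive> prompt means the shell is ready for commands
```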

Creating a Database

A database is created before creating a table because a database in Hive is a namespace, in which a table or collection of tables is placed for Hive queries. In Hive, the terms database and schema refer to the same thing.

The command to create a database is as follows:
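A minimal sketch, using the EMPdb database that appears later in this guide (the optional IF NOT EXISTS clause prevents an error if the database already exists):

```sql
hive> CREATE DATABASE IF NOT EXISTS EMPdb;
```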


Listing Databases

The show databases command displays the list of databases so that users can select the one they want to work with.
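For example:

```sql
hive> SHOW DATABASES;
```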


Using a Database

The command to use a database is USE. The use database command tells Hive which database subsequent statements should run against; the user can then create new tables in that database and alter existing tables in it.

hive> USE EMPdb;

Creating a Table

The command for creating a table is CREATE TABLE.

Before creating a table, a user must know the following details:

  • The create table command allows the user to create a new table with user input attributes/columns.
  • Row format delimited fields terminated by '\t' – This line informs Hive that each column in the file is separated by a tab.

hive> CREATE TABLE emp (id INT, name STRING, sal FLOAT)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t';

List Tables

The ‘show tables’ command displays the list of tables present in a particular database.

hive> SHOW TABLES;

Once the table names are displayed, the user can select a particular table and work on it. The user can also create a new table and start processing the required query.

Describe Schema of the Table

The DESCRIBE command is similar to the DESC command in SQL. The DESCRIBE command displays the table definition/metadata of the mentioned table, so users can see the column names and the data type associated with each column.

hive> DESCRIBE emp;


Load a File from the Local File System

We know that Hadoop is designed to work on huge datasets rather than a few selected rows/columns. So the user needs to load the entire file into HDFS and process it using Hive commands.

The command used to load a file from the local file system into a Hive table is as follows:

hive> LOAD DATA LOCAL INPATH '<filename>' INTO TABLE <tablename>;
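A concrete call might look like this; the local path /home/user/emp.txt is a hypothetical example, and emp is the table created earlier:

```sql
-- '/home/user/emp.txt' is a hypothetical local path
hive> LOAD DATA LOCAL INPATH '/home/user/emp.txt' INTO TABLE emp;
```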

Load File from HDFS

The command for loading a file that is already in HDFS is load data inpath, i.e. the same statement without the LOCAL keyword. Note that loading from the local file system copies the file, whereas loading from HDFS moves the file into Hive's warehouse directory.

hive> LOAD DATA INPATH '<filename>' INTO TABLE <tablename>;
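A concrete call might look like this; the HDFS path /data/emp.txt is a hypothetical example:

```sql
-- '/data/emp.txt' is a hypothetical HDFS path; the file is moved,
-- not copied, into the table's warehouse directory
hive> LOAD DATA INPATH '/data/emp.txt' INTO TABLE emp;
```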

Show Table Contents

Once the selected file is loaded, the contents of the table can be checked using the SELECT statement.

The SELECT * statement of a table displays all the columns present in that table.
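For example, to display every row and column of the emp table created earlier:

```sql
hive> SELECT * FROM emp;
```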

Commonly Used Alter Commands

Some of the commonly used alter commands in Hive are as follows:

Renaming the Current Table:
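A sketch using the emp table created earlier; employee is a hypothetical new name:

```sql
hive> ALTER TABLE emp RENAME TO employee;
```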


Adding New Columns to an Existing Table:
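A sketch using the emp table created earlier; the dept column is a hypothetical addition:

```sql
hive> ALTER TABLE emp ADD COLUMNS (dept STRING);
```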


Truncating a Table

The truncate table command allows users to delete only the rows of the table, not the table schema.
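For example, using the emp table created earlier:

```sql
hive> TRUNCATE TABLE emp;
```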

After running the truncate command, we can see that the table schema is unaffected but the contents of the table have been deleted. This is how you can truncate/delete table contents without affecting the table schema.

Dropping a Database

The command to drop a database is drop database. The drop database database_name statement removes that database and its metadata from Hive.
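For example, to drop the EMPdb database used in this guide (the CASCADE keyword is needed if the database still contains tables):

```sql
hive> DROP DATABASE EMPdb;

-- if the database is not empty:
hive> DROP DATABASE EMPdb CASCADE;
```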


You can send us your queries and feedback in the comment section below. Keep visiting our website Acadgild for more updates on Big Data and other technologies.



