Big Data Hadoop & Spark

Working With Hive Complex Data Types

In this blog, we will discuss how complex Hive data types work. Before moving ahead, you can go through our earlier blogs to gain more background on Hive and its working.

What are Complex Data Types in Hive?

Complex data types, also known as nested types, allow storing multiple values within a single row/column position in a Hive table. Generally, in Hive and other databases, we have more experience working with primitive data types such as:
Numeric Types


  • TINYINT : 1-byte signed integer, from -128 to 127
  • SMALLINT : 2-byte signed integer, from -32,768 to 32,767
  • INT : 4-byte signed integer, from -2,147,483,648 to 2,147,483,647
  • BIGINT : 8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
  • FLOAT : 4-byte single precision floating point number
  • DOUBLE : 8-byte double precision floating point number
  • DECIMAL : Hive 0.13.0 introduced user definable precision and scale

String Types

  • STRING
  • VARCHAR
  • CHAR

Date/Time Types

  • TIMESTAMP
  • DATE

Misc Types

  • BOOLEAN
  • BINARY

Apart from these primitive data types, Hive offers some complex data types which are as listed below:

Complex Data Types

  • arrays: ARRAY<data_type>
  • maps: MAP<primitive_type, data_type>
  • structs: STRUCT<col_name : data_type [COMMENT col_comment], …>
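All three complex types can also appear together in one table definition. As an illustrative sketch (the table and column names here are hypothetical, not part of the examples that follow):

```sql
-- Hypothetical table mixing all three complex types
create table employee_demo(
  name string,
  phones array<string>,                                -- ordered values of one type
  scores map<string,int>,                              -- key-value pairs
  address struct<street:string, city:string, zip:int>  -- named fields of mixed types
);
```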

Let us now look at what these complex data types are and how they work in Hive.
In our examples, we will be using the default Hive database to store the complex data type tables.

Array:

The first complex type is the array. It is nothing but a collection of items of the same data type, i.e., an array can contain one or more values of a single data type.
In our array example, we will be using the dataset Temperature.txt, where the fields are delimited by tabs and the complex data type array values are delimited by commas.

Dataset_Temperature

Dataset Description

Date
State
Temperature (an array of district-level readings)
So, let us create a table to store the above values of the dataset Temperature using the below code.
create table Temperature(date string, city string, MyTemp array<double>) row format delimited fields terminated by '\t' collection items terminated by ',';
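For reference, a data row for this layout would look like the following (hypothetical sample values, shown only to illustrate the tab and comma delimiters; the actual rows appear in the image above):

```
2018-01-01	Assam	24.5,26.0,23.8
2018-01-01	Kerala	30.1,29.4,31.2
```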

We can observe that in the above code we are creating an array column named MyTemp which will hold only double type values.

To check the schema of the table Temperature we can use the command describe Temperature;

Now, let us load our input dataset Temperature.txt using the below command.
load data local inpath '/home/acadgild/Desktop/Temperature.txt' into table Temperature;

We can view the contents of the table using the command select * from Temperature;

We can observe from the above image that all the array values are collected and stored in a single column.

To select a column and a value from the table we can use the below command.
select city,MyTemp[0] from Temperature;

We can observe from the above image that we are selecting the column city and the values at the zeroth position of the array MyTemp.
select date,city,MyTemp[3] from Temperature;

We can observe from the above image that we are selecting the columns date and city and the values at index 3 (the fourth position) of the array MyTemp.
So, we can follow the above steps to work with complex data type array values in Hive.
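In addition to index-based access, Hive's built-in collection functions work directly on array columns. A short sketch against the same Temperature table:

```sql
-- size() returns the number of elements in each row's array
select city, size(MyTemp) from Temperature;

-- explode() with a lateral view flattens each array element into its own row
select city, district_temp
from Temperature
lateral view explode(MyTemp) t as district_temp;
```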

Now, let us know what is Map and its working.

Map:

Map is a collection of key-value pairs where fields are accessed using array notation with the key.
Eg: ['Key']

In our Map example we will be using the dataset School_Data.txt, where the fields are delimited by tabs and the complex type Map values are delimited by commas.

Dataset_School_Data

Dataset Description
School Type
State
Gender
Total

So, let us create a table to store the above values of the dataset School_Data using the below code.
create table MySchools(schooltype string, state string, gender string, total map<int,int>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':';
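For reference, a data row for this layout would look like the following (hypothetical sample values, shown only to illustrate the tab, comma, and colon delimiters; the actual rows appear in the image above):

```
Primary	Assam	Male	2016:10200,2017:10450
```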

We can observe that in the above code we are creating a map column named total which holds int keys (years) and int values (student counts).

To check the schema of the table MySchools we can use the command describe MySchools;

Now, let us load our input dataset School_Data.txt using the below command.
load data local inpath ‘/home/acadgild/Desktop/School_Data.txt’ into table MySchools;

We can view the contents of the table using the command select * from MySchools;

We can observe from the above image that each entry in the map column total contains two pieces of data, i.e., a key and a value.
Here, the year represents the key and the subsequent number represents the value.

To select a column and its key value from the table we can use the below command.
select total[2016] from MySchools where state='Assam';

We can observe from the above image that we are filtering on state = 'Assam' to find the male and female strength in the year 2016.
select total[2017] from MySchools where state='Chhattisgarh' and gender='Female';

We can observe from the above image that we are filtering on state = 'Chhattisgarh' and gender = 'Female' to find the female strength in the year 2017.

So, we can follow the above steps to work on a collection Map Key value pairs in Hive.
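Beyond looking up a single key, Hive provides built-in functions to inspect an entire map column. A short sketch against the same MySchools table:

```sql
-- map_keys() and map_values() return the keys and values as arrays;
-- size() returns the number of key-value pairs in each map
select state, map_keys(total), map_values(total), size(total)
from MySchools;
```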

Now, let us know what is Struct and its working.

Struct:

Struct is a record type which encapsulates a set of named fields that can be any primitive data type. An element in STRUCT type can be accessed using the DOT (.) notation.
In our Struct example, we will be using the dataset Bikes.txt, where the fields are delimited by tabs and the struct members are delimited by commas.

Dataset_Bikes

Dataset Description
Name
BikeFeatures struct<EngineType,cc,power,gears>

So, let us create a table to store the above values of the dataset Bikes using the below code.
create table MyBikes(name string, BikeFeatures struct<EngineType:string,cc:float,power:float,gears:int>) row format delimited fields terminated by '\t' collection items terminated by ',';
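For reference, a data row for this layout would look like the following (hypothetical sample values, shown only to illustrate the tab and comma delimiters; the actual rows appear in the image above):

```
Suzuki Swish	aircooled,124.0,8.7,4
```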

We can observe, in the above code, we are creating a collection Struct named BikeFeatures which will hold string, float, float, and int value types.

To check the schema of the table MyBikes we can use the command describe MyBikes;

Now, let us load our input dataset Bikes.txt using the below command.
load data local inpath '/home/acadgild/Desktop/Bikes.txt' into table MyBikes;

We can view the contents of the table using the command select * from MyBikes;

We can observe from the above image that the struct column BikeFeatures contains multiple values of different types.

To select a Struct column values from the table we can use the below command.
select BikeFeatures.EngineType from MyBikes;

We can observe from the above image that we are using the dot (.) operator to access the values of the EngineType field of the struct BikeFeatures.

To select a struct field's value for a particular row, we can use the below command.
select BikeFeatures.EngineType from MyBikes where name='Suzuki Swish';

We can observe from the above image that we are filtering on name = 'Suzuki Swish' to find that bike's EngineType.

So, we can follow the above steps to work with struct values and the other complex data types in Hive.
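As a final sketch, several struct fields can be selected at once, and a struct value can even be built on the fly with Hive's named_struct() built-in (the literal values below are hypothetical):

```sql
-- Select multiple fields of the struct in one query
select name, BikeFeatures.cc, BikeFeatures.power from MyBikes;

-- named_struct() constructs a struct value inline
select named_struct('EngineType', 'aircooled', 'cc', 149.0, 'power', 14.0, 'gears', 4);
```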

We hope this post has clarified the concept of complex data types in Hive.
Enroll for Big Data and Hadoop Training conducted by Acadgild and become a successful big data developer.


Manjunath

is working with AcadGild as a Big Data Engineer and is a Big Data enthusiast with 2+ years of experience in Hadoop development. He is passionate about coding in Hive, Spark, and Scala. Feel free to contact him at [email protected] for any further queries.

