In this blog, we will work through a use case involving electric bulbs to explore the date and time functions in Apache Pig.
In this instance, Pig is run in local mode so that it loads data from the local file system. Pig can also be run in MapReduce mode against data stored in HDFS, whichever is more convenient.
In the research center of bulb manufacturing companies, the longevity of bulbs is tested by subjecting them to adverse conditions.
The dataset used in this case is a sample from a light bulb production house where bulbs are tested at random intervals of time. The first column is StartDate, the date and time when the testing of the bulb started. The second column is EndDate, the date and time when the testing ended.
30-Jun-2018 23:42 04-Jul-2018 15:10
30-Jun-2018 23:13 30-Jun-2019 23:34
A few rows may be empty, which indicates that the data is not available for one reason or another. As developers, we need not worry about the missing data: with the help of data filtering, we can remove the rows we cannot use.
Loading Data into the Pig environment
Since PigStorage uses tab (\t) as its default delimiter, it is not mandatory to state USING PigStorage('\t') while loading tab-delimited data. Nevertheless, it is good practice to write it explicitly. For data that uses a different delimiter, pass that delimiter instead.
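As a minimal sketch of the load step, assuming the sample is saved in a tab-separated file (the name `bulb_data.txt` and the relation name `bulbs` are placeholders, not taken from the original dataset), the statement could look like this:

```pig
-- 'bulb_data.txt' is a hypothetical file name; point it at your own copy of the data.
-- Both columns are read as chararray so they can be cleaned before any date conversion.
bulbs = LOAD 'bulb_data.txt' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);
```

To run it in local mode, start the Grunt shell with `pig -x local` and enter the statement there.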
Now that the data is loaded into Pig, the first step is to filter the columns we are working on.
Here we remove all the rows with null data.
In this step, we must also filter out every row whose EndDate holds only the - symbol, since such a value cannot be converted to a datetime later on.
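The two filtering steps above might be sketched as follows. The file name and relation names are placeholders, and we assume that a missing end time is recorded as a lone hyphen; adjust the comparison if your data marks it differently.

```pig
-- Hypothetical file name; same load as in the loading step.
bulbs = LOAD 'bulb_data.txt' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

-- Drop rows where either timestamp is missing.
non_null = FILTER bulbs BY StartDate IS NOT NULL AND EndDate IS NOT NULL;

-- Drop rows whose EndDate is just the '-' placeholder (assumed marker).
-- An equality test is used because valid dates such as 04-Jul-2018 also contain hyphens.
clean = FILTER non_null BY EndDate != '-';
```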
We have to convert the data loaded in Pig into datetime format in order to work with it.
Here, we use two predefined functions: ToDate and a date-difference function (MinutesBetween, since we want the result in minutes).
The first one converts a character array into a datetime value that Pig can interpret, and the second one returns the difference between the two datetime parameters provided.
The ToDate function accepts a format string that describes how the year, month, day and time are laid out in the input. For example, 'yyyy-MM-dd' matches 2018-06-30, while 'dd-MMM-yyyy HH:mm' matches 30-Jun-2018 23:42.
We choose the format that matches the structure of the dataset provided.
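For the sample rows shown earlier, the conversion could be sketched like this. The file and relation names are placeholders, and the pattern 'dd-MMM-yyyy HH:mm' is our reading of the sample values rather than something stated with the dataset.

```pig
-- Hypothetical file name; load and clean as in the earlier steps.
bulbs = LOAD 'bulb_data.txt' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);
clean = FILTER bulbs BY StartDate IS NOT NULL
                    AND EndDate IS NOT NULL
                    AND EndDate != '-';

-- Convert both chararray columns to datetime.
-- dd = day of month, MMM = abbreviated month name, yyyy = year, HH:mm = 24-hour time.
converted = FOREACH clean GENERATE
    ToDate(StartDate, 'dd-MMM-yyyy HH:mm') AS start_dt,
    ToDate(EndDate,   'dd-MMM-yyyy HH:mm') AS end_dt;
```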
After simple filtering and conversion of the character array data to datetime format, we can determine, in minutes, how long every bulb stayed in the ON state during testing.
We can see the results with the DUMP command.
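Putting the steps together, the difference in minutes could be computed and dumped roughly as follows. This is a sketch under the same assumptions as before: a hypothetical file name, a lone hyphen as the missing-value marker, and the 'dd-MMM-yyyy HH:mm' pattern.

```pig
-- Hypothetical file name; load, clean and convert as described above.
bulbs = LOAD 'bulb_data.txt' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);
clean = FILTER bulbs BY StartDate IS NOT NULL
                    AND EndDate IS NOT NULL
                    AND EndDate != '-';
converted = FOREACH clean GENERATE
    ToDate(StartDate, 'dd-MMM-yyyy HH:mm') AS start_dt,
    ToDate(EndDate,   'dd-MMM-yyyy HH:mm') AS end_dt;

-- MinutesBetween takes the later datetime first so the result is positive.
durations = FOREACH converted GENERATE
    MinutesBetween(end_dt, start_dt) AS minutes_on;

-- Print one tuple per bulb to the console.
DUMP durations;
```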
Result in minutes is displayed:
Once we have this, we can perform analysis on the result, for example, to find the maximum or minimum time a bulb can stay ON.
Shown below is the result for the average time the bulbs were ON during the testing phase.
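One way to compute such summary figures is to group all rows into a single bag and apply Pig's built-in aggregates. The sketch below repeats the earlier steps so that it stands on its own; the file name, relation names and field names are placeholders.

```pig
-- Hypothetical file name; same pipeline as in the earlier steps.
bulbs = LOAD 'bulb_data.txt' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);
clean = FILTER bulbs BY StartDate IS NOT NULL
                    AND EndDate IS NOT NULL
                    AND EndDate != '-';
durations = FOREACH clean GENERATE
    MinutesBetween(ToDate(EndDate,   'dd-MMM-yyyy HH:mm'),
                   ToDate(StartDate, 'dd-MMM-yyyy HH:mm')) AS minutes_on;

-- GROUP ... ALL collects every row into one group so the aggregates see the whole dataset.
all_rows = GROUP durations ALL;
stats = FOREACH all_rows GENERATE
    AVG(durations.minutes_on) AS avg_minutes,
    MAX(durations.minutes_on) AS max_minutes,
    MIN(durations.minutes_on) AS min_minutes;

DUMP stats;
```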
This way, we can perform analysis on the filtered result and, with the help of Pig, get answers from a large set of data in a matter of minutes.