Apache Pig is a high-level platform for processing big data on Hadoop. Its scripting language is known as Pig Latin. In this blog, we will see web log analysis using Apache Pig. Before we proceed, here is a brief introduction to Apache Pig. Pig can run in two different modes:
MapReduce mode: Data is loaded from HDFS, and Pig translates the transformations in the script into one or more MapReduce jobs that run in the backend.
Local mode: Generally used for testing a script. Data is loaded from the local file system and nothing is submitted to a Hadoop cluster, which makes testing fast.
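Pig Latin is a procedural, data-flow language that uses lazy evaluation: statements are only executed when an output is requested, for example with DUMP or STORE. This makes it well suited to ETL (Extract, Transform, Load) work. Pig also has a rich set of library functions for loading web logs. They can all be found in the Piggybank jar, which comes bundled with Apache Pig.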
We will use CombinedLogLoader() to load logs.
The logs used are in the Combined Log Format. Here is a sample record:
127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
Some of the columns of the Combined Log Format (web log format) are described below:
Ipaddress: (127.0.0.1) IP address of the client (remote host).
Logname: (-) The hyphen indicates that the requested piece of information is not available.
Userid: (-) Not available in this record.
Timestamp: (10/Oct/2000:13:55:36 -0700) Time at which the server finished processing the request.
Request: (GET /apache_pb.gif HTTP/1.0) Request line sent by the client: the method (here GET), the requested resource, and the protocol.
Page link: (http://www.example.com/start.html) Web page from which the client made the request (the referrer).
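To make the column layout concrete, here is a minimal plain-Python sketch that splits the sample record into the fields described above. It is only an illustration of the format, not what CombinedLogLoader() does internally; the regular expression and the field names (ipaddress, page_link, and so on) are our own assumptions.

```python
import re

# Hypothetical pattern for one Combined Log Format record.
# The group names are labels chosen for this sketch, not Piggybank names.
COMBINED_LOG = re.compile(
    r'(?P<ipaddress>\S+) (?P<logname>\S+) (?P<userid>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<page_link>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one record as a dict, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None


sample = (
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" '
    '"Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

record = parse_line(sample)
print(record["timestamp"])  # 10/Oct/2000:13:55:36 -0700
print(record["page_link"])  # http://www.example.com/start.html
```

Lines that do not follow the format return None here, so a caller can skip malformed records instead of failing on them.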
Download the dataset from here.
PROBLEM STATEMENT 1:
Find out the most viewed page
Below are the steps:
Step 1) First and foremost, we have to register the Piggybank jar to use its classes.
Step 2) Next, load the data using CombinedLogLoader() and specify the schema.
Step 3) Group the data by page link to count the page hits of each unique link.
Step 4) For each group (one group per link), generate the link and its total count. Here, we have used flatten() to un-nest the tuples and then COUNT to count the hits.
Step 5) Once the counts are available, order them in descending order and keep only the first result.
Step 6) Use DUMP to get the desired result.
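The group, count, order, and limit logic of the steps above can be sketched in plain Python. This is an analogue of the data flow, not the Pig script itself; the function name most_viewed and the sample links are hypothetical and only serve as an illustration.

```python
from collections import Counter


def most_viewed(page_links):
    """Return (link, hits) for the link with the most hits, or None if empty."""
    # Steps 3 and 4: group by link and count the records in each group.
    hits = Counter(page_links)
    # Step 5: order by count, descending, and keep only the first result.
    ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
    return ranked[0] if ranked else None


# Hypothetical page links, one per log record (not taken from the dataset).
links = [
    "http://www.example.com/start.html",
    "http://www.example.com/about.html",
    "http://www.example.com/start.html",
]

# Step 6: show the result.
print(most_viewed(links))
```

When two links have the same count, the sort keeps whichever appeared first in the input, so a real analysis may want an explicit tie-breaking rule.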
PROBLEM STATEMENT 2:
Find total hits per unique day:
For each unique day, we need to find the total number of hits. For example, there may be X hits on the 24th of a particular month and Y hits on the 27th.
We assume that all the logs belong to a single month.
To solve this problem, we have to use DateExtractor(), which is available in the Piggybank jar. It takes the timestamp as input and returns the corresponding day for each timestamp.
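The idea behind extracting the day from a log timestamp can be illustrated with a short Python sketch. It is not the Piggybank implementation: the function name extract_day and the yyyy-MM-dd style output are assumptions made for this example, and the date is taken as written in the log, in the record's own UTC offset.

```python
from datetime import datetime

# Layout of the timestamp column in the Combined Log Format,
# for example 10/Oct/2000:13:55:36 -0700.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def extract_day(timestamp, out_format="%Y-%m-%d"):
    """Turn a log timestamp into a day string (hypothetical helper)."""
    parsed = datetime.strptime(timestamp, LOG_TIME_FORMAT)
    return parsed.strftime(out_format)


print(extract_day("10/Oct/2000:13:55:36 -0700"))        # 2000-10-10
print(extract_day("10/Oct/2000:13:55:36 -0700", "%d"))  # 10
```

Because the logs are assumed to cover a single month, the day of the month alone ("%d") would be enough as a grouping key. The full date is the safer choice if that assumption does not hold.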
Step 1) Define DateExtractor() in the Pig Grunt shell with a DEFINE statement.
Step 2) Use the class defined above to extract the day from each timestamp, and group the records by it.
Step 3) Count the records in each group to get the total hits per unique day.
Step 4) Dump the result and see the output.
The first column of the output is the date, and the second is the total number of hits on that day.
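As a rough plain-Python analogue of these steps (not the Pig script itself), the sketch below extracts the day from each timestamp, groups by it, and counts the hits. The function name hits_per_day and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def hits_per_day(timestamps):
    """Return a list of (day, total_hits) pairs, sorted by day."""
    days = (
        # Extract the day from each Combined Log Format timestamp.
        datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").strftime("%Y-%m-%d")
        for ts in timestamps
    )
    # Group by day and count the records in each group.
    return sorted(Counter(days).items())


# Hypothetical timestamps, one per log record (not taken from the dataset).
sample_timestamps = [
    "24/Oct/2000:09:15:02 -0700",
    "24/Oct/2000:17:40:55 -0700",
    "27/Oct/2000:11:05:30 -0700",
]

for day, total in hits_per_day(sample_timestamps):
    print(day, total)
```

As in the Pig output, the first value of each pair is the date and the second is the total number of hits on that day.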
Hope this post was helpful to you in performing web log analysis using Apache Pig.