Performance Analysis of Tez

July 14

In this post, we run a Pig script and a Hive query on both the YARN engine (the default MapReduce execution engine running on YARN) and the Tez engine, and compare the run times to see which is faster.

Pig Script on Tez and YARN

Let us write a Pig script for a dictionary called AFINN, in which 2477 words are rated from -5 to +5 based on each word's meaning. The script counts how many positive words (rated 0 to 5) and negative words (rated -5 to -1) the dictionary contains.
The Pig script for counting the negative and positive words in the dictionary is shown below:

A = LOAD '/AFINN.txt' USING PigStorage() AS (name:chararray,rating:int);
B = FOREACH A GENERATE name,rating,(rating>=0?'positive':'negative') as term:chararray;
C = GROUP B by term;
D = FOREACH C GENERATE group,COUNT(B.term);
STORE D INTO '/AFINN/';

We save the output of the script in the HDFS directory /AFINN/yarn/ for the YARN run and in /AFINN/tez/ for the Tez run, changing the STORE path before each run. Save the above Pig script in a file named dictionary.pig.
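The logic of the Pig script can also be sketched in plain Python, which is handy for checking the expected counts on a few rows before submitting a job. This is only an illustration under stated assumptions: the function name count_terms and the sample rows are made up for the example (they are not entries taken from AFINN), and the input is assumed to be tab-separated, matching the default PigStorage() delimiter.

```python
from collections import Counter


def count_terms(lines):
    """Count rows rated >= 0 as 'positive' and rows rated < 0 as 'negative'."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Each row is assumed to be: word<TAB>rating
        name, rating = line.rsplit("\t", 1)
        term = "positive" if int(rating) >= 0 else "negative"
        counts[term] += 1
    return dict(counts)


# Hypothetical sample rows, not real dictionary entries.
sample = ["good\t3", "bad\t-3", "fine\t0", "awful\t-4"]
print(count_terms(sample))
```

To run it over the real file, pass an open file handle instead of the sample list, for example count_terms(open("AFINN.txt")).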

Pig on YARN

Let’s run the above script using the YARN engine and note the time taken.

kiran@ACD-KIRAN:~/Desktop$ pig dictionary.pig
16/02/01 19:16:44 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/02/01 19:16:44 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/02/01 19:16:44 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2016-02-01 19:16:44,807 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2016-02-01 19:16:44,807 [main] INFO org.apache.pig.Main - Logging error messages to: /home/kiran/Desktop/pig_1454334404805.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/kiran/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/kiran/tez/tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.7.1 0.15.0 kiran 2016-02-01 19:16:47 2016-02-01 19:17:08 GROUP_BY
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_1454067435808_0016 1 1 3 3 3 3 2 2 2 2 A,B,C,D GROUP_BY,COMBINER /AFINN/yarn,
Input(s):
Successfully read 2477 records (28452 bytes) from: "/AFINN.txt"
Output(s):
Successfully stored 2 records (27 bytes) in: "/AFINN/yarn"
Counters:
Total records written : 2
Total bytes written : 27
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1454067435808_0016
2016-02-01 19:17:08,640 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-02-01 19:17:08,643 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-02-01 19:17:08,675 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-02-01 19:17:08,678 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-02-01 19:17:08,714 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-02-01 19:17:08,718 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-02-01 19:17:08,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2016-02-01 19:17:08,801 [main] INFO org.apache.pig.Main - Pig script completed in 24 seconds and 88 milliseconds (24088 ms)
kiran@ACD-KIRAN:~/Desktop$

We can see that the YARN engine took 24 seconds and 88 milliseconds (24088 ms) to complete this job. Now, let us run the same script using the Tez engine.

Pig on TEZ

The command for running Pig on the Tez engine is as follows:

pig -x tez dictionary.pig
kiran@ACD-KIRAN:~/Desktop$ pig -x tez dictionary.pig
16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : TEZ_LOCAL
16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : TEZ
16/02/01 19:19:25 INFO pig.ExecTypeProvider: Picked TEZ as the ExecType
2016-02-01 19:19:25,884 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2016-02-01 19:19:25,884 [main] INFO org.apache.pig.Main - Logging error messages to: /home/kiran/Desktop/pig_1454334565883.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/kiran/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/kiran/tez/tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016-02-01 19:19:40,239 [PigTezLauncher-0] INFO org.apache.tez.common.counters.Limits - Counter limits initialized with parameters: GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=120
2016-02-01 19:19:40,242 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=SUCCEEDED, progress=TotalTasks: 2 Succeeded: 2 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=Counters: 56
org.apache.tez.common.counters.DAGCounter
NUM_SUCCEEDED_TASKS=2
TOTAL_LAUNCHED_TASKS=2
DATA_LOCAL_TASKS=1
AM_CPU_MILLISECONDS=1040
AM_GC_TIME_MILLIS=0
File System Counters
FILE_BYTES_READ=146
FILE_BYTES_WRITTEN=82
FILE_READ_OPS=0
FILE_LARGE_READ_OPS=0
FILE_WRITE_OPS=0
HDFS_BYTES_READ=28094
HDFS_BYTES_WRITTEN=27
HDFS_READ_OPS=4
HDFS_LARGE_READ_OPS=0
HDFS_WRITE_OPS=2
org.apache.tez.common.counters.TaskCounter
REDUCE_INPUT_GROUPS=2
REDUCE_INPUT_RECORDS=2
COMBINE_INPUT_RECORDS=0
SPILLED_RECORDS=4
NUM_SHUFFLED_INPUTS=1
NUM_SKIPPED_INPUTS=0
NUM_FAILED_SHUFFLE_INPUTS=0
MERGED_MAP_OUTPUTS=1
GC_TIME_MILLIS=140
CPU_MILLISECONDS=3480
PHYSICAL_MEMORY_BYTES=353894400
VIRTUAL_MEMORY_BYTES=1667567616
COMMITTED_HEAP_BYTES=353894400
INPUT_RECORDS_PROCESSED=2477
OUTPUT_RECORDS=2479
OUTPUT_BYTES=39632
OUTPUT_BYTES_WITH_OVERHEAD=46
OUTPUT_BYTES_PHYSICAL=50
ADDITIONAL_SPILLS_BYTES_WRITTEN=0
ADDITIONAL_SPILLS_BYTES_READ=50
ADDITIONAL_SPILL_COUNT=0
SHUFFLE_CHUNK_COUNT=1
SHUFFLE_BYTES=50
SHUFFLE_BYTES_DECOMPRESSED=46
SHUFFLE_BYTES_TO_MEM=0
SHUFFLE_BYTES_TO_DISK=0
SHUFFLE_BYTES_DISK_DIRECT=50
NUM_MEM_TO_DISK_MERGES=0
NUM_DISK_TO_DISK_MERGES=0
SHUFFLE_PHASE_TIME=160
MERGE_PHASE_TIME=172
FIRST_EVENT_RECEIVED=153
LAST_EVENT_RECEIVED=153
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
org.apache.hadoop.mapreduce.TaskCounter
COMBINE_INPUT_RECORDS=2
COMBINE_OUTPUT_RECORDS=2477
2016-02-01 19:19:40,267 [PigTezLauncher-0] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-02-01 19:19:41,054 [main] INFO org.apache.pig.tools.pigstats.tez.TezPigScriptStats - Script Statistics:
HadoopVersion: 2.7.1
PigVersion: 0.15.0
TezVersion: 0.8.1-alpha
UserId: kiran
FileName: dictionary.pig
StartedAt: 2016-02-01 19:19:28
FinishedAt: 2016-02-01 19:19:41
Features: GROUP_BY
Success!
DAG PigLatin:dictionary.pig-0_scope-0:
ApplicationId: job_1454067435808_0017
TotalLaunchedTasks: 2
FileBytesRead: 146
FileBytesWritten: 82
HdfsBytesRead: 28094
HdfsBytesWritten: 27
Input(s):
Successfully read 2477 records (28094 bytes) from: "/AFINN.txt"
Output(s):
Successfully stored 2 records (27 bytes) in: "/AFINN/tez"
2016-02-01 19:19:41,072 [main] INFO org.apache.pig.Main - Pig script completed in 15 seconds and 295 milliseconds (15295 ms)
2016-02-01 19:19:41,072 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - Shutting down thread pool
2016-02-01 19:19:41,085 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Shutting down Tez session org.apache.tez.client.TezClient@493238ed
2016-02-01 19:19:41,086 [Thread-15] INFO org.apache.tez.client.TezClient - Shutting down Tez Session, sessionName=PigLatin:dictionary.pig, applicationId=application_1454067435808_0017
kiran@ACD-KIRAN:~/Desktop$

We can see that Tez completed the job in just 15 seconds and 295 milliseconds (15295 ms), about 8.8 seconds less than the YARN run.
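To put a number on the difference, the completion lines from the two Pig runs can be parsed and compared. The sketch below is one possible way to do that: the two log lines are copied from the runs above, while the helper name elapsed_ms and the regular expression are assumptions made for this example.

```python
import re

# Matches the millisecond total in Pig's final "completed in ..." line.
PIG_DONE = re.compile(r"Pig script completed in .*\((\d+) ms\)")


def elapsed_ms(log_line):
    """Return the run time in milliseconds reported by a Pig completion line."""
    match = PIG_DONE.search(log_line)
    if match is None:
        raise ValueError("no Pig completion time found in line")
    return int(match.group(1))


yarn_line = ("2016-02-01 19:17:08,801 [main] INFO org.apache.pig.Main - "
             "Pig script completed in 24 seconds and 88 milliseconds (24088 ms)")
tez_line = ("2016-02-01 19:19:41,072 [main] INFO org.apache.pig.Main - "
            "Pig script completed in 15 seconds and 295 milliseconds (15295 ms)")

yarn_ms = elapsed_ms(yarn_line)
tez_ms = elapsed_ms(tez_line)
speedup = yarn_ms / tez_ms              # how many times faster Tez was
saved = (yarn_ms - tez_ms) / yarn_ms    # fraction of the YARN run time saved

print(f"speedup: {speedup:.2f}x, time saved: {saved:.1%}")
# prints: speedup: 1.57x, time saved: 36.5%
```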


HIVE ON YARN and TEZ

Here we create a Hive table, load the dictionary dataset into it, and run a Hive query that counts the positive and negative words in the dictionary.

The table creation and the loading of the dataset are shown below:

hive> create external table dictionary_yarn(name string,rating INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.507 seconds
hive> LOAD DATA INPATH '/AFINN.txt' into table dictionary_yarn;
Loading data to table default.dictionary_yarn
Table default.dictionary_yarn stats: [numFiles=1, numRows=0, totalSize=28094, rawDataSize=0]
OK
Time taken: 0.195 seconds
hive>

HIVE ON YARN

Let’s run the query for counting the number of positive and negative words in the dictionary on the YARN engine.

hive> select sum(case when rating >= 0 then 1 else 0 end) as positive,sum(case when rating < 0 then 1 else 0 end) as negative from dictionary_yarn;
Query ID = kiran_20160201195817_827eba29-f2ce-47cd-b491-3c4da6e5d0b2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1454067435808_0023, Tracking URL = http://ACD-KIRAN:8088/proxy/application_1454067435808_0023/
Kill Command = /home/kiran/hadoop-2.7.1/bin/hadoop job -kill job_1454067435808_0023
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-02-01 19:58:23,335 Stage-1 map = 0%, reduce = 0%
2016-02-01 19:58:28,518 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.6 sec
2016-02-01 19:58:33,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.04 sec
MapReduce Total cumulative CPU time: 3 seconds 40 msec
Ended Job = job_1454067435808_0023
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.04 sec HDFS Read: 36808 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 40 msec
OK
879 1598
Time taken: 18.282 seconds, Fetched: 1 row(s)
hive>

You can see that Hive on YARN took 18.282 seconds.
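The conditional aggregation used in the query (a SUM over a CASE expression for each class) can be tried without a cluster. The sketch below runs the same SQL shape against an in-memory SQLite database from Python. It is an illustration only: SQLite stands in for Hive, the table name dictionary is chosen for the example, and the four rows are made-up sample data rather than AFINN entries.

```python
import sqlite3

# In-memory database standing in for the Hive table (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dictionary (name TEXT, rating INTEGER)")

# Hypothetical sample rows, not real dictionary entries.
rows = [("good", 3), ("bad", -3), ("fine", 0), ("awful", -4)]
conn.executemany("INSERT INTO dictionary VALUES (?, ?)", rows)

# Same aggregation as the Hive query: one pass, two conditional sums.
positive, negative = conn.execute(
    "SELECT SUM(CASE WHEN rating >= 0 THEN 1 ELSE 0 END) AS positive, "
    "SUM(CASE WHEN rating < 0 THEN 1 ELSE 0 END) AS negative "
    "FROM dictionary"
).fetchone()
conn.close()

print(positive, negative)  # prints: 2 2
```

Because both sums are computed in a single scan with no GROUP BY, the query needs only one map stage and one reduce stage, which matches the single job seen in the log above.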

HIVE ON TEZ

Now, let’s run the same query on the Tez engine.
To make a Hive query run on Tez, we need to set the execution engine explicitly using the command below:

set hive.execution.engine=tez;
hive> set hive.execution.engine=tez;
hive> select sum(case when rating >= 0 then 1 else 0 end) as positive,sum(case when rating < 0 then 1 else 0 end) as negative from dictionary_yarn;
Query ID = kiran_20160201200130_a5c56388-26f5-48dd-a925-26c5e1d7e2b8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1454067435808_0024)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 5.48 s
--------------------------------------------------------------------------------
OK
879 1598
Time taken: 10.177 seconds, Fetched: 1 row(s)
hive>

We can see that Hive on Tez took 10.177 seconds to run the same query.
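For repeated comparisons, the same query can be launched non-interactively under each engine and timed from a script. The sketch below builds the hive command line for both engines using the --hiveconf and -e options of the Hive CLI, where mr is Hive's name for the MapReduce engine. The helper names hive_command and timed_run are assumptions made for this example, and timed_run is defined but not called here because it needs a working Hive installation.

```python
import shlex
import subprocess
import time

QUERY = ("select sum(case when rating >= 0 then 1 else 0 end) as positive, "
         "sum(case when rating < 0 then 1 else 0 end) as negative "
         "from dictionary_yarn")


def hive_command(engine, query=QUERY):
    """Build a hive CLI invocation that runs `query` on the given engine."""
    if engine not in ("mr", "tez"):
        raise ValueError("engine must be 'mr' or 'tez'")
    return ["hive", "--hiveconf", f"hive.execution.engine={engine}", "-e", query]


def timed_run(cmd):
    """Run a command and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


# Print the commands that would be run; call timed_run(cmd) on a real cluster.
for engine in ("mr", "tez"):
    print(shlex.join(hive_command(engine)))
```

Wall-clock times measured this way include Hive CLI startup, so they will be somewhat higher than the "Time taken" figure Hive prints for the query itself.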
We can check whether a job ran on the YARN or the Tez engine in the ResourceManager’s web UI, which in this setup is available at localhost:8088.

In the above screenshot, we can see each job's application ID and its application type. The application type shows the engine on which the job ran. The screenshot lists the application IDs and engines for the Hive queries we ran earlier.
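The same information is exposed by the ResourceManager REST API under /ws/v1/cluster/apps, so the check can be scripted. The sketch below only parses a response of that shape. The helper name engines_by_app is an assumption, and the sample payload is hand-written for this example: it is reduced to the two fields used here and pairs the application IDs of the two Hive runs above with the types expected for them.

```python
import json


def engines_by_app(apps_json):
    """Map each application id to its applicationType from an RM apps response."""
    payload = json.loads(apps_json)
    # The ResourceManager returns {"apps": null} when there are no applications.
    apps = (payload.get("apps") or {}).get("app", [])
    return {app["id"]: app["applicationType"] for app in apps}


# Hand-written sample in the shape of GET http://localhost:8088/ws/v1/cluster/apps
sample = json.dumps({"apps": {"app": [
    {"id": "application_1454067435808_0023", "applicationType": "MAPREDUCE"},
    {"id": "application_1454067435808_0024", "applicationType": "TEZ"},
]}})

for app_id, engine in engines_by_app(sample).items():
    print(app_id, engine)
```

On a live cluster, the JSON text could be fetched with urllib.request.urlopen from the URL shown in the comment and passed to the same function.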
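From these runs we can say that the Tez engine is faster than the YARN engine: the Pig script finished in about 15.3 seconds on Tez versus 24.1 seconds on YARN, and the Hive query in 10.177 seconds versus 18.282 seconds.
We hope this post has given you a clear picture of running Pig scripts and Hive queries on both the YARN and Tez engines and comparing their performance.
Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.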
