
IPL Matches Data Analysis Using Spark

Let’s mine IPL match data and derive some useful insights from it, such as which stadium best suits batting first and which best suits bowling first. You can download the data from the link below.

https://drive.google.com/open?id=0ByJLBTmJojjzRm5vX1E2cURtTTQ

Here is the data set description:

id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3

1. Which stadium is best suited to batting first

Here we work out which stadium is most suitable for batting first. The two result columns tell us who won:

win_by_runs means: the team batted first and won the match by a margin of that many runs.

win_by_wickets means: the team batted second and chased the target successfully.
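
The two definitions above can be turned into a small classifier. This is a minimal sketch in plain Scala, not code from the original post: the `MatchResult` object, the `Outcome` names and the `classify` helper are hypothetical, and it assumes that a match with both margins at 0 had no outright winner.

```scala
// Hypothetical helper for illustration; not part of the original analysis.
object MatchResult {
  sealed trait Outcome
  case object BatFirstWon extends Outcome // win_by_runs > 0
  case object ChaseWon extends Outcome    // win_by_wickets > 0
  case object NoWinner extends Outcome    // assumption: both margins are 0

  // Takes the raw string values of the win_by_runs and win_by_wickets columns.
  // Values that are not integers are treated as 0.
  def classify(winByRuns: String, winByWickets: String): Outcome = {
    val runs = scala.util.Try(winByRuns.trim.toInt).getOrElse(0)
    val wickets = scala.util.Try(winByWickets.trim.toInt).getOrElse(0)
    if (runs > 0) BatFirstWon
    else if (wickets > 0) ChaseWon
    else NoWinner
  }

  def main(args: Array[String]): Unit = {
    println(classify("35", "0")) // BatFirstWon
    println(classify("0", "7"))  // ChaseWon
    println(classify("0", "0"))  // NoWinner
  }
}
```

The Spark code in this post uses the simpler string test `!= "0"` on one column at a time, which gives the same split as long as every value in that column is a clean integer.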

So we extract the columns toss_decision, win_by_runs, win_by_wickets and venue. From these we filter out the records whose win_by_runs value is 0, which leaves the matches won by the team batting first. Here is the Scala code to do that.

val data = sc.textFile("file:///home/kiran/Documents/datasets/matches.csv")
val filtering_bad_records = data.map(line=>line.split(",")).filter(x=>x.length<19)
val extracting_columns = filtering_bad_records.map(x=>(x(7),x(11),x(12),x(14)))
// Keep bat_first_won as an RDD of (venue, wins) so that it can be joined later
val bat_first_won = extracting_columns.filter(x=>x._2!="0").map(x=>(x._4,1)).reduceByKey(_+_)
// Swap to (wins, venue) only for printing in descending order of wins
bat_first_won.map(item => item.swap).sortByKey(false).collect.foreach(println)

Code Explanation

In the first line of code we are loading the data from the local file system.

In the second line we split each record on commas and drop the bad records, if there are any: the data set has 18 columns, so only the records that split into fewer than 19 fields are kept.

In the third line we are extracting the columns that are required for our analysis, i.e., toss_decision, win_by_runs, win_by_wickets and venue.

In the remaining code we keep the records whose win_by_runs value is not 0, build a key-value pair of the venue and the number 1, and add up the 1s per venue to count the matches won by batting first at that stadium. Finally we sort the counts in descending order and print all of them.
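
The filter, pair, count and sort steps described above can be tried without a cluster. The following is a minimal sketch using plain Scala collections as a stand-in for the RDD, with `groupBy` playing the role of `reduceByKey`; the `VenueWinCounts` object and the sample rows are invented for illustration and are not taken from matches.csv.

```scala
object VenueWinCounts {
  // Each row is (toss_decision, win_by_runs, win_by_wickets, venue), all strings,
  // mirroring the tuple built by extracting_columns in the post.
  type Row = (String, String, String, String)

  def batFirstWins(rows: Seq[Row]): Seq[(Int, String)] =
    rows
      .filter(_._2 != "0")   // keep matches won by a margin of runs
      .map(r => (r._4, 1))   // (venue, 1)
      .groupBy(_._1)         // stand-in for reduceByKey: group the 1s per venue
      .toSeq                 // leave the Map before re-keying, so equal counts do not collide
      .map { case (venue, ones) => (ones.map(_._2).sum, venue) }
      .sortBy { case (wins, venue) => (-wins, venue) } // descending by wins

  // Invented sample rows, for illustration only.
  val sample: Seq[Row] = Seq(
    ("field", "0", "6", "Eden Gardens"),
    ("bat", "24", "0", "Wankhede Stadium"),
    ("bat", "10", "0", "Wankhede Stadium"),
    ("field", "5", "0", "Eden Gardens"),
    ("bat", "0", "4", "Kingsmead")
  )

  def main(args: Array[String]): Unit =
    batFirstWins(sample).foreach(println)
    // (2,Wankhede Stadium)
    // (1,Eden Gardens)
}
```

Kingsmead does not appear in the result because its only sample match was won by wickets, which is the same reason some venues are missing from the bat-first list in the real output.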

Output

(30,"MA Chidambaram Stadium)
(25,Wankhede Stadium)
(24,M Chinnaswamy Stadium)
(24,Feroz Shah Kotla)
(22,Eden Gardens)
(15,"Punjab Cricket Association Stadium)
(14,"Rajiv Gandhi International Stadium)
(11,Subrata Roy Sahara Stadium)
(10,Sawai Mansingh Stadium)
(9,Kingsmead)
(7,Dr DY Patil Sports Academy)
(7,Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium)
(6,"Sardar Patel Stadium)
(6,Brabourne Stadium)
(5,Himachal Pradesh Cricket Association Stadium)
(4,Newlands)
(4,SuperSport Park)
(4,Barabati Stadium)
(3,St George's Park)
(3,Sheikh Zayed Stadium)
(3,New Wanderers Stadium)
(3,Nehru Stadium)
(3,Maharashtra Cricket Association Stadium)
(3,"Punjab Cricket Association IS Bindra Stadium)
(3,Dubai International Cricket Stadium)
(2,Shaheed Veer Narayan Singh International Stadium)
(2,"Vidarbha Cricket Association Stadium)
(2,JSCA International Stadium Complex)
(2,Buffalo Park)
(2,Sharjah Cricket Stadium)
(1,De Beers Diamond Oval)
(1,OUTsurance Oval)
(1,Saurashtra Cricket Association Stadium)
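
Several venue names in this output begin with a stray double quote. A likely cause, stated here as an assumption, is that those venue fields are quoted in the CSV because they contain a comma, and `split(",")` cuts them in two. The sketch below shows a quote-aware splitter in plain Scala; `CsvSplit` and `splitCsv` are hypothetical names, the sample line is invented, and escaped quotes inside a field are not handled.

```scala
object CsvSplit {
  // Splits one CSV line on the commas that sit outside double quotes.
  // Sketch only: escaped quotes ("") inside a quoted field are not supported.
  def splitCsv(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    for (ch <- line) {
      if (ch == '"') inQuotes = !inQuotes // toggle state, drop the quote itself
      else if (ch == ',' && !inQuotes) {  // field boundary
        fields += current.toString
        current.clear()
      } else current.append(ch)
    }
    fields += current.toString            // last field, kept even when empty
    fields.result()
  }

  def main(args: Array[String]): Unit = {
    // Invented line for illustration; not copied from matches.csv.
    val line = "7,\"MA Chidambaram Stadium, Chepauk\",umpireA"
    line.split(",").foreach(println) // 4 pieces, the second starts with a quote
    splitCsv(line).foreach(println)  // 3 fields, the venue stays whole
  }
}
```

Unlike `String.split`, this splitter also keeps trailing empty fields, so a record with an empty last column still has the full number of fields.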

As of now, MA Chidambaram Stadium tops the list, with 30 matches won by the team batting first.


But this is not the final result: we also need the total number of matches that MA Chidambaram Stadium has hosted.

Let us see how many matches each stadium has hosted. Here is the code to do that.

val data = sc.textFile("file:///home/kiran/Documents/datasets/matches.csv")
val filtering_bad_records1 = data.map(line=>line.split(",")).filter(x=>x.length<19)
// Keep total_matches_per_venue as an RDD of (venue, matches) so that it can be joined later
val total_matches_per_venue = filtering_bad_records1.map(x=>(x(14),1)).reduceByKey(_+_)
total_matches_per_venue.map(item => item.swap).sortByKey(false).collect.foreach(println)

Here is the total number of matches each stadium has hosted.

Output

(58,M Chinnaswamy Stadium)
(54,Eden Gardens)
(53,Feroz Shah Kotla)
(49,Wankhede Stadium)
(48,"MA Chidambaram Stadium)
(41,"Rajiv Gandhi International Stadium)
(35,"Punjab Cricket Association Stadium)
(33,Sawai Mansingh Stadium)
(17,Dr DY Patil Sports Academy)
(17,Subrata Roy Sahara Stadium)
(15,Kingsmead)
(12,"Sardar Patel Stadium)
(12,SuperSport Park)
(11,Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium)
(11,Brabourne Stadium)
(9,Himachal Pradesh Cricket Association Stadium)
(8,New Wanderers Stadium)
(8,Maharashtra Cricket Association Stadium)
(7,Newlands)
(7,St George's Park)
(7,Sheikh Zayed Stadium)
(7,JSCA International Stadium Complex)
(7,"Punjab Cricket Association IS Bindra Stadium)
(7,Dubai International Cricket Stadium)
(7,Barabati Stadium)
(6,Shaheed Veer Narayan Singh International Stadium)
(6,Sharjah Cricket Stadium)
(5,Nehru Stadium)
(5,Saurashtra Cricket Association Stadium)
(3,De Beers Diamond Oval)
(3,"Vidarbha Cricket Association Stadium)
(3,Buffalo Park)
(2,OUTsurance Oval)
(2,Green Park)
(2,Holkar Cricket Stadium)

MA Chidambaram Stadium has hosted 48 matches in total, of which 30 were won by the team batting first.

So we will now compute, for each stadium, the percentage of matches won by the team batting first. Here is the code to do that.

// Both RDDs must be keyed by venue: (venue, wins) joined with (venue, matches)
val join = bat_first_won.join(total_matches_per_venue).map(x=>(x._1,(x._2._1*100/x._2._2)))
join.map(item => item.swap).sortByKey(false).collect.foreach(println)

Here we have joined the two RDDs, bat_first_won and total_matches_per_venue, on the venue and computed the bat-first win percentage: the number of matches won by batting first, times 100, divided by the total number of matches at that venue. Both values are integers, so the division truncates (30*100/48 gives 62, not 62.5).
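
The percentage arithmetic is easy to check on its own. The sketch below, with hypothetical helper names `truncated` and `rounded1`, compares the integer formula used in the join with a variant that keeps one decimal place; the one-decimal variant is only an illustration and is not what produced the output in this post.

```scala
object WinPercentage {
  // Integer arithmetic, as in the join: the fractional part is dropped.
  def truncated(wins: Int, total: Int): Int = wins * 100 / total

  // Alternative for illustration: floating-point, rounded to one decimal place.
  def rounded1(wins: Int, total: Int): Double =
    math.round(wins * 1000.0 / total) / 10.0

  def main(args: Array[String]): Unit = {
    println(truncated(30, 48)) // 62
    println(rounded1(30, 48))  // 62.5
    println(truncated(2, 3))   // 66
    println(rounded1(2, 3))    // 66.7
  }
}
```

Multiplying by 100 before dividing matters with integers: `wins / total * 100` would give 0 for every venue whose wins are fewer than its total matches.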

Here are the bat-first win percentages for each venue.

Output

(66,"Vidarbha Cricket Association Stadium)
(66,Buffalo Park)
(64,Subrata Roy Sahara Stadium)
(63,Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium)
(62,"MA Chidambaram Stadium)
(60,Kingsmead)
(60,Nehru Stadium)
(57,Newlands)
(57,Barabati Stadium)
(55,Himachal Pradesh Cricket Association Stadium)
(54,Brabourne Stadium)
(51,Wankhede Stadium)
(50,"Sardar Patel Stadium)
(50,OUTsurance Oval)
(45,Feroz Shah Kotla)
(42,St George's Park)
(42,Sheikh Zayed Stadium)
(42,"Punjab Cricket Association IS Bindra Stadium)
(42,Dubai International Cricket Stadium)
(42,"Punjab Cricket Association Stadium)
(41,Dr DY Patil Sports Academy)
(41,M Chinnaswamy Stadium)
(40,Eden Gardens)
(37,New Wanderers Stadium)
(37,Maharashtra Cricket Association Stadium)
(34,"Rajiv Gandhi International Stadium)
(33,Shaheed Veer Narayan Singh International Stadium)
(33,De Beers Diamond Oval)
(33,SuperSport Park)
(33,Sharjah Cricket Stadium)
(30,Sawai Mansingh Stadium)
(28,JSCA International Stadium Complex)
(20,Saurashtra Cricket Association Stadium)
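
This list has 33 venues while the total-matches list has 35: Green Park and Holkar Cricket Stadium are missing because they have no bat-first wins, and Spark's `join` on pair RDDs is an inner join that keeps only the keys present on both sides. The sketch below mimics that behaviour with plain Scala `Map`s as a stand-in for the RDDs; `JoinSketch` and its method names are hypothetical, and the numbers are taken from the outputs above.

```scala
object JoinSketch {
  // Inner join: keeps only venues that appear in both maps, like RDD.join.
  def innerJoin(wins: Map[String, Int], totals: Map[String, Int]): Map[String, (Int, Int)] =
    wins.collect { case (venue, w) if totals.contains(venue) => venue -> ((w, totals(venue))) }

  // Alternative: start from the totals and treat a missing win count as 0,
  // so venues with no bat-first wins are kept.
  def keepAllVenues(wins: Map[String, Int], totals: Map[String, Int]): Map[String, (Int, Int)] =
    totals.map { case (venue, t) => venue -> ((wins.getOrElse(venue, 0), t)) }

  // Counts copied from the outputs in this post.
  val batFirstWins = Map("Wankhede Stadium" -> 25, "Buffalo Park" -> 2)
  val totalMatches = Map("Wankhede Stadium" -> 49, "Buffalo Park" -> 3, "Green Park" -> 2)

  def main(args: Array[String]): Unit = {
    println(innerJoin(batFirstWins, totalMatches).keySet.contains("Green Park"))  // false
    println(keepAllVenues(batFirstWins, totalMatches)("Green Park"))              // (0,2)
  }
}
```

In Spark itself, a `leftOuterJoin` starting from the totals RDD would be the closest equivalent of the second variant.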

Vidarbha Cricket Association Stadium stands in the first place, but the total number of matches held there was only 3.

The top 4 stadiums by bat-first win percentage have each hosted fewer than 20 matches, whereas MA Chidambaram Stadium has hosted 48. Given that much larger sample, we can deduce that MA Chidambaram Stadium is the most suitable for batting first in the IPL.

In a similar way, let us find the stadium that best supports bowling first.
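
The reasoning above, ignoring venues with too few matches before ranking by percentage, can be written down as a filter. This is a minimal sketch in plain Scala: `RankWithThreshold`, `VenueStat` and the cut-off of 20 matches are assumptions chosen to match the wording above, and the sample figures are copied from the outputs in this post.

```scala
object RankWithThreshold {
  final case class VenueStat(venue: String, wins: Int, matches: Int) {
    def pct: Int = wins * 100 / matches // truncated, as in the join
  }

  // Drops venues with fewer than minMatches matches, then ranks by percentage.
  def rank(stats: Seq[VenueStat], minMatches: Int): Seq[VenueStat] =
    stats.filter(_.matches >= minMatches).sortBy(s => (-s.pct, s.venue))

  // (venue, bat-first wins, total matches) taken from the outputs above.
  val sample: Seq[VenueStat] = Seq(
    VenueStat("Vidarbha Cricket Association Stadium", 2, 3),
    VenueStat("Buffalo Park", 2, 3),
    VenueStat("Subrata Roy Sahara Stadium", 11, 17),
    VenueStat("MA Chidambaram Stadium", 30, 48),
    VenueStat("Wankhede Stadium", 25, 49),
    VenueStat("Eden Gardens", 22, 54)
  )

  def main(args: Array[String]): Unit =
    rank(sample, 20).foreach(s => println(s"${s.pct}% ${s.venue} (${s.wins}/${s.matches})"))
    // 62% MA Chidambaram Stadium (30/48)
    // 51% Wankhede Stadium (25/49)
    // 40% Eden Gardens (22/54)
}
```

The cut-off is a judgment call; a different threshold can change which venue comes out on top.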


2. Which stadium is best suited to bowling first

Here we work out which stadium is most suitable for bowling first. As before, the two result columns tell us who won:

win_by_runs means: the team that batted first (and so bowled second) won.

win_by_wickets means: the team that batted second (and so bowled first) won.

So we extract the columns toss_decision, win_by_runs, win_by_wickets and venue. From these we filter out the records whose win_by_wickets value is 0, which leaves the matches won by the team bowling first. Here is the Scala code to do that.

val data = sc.textFile("file:///home/kiran/Documents/datasets/matches.csv")
val filtering_bad_records = data.map(line=>line.split(",")).filter(x=>x.length<19)
val extracting_columns = filtering_bad_records.map(x=>(x(7),x(11),x(12),x(14)))
// Keep bowl_first_won as an RDD of (venue, wins) so that it can be joined later
val bowl_first_won = extracting_columns.filter(x=>x._3!="0").map(x=>(x._4,1)).reduceByKey(_+_)
// Swap to (wins, venue) only for printing in descending order of wins
bowl_first_won.map(item => item.swap).sortByKey(false).collect.foreach(println)

Code Explanation

In the first line of code we are loading the data from the local file system.

In the second line we split each record on commas and drop the bad records, if there are any: the data set has 18 columns, so only the records that split into fewer than 19 fields are kept.

In the third line we are extracting the columns that are required for our analysis, i.e., toss_decision, win_by_runs, win_by_wickets and venue.

In the remaining code we keep the records whose win_by_wickets value is not 0, build a key-value pair of the venue and the number 1, and add up the 1s per venue to count the matches won by bowling first at that stadium. Finally we sort the counts in descending order and print all of them.

Here is the result of this analysis

Output

(32,Eden Gardens)
(31,M Chinnaswamy Stadium)
(28,Feroz Shah Kotla)
(26,"Rajiv Gandhi International Stadium)
(24,Wankhede Stadium)
(23,Sawai Mansingh Stadium)
(20,"Punjab Cricket Association Stadium)
(17,"MA Chidambaram Stadium)
(10,Dr DY Patil Sports Academy)
(8,SuperSport Park)
(6,Kingsmead)
(6,Subrata Roy Sahara Stadium)
(5,"Sardar Patel Stadium)
(5,JSCA International Stadium Complex)
(5,New Wanderers Stadium)
(5,Maharashtra Cricket Association Stadium)
(5,Brabourne Stadium)
(4,Shaheed Veer Narayan Singh International Stadium)
(4,St George's Park)
(4,Himachal Pradesh Cricket Association Stadium)
(4,Sharjah Cricket Stadium)
(4,Saurashtra Cricket Association Stadium)
(4,Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium)
(4,"Punjab Cricket Association IS Bindra Stadium)
(4,Dubai International Cricket Stadium)
(3,Sheikh Zayed Stadium)
(3,Barabati Stadium)
(2,Newlands)
(2,De Beers Diamond Oval)
(2,Nehru Stadium)
(2,Green Park)
(2,Holkar Cricket Stadium)
(1,"Vidarbha Cricket Association Stadium)
(1,OUTsurance Oval)
(1,Buffalo Park)

We can see that Eden Gardens stands in first place.

Now we will compute the percentage of matches won by bowling first, from the number of bowl-first wins and the total number of matches held at each stadium.

This code finds the total number of matches each stadium has hosted.

val data = sc.textFile("file:///home/kiran/Documents/datasets/matches.csv")
val filtering_bad_records1 = data.map(line=>line.split(",")).filter(x=>x.length<19)
// Keep total_matches_per_venue as an RDD of (venue, matches) so that it can be joined later
val total_matches_per_venue = filtering_bad_records1.map(x=>(x(14),1)).reduceByKey(_+_)
total_matches_per_venue.map(item => item.swap).sortByKey(false).collect.foreach(println)

Now we will perform a join operation on the total number of matches in that venue and bowl_first_won and it can be done as follows:

// Both RDDs must be keyed by venue: (venue, wins) joined with (venue, matches)
val join1 = bowl_first_won.join(total_matches_per_venue).map(x=>(x._1,(x._2._1*100/x._2._2)))
join1.map(item => item.swap).sortByKey(false).collect.foreach(println)

Here is the percentage of first_bowl_won matches for each stadium.

Output

(100,Green Park)
(100,Holkar Cricket Stadium)
(80,Saurashtra Cricket Association Stadium)
(71,JSCA International Stadium Complex)
(69,Sawai Mansingh Stadium)
(66,Shaheed Veer Narayan Singh International Stadium)
(66,De Beers Diamond Oval)
(66,SuperSport Park)
(66,Sharjah Cricket Stadium)
(63,"Rajiv Gandhi International Stadium)
(62,New Wanderers Stadium)
(62,Maharashtra Cricket Association Stadium)
(59,Eden Gardens)
(58,Dr DY Patil Sports Academy)
(57,St George's Park)
(57,"Punjab Cricket Association IS Bindra Stadium)
(57,Dubai International Cricket Stadium)
(57,"Punjab Cricket Association Stadium)
(53,M Chinnaswamy Stadium)
(52,Feroz Shah Kotla)
(50,OUTsurance Oval)
(48,Wankhede Stadium)
(45,Brabourne Stadium)
(44,Himachal Pradesh Cricket Association Stadium)
(42,Sheikh Zayed Stadium)
(42,Barabati Stadium)
(41,"Sardar Patel Stadium)
(40,Kingsmead)
(40,Nehru Stadium)
(36,Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium)
(35,"MA Chidambaram Stadium)
(35,Subrata Roy Sahara Stadium)
(33,"Vidarbha Cricket Association Stadium)
(33,Buffalo Park)
(28,Newlands)

Green Park stands in first place, but only 2 matches were held there, whereas Eden Gardens has hosted 54 matches, 32 of which were won by the team bowling first. So if we take that winning streak, Eden Gardens is the most suitable for bowling first.

We hope this blog helped you understand how to perform analysis using Apache Spark. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.


7 Comments

  1. By executing the below line of code, I am getting the following error.
    val join = bat_first_won.join(total_matches_per_venue).map(x=>(x._1,(x._2._1*100/x._2._2))).map(item => item.swap).sortByKey(false).collect.foreach(println)
    error:
    scala> val join = bat_first_won.join(total_matches_per_venue).map(x=>(x._1,(x._2._1*100/x._2._2))).map(item => item.swap).sortByKey(false).collect.foreach(println)
    :37: error: value / is not a member of String
    val join = bat_first_won.join(total_matches_per_venue).map(x=>(x._1,(x._2._1*100/x._2._2))).map(item => item.swap).sortByKey(false).collect.foreach(println)

    1. Hi Shreyash,
      Please make sure that you are getting proper integer values in the x._2._1 RDD value by removing the division operation, i.e., /x._2._2, from the join RDD statement. If you are getting proper integers, then try applying the .toInt function on the RDD value, as x._2._1.toInt*100/x._2._2.

      1. That didn’t work Kiran. Still I get an error:
        :31: error: overloaded method value / with alternatives:
        (x: Double)Double
        (x: Float)Float
        (x: Long)Long
        (x: Int)Int
        (x: Char)Int
        (x: Short)Int
        (x: Byte)Int
        cannot be applied to (String)
        val JoinData = bat_first_won.join(total_match_per_venue).map(x=>(x._1,((x._2._1.toInt)*100/x._2._2))).map(item => item.swap).sortByKey(false).collect.foreach(println)

    2. Do not use "swap" when defining a variable like bowl_first_won.
      Here the data has been joined as (Int, String) with (Int, String), so we are trying to divide a String by a String, which is not a valid operation.
      Please follow the code snippet below for reference:
      val data = sc.textFile("/user/rahulkrch92/data/ipl/matches.csv")
      val bowl_first_won = extracting_columns.filter(x=>x._3!="0").map(x=>(x._4,1)) .reduceByKey(_+_).sortByKey(false)
      val filtering_bad_records1 = data.map(line=>line.split(",")).filter(x=>x.length<19)
      val total_matches_per_venue = filtering_bad_records1.map(x=>(x(14),1)).reduceByKey(_+_).sortByKey(false)
      val join1 = bowl_first_won.join(total_matches_per_venue).map(x=>(x._1,((x._2._1*100)/x._2._2))).map(item => item.swap).sortByKey(false).collect.foreach(println)
      *Note: avoid swap in a variable that you want to use further, because it will cause confusion while joining two datasets.
