IPL Matches Data Analysis Using Spark

Let’s mine IPL match data and derive some useful insights from it, such as which stadium is most suitable for batting first and which is most suitable for bowling first. You can download the data from the link below.

Here is the data set description:

id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3

1. Which stadium is best suited for batting first

Here we evaluate which stadium is most suitable for batting first. Here is how we can do that.

win_by_runs means – the team batted first and won the match by a margin of runs.

win_by_wickets means – the team batted second and chased the target successfully.

So we will take out the columns toss_decision, win_by_runs, win_by_wickets, and venue. From these we will filter out the records whose win_by_runs value is 0, so that we are left with the matches won by the team batting first. Here is the Scala code to do that.

```
val data = sc.textFile("file:///home/kiran/Documents/datasets/matches.csv")
val filtering_bad_records = data.map(x => x.split(",")).filter(x => x.length >= 19)
val extracting_columns = filtering_bad_records.map(x => (x(7), x(11), x(12), x(14)))
val bat_first_won = extracting_columns.filter(x => x._2 != "0").map(x => (x._4, 1)).reduceByKey(_ + _)
bat_first_won.map(item => item.swap).sortByKey(false).collect.foreach(println)
```

Code Explanation

In the first line of code we are loading the data from the local file system.

In the second line we are filtering out bad records, if any: any record having fewer than 19 columns is filtered out.

In the third line we are extracting the columns required for our analysis, i.e., toss_decision, win_by_runs, win_by_wickets, and venue.

In the fourth line we keep only the records whose win_by_runs value is greater than 0 and prepare a key-value pair of (venue, 1), then count the bat-first wins per stadium with reduceByKey. Finally we swap the pairs, sort by the counts, and print all of them.
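The four steps above can be sketched with plain Scala collections, runnable without a Spark cluster. The three sample rows below are made up for illustration; the field positions (0-indexed: 7 = toss_decision, 11 = win_by_runs, 12 = win_by_wickets, 14 = venue) follow the 18-field header shown earlier, and `groupBy` stands in for `reduceByKey`:

```scala
// Made-up sample rows in the matches.csv layout (18 fields per the header).
val rows = Seq(
  "1,2017,Pune,2017-04-06,A,B,A,bat,normal,0,A,35,0,PX,Eden Gardens,u1,u2,u3",
  "2,2017,Pune,2017-04-07,C,D,C,field,normal,0,D,0,6,PY,Eden Gardens,u1,u2,u3",
  "3,2017,Pune,2017-04-08,A,C,A,bat,normal,0,A,12,0,PZ,Feroz Shah Kotla,u1,u2,u3"
)
val batFirstWon = rows
  .map(_.split(","))                       // naive comma split, as in the RDD version
  .filter(_.length == 18)                  // drop malformed records
  .map(x => (x(7), x(11), x(12), x(14)))   // toss_decision, win_by_runs, win_by_wickets, venue
  .filter(_._2 != "0")                     // win_by_runs > 0 => the side batting first won
  .map(x => (x._4, 1))
  .groupBy(_._1)                           // local stand-in for reduceByKey(_ + _)
  .map { case (venue, ones) => (venue, ones.size) }
println(batFirstWon.toSeq.sortBy(-_._2))
```

On the full dataset, the Spark job itself prints the venue counts shown below.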

Output

```
(30,"MA Chidambaram Stadium)
(24,Feroz Shah Kotla)
(22,Eden Gardens)
(7,Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium)
(4,Newlands)
(4,SuperSport Park)
(3,St George's Park)
(3,"Punjab Cricket Association IS Bindra Stadium)
(2,Shaheed Veer Narayan Singh International Stadium)
(2,Buffalo Park)
(1,De Beers Diamond Oval)
(1,OUTsurance Oval)
```

From this analysis, the top result so far is (30,"MA Chidambaram Stadium): 30 matches at MA Chidambaram Stadium were won by the team batting first.
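The stray opening quote in the venue name is an artifact of the naive comma split: a few venue names in the CSV contain a comma and are therefore quoted, and `split(",")` is not quote-aware. A minimal illustration (the row below is shortened and made up):

```scala
// "MA Chidambaram Stadium, Chepauk" is one quoted CSV field,
// but split(",") breaks it at the embedded comma.
val line = "7,2008,Chennai,\"MA Chidambaram Stadium, Chepauk\",team1"
val fields = line.split(",")
println(fields(3))  // "MA Chidambaram Stadium  <- opening quote kept, ", Chepauk" lost
```

A quote-aware CSV parser (for example, Spark's DataFrame CSV reader) avoids this problem; for this analysis the artifact only affects how a few venue names are printed.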


But this is not the final result; we also need the total number of matches for which MA Chidambaram Stadium has been the venue.

Let us see how many matches each stadium has hosted. Here is the code to do that.

```
val data = sc.textFile("file:///home/kiran/Documents/datasets/matches.csv")
val filtering_bad_records = data.map(x => x.split(",")).filter(x => x.length >= 19)
val total_matches_per_venue = filtering_bad_records.map(x => (x(14), 1)).reduceByKey(_ + _)
total_matches_per_venue.map(item => item.swap).sortByKey(false).collect.foreach(println)
```

Here is the total number of matches each stadium has hosted.

Output

```
(58,M Chinnaswamy Stadium)
(54,Eden Gardens)
(53,Feroz Shah Kotla)
(12,SuperSport Park)
(11,Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium)
(7,Newlands)
(7,St George's Park)
(7,"Punjab Cricket Association IS Bindra Stadium)
(6,Shaheed Veer Narayan Singh International Stadium)
(3,De Beers Diamond Oval)
(3,Buffalo Park)
(2,OUTsurance Oval)
(2,Green Park)
```

MA Chidambaram Stadium has hosted 48 matches in total, and in 30 of them the team batting first won.

So we will now compute, for each stadium, the percentage of matches won by the team batting first. Here is the code to do that.

```
val join = bat_first_won.join(total_matches_per_venue).map(x => (x._1, x._2._1 * 100 / x._2._2))
join.map(item => item.swap).sortByKey(false).collect.foreach(println)
```

Here we have joined the two RDDs, bat_first_won and total_matches_per_venue, and computed the bat-first win percentage per venue by dividing the number of matches won by batting first by the total number of matches at that venue, multiplied by 100.
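As a quick sanity check, the same join-and-percentage step can be reproduced on plain Scala maps using two figures taken from the outputs above (Eden Gardens: 22 bat-first wins out of 54 matches; Feroz Shah Kotla: 24 out of 53):

```scala
// Local sketch of join + percentage; integer division, as in the RDD code.
val batFirstWon  = Map("Eden Gardens" -> 22, "Feroz Shah Kotla" -> 24)
val totalMatches = Map("Eden Gardens" -> 54, "Feroz Shah Kotla" -> 53)
val winPct = batFirstWon.map { case (venue, wins) =>
  (venue, wins * 100 / totalMatches(venue))
}
println(winPct)  // Eden Gardens -> 40, Feroz Shah Kotla -> 45
```

These match the 40 and 45 percent figures in the output that follows.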

Here are the bat-first win percentages for each venue.

Output

```
(66,"Vidarbha Cricket Association Stadium)
(66,Buffalo Park)
(63,Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium)
(57,Newlands)
(50,OUTsurance Oval)
(45,Feroz Shah Kotla)
(42,St George's Park)
(42,"Punjab Cricket Association IS Bindra Stadium)
(40,Eden Gardens)
(33,Shaheed Veer Narayan Singh International Stadium)
(33,De Beers Diamond Oval)
(33,SuperSport Park)
```

Vidarbha Cricket Association Stadium stands in first place, but only 3 matches in total were held there.

Looking at the list, the top 4 stadiums by bat-first win percentage each hosted fewer than 20 matches. MA Chidambaram Stadium, on the other hand, hosted 48 matches in total. Taking that into account, we can deduce that MA Chidambaram Stadium is the stadium most suitable for batting first in the IPL.
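This sample-size reasoning can be made explicit by filtering out venues below a minimum match count before ranking. The threshold of 20 is an assumed cutoff, not part of the original analysis; the figures come from the numbers above (66% over 3 matches for Vidarbha; 62 = 30 × 100 / 48 for MA Chidambaram, computed from the counts earlier; 45% over 53 for Feroz Shah Kotla):

```scala
// (win percentage, total matches) per venue; 20 is an illustrative threshold.
val pctAndTotal = Map(
  "Vidarbha Cricket Association Stadium" -> (66, 3),
  "MA Chidambaram Stadium"               -> (62, 48),
  "Feroz Shah Kotla"                     -> (45, 53)
)
val reliable = pctAndTotal.collect {
  case (venue, (pct, total)) if total >= 20 => (venue, pct)
}
println(reliable.toSeq.sortBy(-_._2))  // MA Chidambaram Stadium ranks first
```

With the small-sample venues removed, MA Chidambaram Stadium tops the ranking, which is exactly the deduction made above.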

In a similar way, let us find the stadium that favours bowling first.

2. Which stadium is best suited for bowling first

Here we evaluate which stadium is most suitable for bowling first. Here is how we can do that.

win_by_runs means – the side that batted first (bowled second) won.

win_by_wickets means – the side that batted second (bowled first) won.

So we will take out the columns toss_decision, win_by_runs, win_by_wickets, and venue. From these we will filter out the records whose win_by_wickets value is 0, so that we are left with the matches won by the team bowling first. Here is the Scala code to do that.

```
val data = sc.textFile("file:///home/kiran/Documents/datasets/matches.csv")
val filtering_bad_records = data.map(x => x.split(",")).filter(x => x.length >= 19)
val extracting_columns = filtering_bad_records.map(x => (x(7), x(11), x(12), x(14)))
val bowl_first_won = extracting_columns.filter(x => x._3 != "0").map(x => (x._4, 1)).reduceByKey(_ + _)
bowl_first_won.map(item => item.swap).sortByKey(false).collect.foreach(println)
```

Code Explanation

In the first line of code we are loading the data from the local file system.

In the second line we are filtering out bad records, if any: any record having fewer than 19 columns is filtered out.

In the third line we are extracting the columns required for our analysis, i.e., toss_decision, win_by_runs, win_by_wickets, and venue.

In the fourth line we keep only the records whose win_by_wickets value is greater than 0 and prepare a key-value pair of (venue, 1), then count the bowl-first wins per stadium with reduceByKey. Finally we swap the pairs, sort by the counts, and print all of them.

Here is the result of this analysis

Output

```
(32,Eden Gardens)
(28,Feroz Shah Kotla)
(8,SuperSport Park)
(4,Shaheed Veer Narayan Singh International Stadium)
(4,St George's Park)
(4,Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium)
(4,"Punjab Cricket Association IS Bindra Stadium)
(2,Newlands)
(2,De Beers Diamond Oval)
(2,Green Park)
(1,OUTsurance Oval)
(1,Buffalo Park)
```

We can see that Eden Gardens stands in first place.

Now we will compute the percentage of bowl-first wins by dividing the number of bowl-first wins by the total number of matches held at each stadium.

This code finds the total number of matches each stadium has hosted (the same computation as before):

```
val data = sc.textFile("file:///home/kiran/Documents/datasets/matches.csv")
val filtering_bad_records = data.map(x => x.split(",")).filter(x => x.length >= 19)
val total_matches_per_venue = filtering_bad_records.map(x => (x(14), 1)).reduceByKey(_ + _)
total_matches_per_venue.map(item => item.swap).sortByKey(false).collect.foreach(println)
```

Now we will join the total number of matches per venue with bowl_first_won, as follows:

```
val join1 = bowl_first_won.join(total_matches_per_venue).map(x => (x._1, x._2._1 * 100 / x._2._2))
join1.map(item => item.swap).sortByKey(false).collect.foreach(println)
```

Here is the percentage of bowl-first wins for each stadium.

Output

```
(100,Green Park)
(66,Shaheed Veer Narayan Singh International Stadium)
(66,De Beers Diamond Oval)
(66,SuperSport Park)
(59,Eden Gardens)
(57,St George's Park)
(57,"Punjab Cricket Association IS Bindra Stadium)
(52,Feroz Shah Kotla)
(50,OUTsurance Oval)
(36,Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium)
(33,Buffalo Park)
(28,Newlands)
```

Green Park stands in first place, but only 2 matches were held there. Eden Gardens, on the other hand, hosted 54 matches, of which 32 were won by the team bowling first. Taking that into account, Eden Gardens is the stadium most suitable for bowling first.

We hope this blog helped you understand how to perform analysis using Apache Spark. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.

kiran

1. Pingback: Hot reads for this week in machine learning and deep learning – Everything Artificial Intelligence
2. Shreyash says:

by executing below line of code, i am getting following error..
val join = bat_first_won.join(total_matches_per_venue).map(x=>(x._1,(x._2._1*100/x._2._2))).map(item => item.swap).sortByKey(false).collect.foreach(println)
error:
scala> val join = bat_first_won.join(total_matches_per_venue).map(x=>(x._1,(x._2._1*100/x._2._2))).map(item => item.swap).sortByKey(false).collect.foreach(println)
:37: error: value / is not a member of String
val join = bat_first_won.join(total_matches_per_venue).map(x=>(x._1,(x._2._1*100/x._2._2))).map(item => item.swap).sortByKey(false).collect.foreach(println)

1. Hi Shreyash,
Please make sure that you are getting proper integer values in the x._2._1 RDD value by removing the division operation (i.e., /x._2._2) from the join RDD statement. If you are getting proper integers, then try applying the .toInt function on the RDD value, as x._2._1.toInt*100/x._2._2.

1. Monisha says:

That didn’t work Kiran. Still I get an error:
:31: error: overloaded method value / with alternatives:
(x: Double)Double
(x: Float)Float
(x: Long)Long
(x: Int)Int
(x: Char)Int
(x: Short)Int
(x: Byte)Int
cannot be applied to (String)
val JoinData = bat_first_won.join(total_match_per_venue).map(x=>(x._1,((x._2._1.toInt)*100/x._2._2))).map(item => item.swap).sortByKey(false).collect.foreach(println
)

2. NITIN says:

sir can you please provide the code for cleaning IPL data i have 637 file i want to filter all the record into one csv please share the code
Thanks
[email protected]

2. rahul says:

do not use “swap” in variable like bowl_first_won
here data has been joined with v(int,string) and (int,string) and we are trying to divide string/string that is not a valid operation
val data = sc.textFile("/user/rahulkrch92/data/ipl/matches.csv")
val bowl_first_won = extracting_columns.filter(x=>x._3!="0").map(x=>(x._4,1)).reduceByKey(_+_).sortByKey(false)
val join1 = bowl_first_won.join(total_matches_per_venue).map(x=>(x._1,((x._2._1*100)/x._2._2))).map(item => item.swap).sortByKey(false).collect.foreach(println)
*note: avoid swap in variable if you want to use further because it will create confusion while joining two datasets

3. aman says:

Above code snippet does not look good . Filter condition is not correct .

4. Saurabh says:

Thanks for sharing this interesting post.

5. Lamont Burigsay says:

It’s pretty worth enough for me. In my opinion, if all webmasters and bloggers made good content as you did, the web will be a fantasy sport

6. Debraj says:

val reduced_rdd = mofm.reduceByKey(_ + _).map(item => item.swap).sortByKey(false).take(5)

Howto save the result to the following path: /home/scrapbook/tutorial/IPLData

