In our earlier posts of Machine Learning with Spark, we had seen how our data looks like, along with its headers. However, that description was not sufficient to provide a complete business view. To have a complete grasp of the problem, we should know every part of the attributes of the data.
Following is the mapping for each attribute’s numerical value to its actual categorical values. This gives us enough information about which attribute value corresponds to what significance in actual business.
Attributes:
Creditability :
Account Balance (DM i.e. Deutsche Mark,western Germany currency):
Duration of Credit (month): Continuous variable
Purpose of loan :
Credit Amount : Continuous variable
Length of current employment :
Instalment per cent :
Sex & Marital Status :
Guarantors:
Duration in Current address:
Most valuable available asset :
Age (years) : Continuous variable
Type of apartment :
No of Credits at this Bank :
Occupation :
No of dependents:
Telephone :
Foreign Worker :
|
Now, let’s do a quick check on the average account balance, credit amount and loan duration as per the credibility.
# register the Customers frame as table Customers.registerTempTable(“credit”) # query the credability table to check average balance amount,average loan and average duration for # each class of customer i.e. 1 and 0 results = sqlContext.sql(“SELECT creditability, avg(balance) as avgbalance, avg(amount) as avgamt, \ avg(duration) as avgdur FROM credit GROUP BY creditability “) # check the result of the query results.show() |
+————-+——————+——————+——————+
|creditability| avgbalance| avgamt| avgdur|
+————-+——————+——————+——————+
| 1.0|2.8657142857142857| 2985.442857142857|19.207142857142856|
| 0.0|1.9033333333333333|3938.1266666666666| 24.86|
+————-+——————+——————+——————+
Similarly, we can do a quick check on the statistical summary of all numerical columns in the data frame, as shown below.
Customers.describe().show() |
This becomes very difficult to identify the summary for each column. Therefore, we can see the summary by separating them out.
Customers.describe(‘creditability’,’balance’,’duration’,’history’,’purpose’).show() Customers.describe(“amount”,”savings”, “employment”, “instPercent”,”sexMarried”).show() Customers.describe(“guarantors”,”residenceDuration”, “assets”, “age”, “concCredit”).show() Customers.describe(“apartment”,”credits”, “occupation”, “dependents”, “hasPhone”).show() Customers.describe(“foreign”).show() |
The above stats does not give a complete view of the data because these values look all numerical but they all are not continuous. Hence, the above stats might not work on all of them.
Barring ‘creditability’, ’age’ and ‘amount’, every other column is categorical i.e. they are binned into pre-defined categories. You can check those categories from the above table.
The above stats of mean, median, devian, max, min will be good for continuous variables. For categorical variables, we will have to break them into cubes and then we can have a good look at their summary.
First, let’s take a look at the distinct categories in each attribute in the data.
Customers.select(‘creditability’).distinct().show() |
+————-+
|creditability|
+————-+
| 0.0|
| 1.0|
+————-+
Customers.select(‘balance’).distinct().show() |
+——-+
|balance|
+——-+
| 1.0|
| 4.0|
| 3.0|
| 2.0|
+——-+
Customers.select(‘history’).distinct().show() |
+——-+
|history|
+——-+
| 0.0|
| 1.0|
| 4.0|
| 3.0|
| 2.0|
+——-+
Customers.select(‘purpose’).distinct().show() |
+——-+
|purpose|
+——-+
| 8.0|
| 0.0|
| 1.0|
| 4.0|
| 3.0|
| 2.0|
| 10.0|
| 6.0|
| 5.0|
| 9.0|
+——-+
Customers.select(‘savings’).distinct().show() |
+——-+
|savings|
+——-+
| 1.0|
| 4.0|
| 3.0|
| 2.0|
| 5.0|
+——-+
Customers.select(’employment’).distinct().show() |
+———-+
|employment|
+———-+
| 1.0|
| 4.0|
| 3.0|
| 2.0|
| 5.0|
+———-+
Customers.select(‘instPercent’).distinct().show() |
+———–+
|instPercent|
+———–+
| 1.0|
| 4.0|
| 3.0|
| 2.0|
+———–+
Customers.select(‘sexMarried’).distinct().show() |
+———-+
|sexMarried|
+———-+
| 1.0|
| 4.0|
| 3.0|
| 2.0|
+———-+
Customers.select(‘guarantors’).distinct().show() |
+———-+
|guarantors|
+———-+
| 1.0|
| 3.0|
| 2.0|
+———-+
Customers.select(‘residenceDuration’).distinct().show() |
+—————–+
|residenceDuration|
+—————–+
| 1.0|
| 4.0|
| 3.0|
| 2.0|
+—————–+
Customers.select(‘concCredit’).distinct().show() |
+———-+
|concCredit|
+———-+
| 1.0|
| 3.0|
| 2.0|
+———-+
Customers.select(‘apartment’).distinct().show() |
+———+
|apartment|
+———+
| 1.0|
| 3.0|
| 2.0|
+———+
Customers.select(‘occupation’).distinct().show() |
+———-+
|occupation|
+———-+
| 1.0|
| 4.0|
| 3.0|
| 2.0|
+———-+
Customers.select(‘dependents’).distinct().show() |
+———-+
|dependents|
+———-+
| 1.0|
| 2.0|
+———-+
Customers.select(‘credits’).distinct().show() |
+——-+
|credits|
+——-+
| 1.0|
| 4.0|
| 3.0|
| 2.0|
+——-+
Customers.select(‘hasPhone’).distinct().show() |
+——–+
|hasPhone|
+——–+
| 1.0|
| 2.0|
+——–+
Customers.select(‘foreign’).distinct().show() |
+——-+
|foreign|
+——-+
| 1.0|
| 2.0|
+——-+
For the continuous variables, we can check the statistical summary by checking its mean, stddev, max, min and count.
Customers.select(‘age’).describe().show() |
+——-+——————+
|summary| age|
+——-+——————+
| count| 1000|
| mean| 35.542|
| stddev|11.352670131696735|
| min| 19.0|
| max| 75.0|
+——-+——————+
Customers.select(‘amount’).describe().show() |
+——-+——————+
|summary| amount|
+——-+——————+
| count| 1000|
| mean| 3271.248|
| stddev|2822.7517598956515|
| min| 250.0|
| max| 18424.0|
+——-+——————+
Since the goal of this analysis is to find the credibility of a customer using various attributes, let’s drive our EDA by keeping it creditability-oriented.
Let’s find out the nature of savings column towards the creditability i.e. count of customers who falls into various saving buckets and then among them how many are credible for the loan and how many are not.
savings_cube = Customers.cube(“creditability”, Customers.savings).count().orderBy(“creditability”, “savings”). \ withColumnRenamed(“count”,”cnt”) savings_cube.select(‘*’).filter(savings_cube.creditability.isNotNull()).show() |
+————-+——-+—+
|creditability|savings|cnt|
+————-+——-+—+
| 0.0| null|300|
| 0.0| 1.0|217|
| 0.0| 2.0| 34|
| 0.0| 3.0| 11|
| 0.0| 4.0| 6|
| 0.0| 5.0| 32|
| 1.0| null|700|
| 1.0| 1.0|386|
| 1.0| 2.0| 69|
| 1.0| 3.0| 52|
| 1.0| 4.0| 42|
| 1.0| 5.0|151|
+————-+——-+—+
As per the above table, the insights are as follows:
- Total number of people who are not creditable for the loan are : 300
- 1 : 217
- 2 : 34
- 3 : 11
- 4 : 6
- 5 : 32
- Total no of people creditable for the loan are : 700
- 1 : 386
- 2 : 69
- 3 : 52
- 4 : 42
- 5 : 151
People who are not credible for the loan are mostly those who do not have any savings.
The same information can be shown using pivot functionality of Spark with a concise code. This looks more presentable.
Customers.groupBy(“creditability”).pivot(“savings”).count().show() |
+————-+—+—+—+—+—+
|creditability|1.0|2.0|3.0|4.0|5.0|
+————-+—+—+—+—+—+
| 0.0|217| 34| 11| 6| 32|
| 1.0|386| 69| 52| 42|151|
+————-+—+—+—+—+—+
Similarly, we can see the summary of creditable people according to their account balance.
Customers.groupBy(“creditability”).pivot(“balance”).count().show() |
+————-+—+—+—+—+
|creditability|1.0|2.0|3.0|4.0|
+————-+—+—+—+—+
| 0.0|135|105| 14| 46|
| 1.0|139|164| 49|348|
+————-+—+—+—+—+
We can check the summary of people as per their payment history as well.
Customers.groupBy(“creditability”).pivot(“history”).count().show() |
+————-+—+—+—+—+—+
|creditability|0.0|1.0|2.0|3.0|4.0|
+————-+—+—+—+—+—+
| 0.0| 25| 28|169| 28| 50|
| 1.0| 15| 21|361| 60|243|
+————-+—+—+—+—+—+
We can also check for the summary of the purpose of loans, as shown below.
Customers.groupBy(“creditability”).pivot(“purpose”).count().show() |
+————-+—+—+—+—+—+—+—+—+—+—-+
|creditability|0.0|1.0|2.0|3.0|4.0|5.0|6.0|8.0|9.0|10.0|
+————-+—+—+—+—+—+—+—+—+—+—-+
| 0.0| 89| 17| 58| 62| 4| 8| 22| 1| 34| 5|
| 1.0|145| 86|123|218| 8| 14| 28| 8| 63| 7|
+————-+—+—+—+—+—+—+—+—+—+—-+
We can see the same stats based on occupation and the apartment.
Customers.groupBy(“creditability”).pivot(“occupation”).count().show() |
+————-+—+—+—+—+
|creditability|1.0|2.0|3.0|4.0|
+————-+—+—+—+—+
| 0.0| 7| 56|186| 51|
| 1.0| 15|144|444| 97|
+————-+—+—+—+—+
Customers.groupBy(“creditability”).pivot(“apartment”).count().show() |
+————-+—+—+—+
|creditability|1.0|2.0|3.0|
+————-+—+—+—+
| 0.0| 70|186| 44|
| 1.0|109|528| 63|
+————-+—+—+—+
We can do numerous statistical review of the data like shown above. In our next post, we will be looking at how to summarize with visualizations.
Hope this post has been helpful in understanding the various attributes of date. In the case of any queries, feel free to comment below and we will get back to you at the earliest.
Stay tuned to our blog for more posts on Big Data and other technologies.