
Machine Learning with Spark – Part 4 : Determining Credibility of a Customer

In our earlier posts of Machine Learning with Spark, we saw what our data looks like, along with its headers. However, that description was not sufficient to provide a complete business view. To grasp the problem fully, we should understand every attribute of the data.
Following is the mapping from each attribute's numerical values to its actual categorical values. This tells us which attribute value corresponds to what significance in the actual business.
Attributes:

Creditability :

  • 1 : Yes
  • 0 : No

Account Balance (in DM, i.e. Deutsche Mark, the former German currency):

  • 1 : No Account
  • 2 : None
  • 3 : Below 200 DM
  • 4 : 200 DM or above

Duration of Credit (month): Continuous variable
Payment Status of Previous Credit :

  • 0 : Delayed
  • 1 : Other Credits
  • 2 : Paid
  • 3 : No problem
  • 4 : Previous credits cleared

Purpose of loan :

  • 1 : New car
  • 2 : Used car
  • 3 : Furniture
  • 4 : Radio/TV
  • 5 : Appliances
  • 6 : Repair
  • 8 : Vacation
  • 9 : Retraining
  • 10 : Business
  • 0 : Other

Credit Amount : Continuous variable
Value Savings/Stocks :

  • 1 : None
  • 2 : Below 100 DM
  • 3 : 100 – 500 DM
  • 4 : 500 – 1000 DM
  • 5 : > 1000 DM

Length of current employment :

  • 1 : Unemployed
  • 2 : < 1 Year
  • 3 : 1 – 4 Year
  • 4 : 4 – 7 Year
  • 5 : > 7 Year

Instalment per cent :

  • 1 : > 35%
  • 2 : 25% – 35%
  • 3 : 20% – 25%
  • 4 : < 20%

Sex & Marital Status :

  • 1 : Male (Divorced)
  • 2 : Male (Single)
  • 3 : Male (Married/Widowed)
  • 4 : Female

Guarantors:

  • 1 : None
  • 2 : Co-applicant
  • 3 : Guarantor

Duration in Current address:

  • 1 : < 1 Year
  • 2 : 1 – 4 Year
  • 3 : 4 – 7 Year
  • 4 : > 7 Year

Most valuable available asset :

  • 1 : None
  • 2 : Car
  • 3 : Life Insurance
  • 4 : Real Estate

Age (years) : Continuous variable
Concurrent Credits :

  • 1 : Other Banks
  • 2 : Department Stores
  • 3 : None

Type of apartment :

  • 1 : Free
  • 2 : Rented
  • 3 : Owned

No of Credits at this Bank :

  • 1 : 1
  • 2 : 2 or 3
  • 3 : 4 or 5
  • 4 : More than 6

Occupation :

  • 1 : Unemployed, unskilled
  • 2 : Unskilled, permanent resident
  • 3 : Skilled
  • 4 : Executive

No of dependents:

  • 1 : >3
  • 2 : <3

Telephone :

  • 1 : Yes
  • 2 : No

Foreign Worker :

  • 1 : Yes
  • 2 : No
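
For convenience during the analysis, these mappings can be kept in code. Below is a small illustrative sketch in plain Python; the `ATTRIBUTE_LABELS` and `decode` names are our own, and only a few attributes are shown:

```python
# Code-to-label mappings for a few of the categorical attributes listed above.
# (Illustrative structure; extend with the remaining attributes as needed.)
ATTRIBUTE_LABELS = {
    "creditability": {1: "Yes", 0: "No"},
    "savings": {1: "None", 2: "Below 100 DM", 3: "100 - 500 DM",
                4: "500 - 1000 DM", 5: "> 1000 DM"},
    "occupation": {1: "Unemployed, unskilled", 2: "Unskilled, permanent resident",
                   3: "Skilled", 4: "Executive"},
}

def decode(attribute, code):
    """Translate a numeric attribute code into its business meaning."""
    # The data stores codes as floats (e.g. 5.0), so normalize to int first.
    return ATTRIBUTE_LABELS[attribute][int(code)]

print(decode("savings", 5.0))
```

Such a lookup makes it easy to render human-readable labels when presenting results.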

Now, let's do a quick check on the average account balance, credit amount and loan duration by creditability.

# Register the Customers data frame as a temporary table
Customers.registerTempTable("credit")
# Query the credit table for the average balance, average loan amount and
# average duration for each class of customer, i.e. 1 and 0
results = sqlContext.sql("SELECT creditability, avg(balance) as avgbalance, avg(amount) as avgamt, \
                         avg(duration) as avgdur FROM credit GROUP BY creditability")
# Check the result of the query
results.show()

+-------------+------------------+------------------+------------------+
|creditability|        avgbalance|            avgamt|            avgdur|
+-------------+------------------+------------------+------------------+
|          1.0|2.8657142857142857| 2985.442857142857|19.207142857142856|
|          0.0|1.9033333333333333|3938.1266666666666|             24.86|
+-------------+------------------+------------------+------------------+
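
For intuition, the same GROUP BY aggregation can be sketched in plain Python. The four rows below are made up purely for illustration and are not taken from the actual dataset:

```python
from collections import defaultdict

# Hypothetical (creditability, balance, amount, duration) rows, for illustration only.
rows = [
    (1.0, 4, 2000.0, 12), (1.0, 2, 3500.0, 24),
    (0.0, 1, 5000.0, 36), (0.0, 2, 3000.0, 18),
]

# Group the rows by creditability, mirroring GROUP BY creditability.
groups = defaultdict(list)
for creditability, balance, amount, duration in rows:
    groups[creditability].append((balance, amount, duration))

# Average each of the three measures per group, mirroring the avg() calls.
avgs = {}
for creditability, members in groups.items():
    n = len(members)
    avgs[creditability] = tuple(sum(m[i] for m in members) / n for i in range(3))

print(avgs)
```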
Similarly, we can do a quick check on the statistical summary of all numerical columns in the data frame, as shown below.

Customers.describe().show()

With all columns shown at once, it is difficult to read the summary for each column. Therefore, we can view the summaries by separating the columns into groups.

Customers.describe("creditability", "balance", "duration", "history", "purpose").show()
Customers.describe("amount", "savings", "employment", "instPercent", "sexMarried").show()
Customers.describe("guarantors", "residenceDuration", "assets", "age", "concCredit").show()
Customers.describe("apartment", "credits", "occupation", "dependents", "hasPhone").show()
Customers.describe("foreign").show()






The above stats do not give a complete view of the data: the values all look numerical, but not all of them are continuous, so these summary statistics are not meaningful for every column.
Barring 'duration', 'amount' and 'age', every other column is categorical, i.e. binned into pre-defined categories ('creditability' itself is the binary label). You can check those categories in the attribute list above.
Stats such as mean, stddev, max and min are suitable for continuous variables. For categorical variables, we will instead break the data into cubes and look at category-wise counts.
First, let’s take a look at the distinct categories in each attribute in the data.

Customers.select("creditability").distinct().show()

+-------------+
|creditability|
+-------------+
|          0.0|
|          1.0|
+-------------+

Customers.select("balance").distinct().show()

+-------+
|balance|
+-------+
|    1.0|
|    4.0|
|    3.0|
|    2.0|
+-------+

Customers.select("history").distinct().show()

+-------+
|history|
+-------+
|    0.0|
|    1.0|
|    4.0|
|    3.0|
|    2.0|
+-------+

Customers.select("purpose").distinct().show()

+-------+
|purpose|
+-------+
|    8.0|
|    0.0|
|    1.0|
|    4.0|
|    3.0|
|    2.0|
|   10.0|
|    6.0|
|    5.0|
|    9.0|
+-------+

Customers.select("savings").distinct().show()

+-------+
|savings|
+-------+
|    1.0|
|    4.0|
|    3.0|
|    2.0|
|    5.0|
+-------+

Customers.select("employment").distinct().show()

+----------+
|employment|
+----------+
|       1.0|
|       4.0|
|       3.0|
|       2.0|
|       5.0|
+----------+

Customers.select("instPercent").distinct().show()

+-----------+
|instPercent|
+-----------+
|        1.0|
|        4.0|
|        3.0|
|        2.0|
+-----------+

Customers.select("sexMarried").distinct().show()

+----------+
|sexMarried|
+----------+
|       1.0|
|       4.0|
|       3.0|
|       2.0|
+----------+

Customers.select("guarantors").distinct().show()

+----------+
|guarantors|
+----------+
|       1.0|
|       3.0|
|       2.0|
+----------+

Customers.select("residenceDuration").distinct().show()

+-----------------+
|residenceDuration|
+-----------------+
|              1.0|
|              4.0|
|              3.0|
|              2.0|
+-----------------+

Customers.select("concCredit").distinct().show()

+----------+
|concCredit|
+----------+
|       1.0|
|       3.0|
|       2.0|
+----------+

Customers.select("apartment").distinct().show()

+---------+
|apartment|
+---------+
|      1.0|
|      3.0|
|      2.0|
+---------+

Customers.select("occupation").distinct().show()

+----------+
|occupation|
+----------+
|       1.0|
|       4.0|
|       3.0|
|       2.0|
+----------+

Customers.select("dependents").distinct().show()

+----------+
|dependents|
+----------+
|       1.0|
|       2.0|
+----------+

Customers.select("credits").distinct().show()

+-------+
|credits|
+-------+
|    1.0|
|    4.0|
|    3.0|
|    2.0|
+-------+

Customers.select("hasPhone").distinct().show()

+--------+
|hasPhone|
+--------+
|     1.0|
|     2.0|
+--------+

Customers.select("foreign").distinct().show()

+-------+
|foreign|
+-------+
|    1.0|
|    2.0|
+-------+
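
Rather than issuing one call per column, the distinct-value check can be scripted in a loop. Here is a plain-Python equivalent using sets; the sampled column values below are hypothetical, for illustration only:

```python
# Hypothetical samples of column values, illustrating the distinct-value check.
sample = {
    "hasPhone": [1.0, 2.0, 1.0, 1.0, 2.0],
    "guarantors": [1.0, 3.0, 2.0, 1.0],
}

# A set keeps one copy of each value, mirroring distinct(); sort for stable output.
distincts = {col: sorted(set(values)) for col, values in sample.items()}
for col, values in distincts.items():
    print(col, values)
```

In PySpark itself, the analogous loop would simply call `Customers.select(col).distinct().show()` for each column name.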
For the continuous variables, we can check the statistical summary: their mean, stddev, max, min and count.

Customers.select("age").describe().show()

+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|              1000|
|   mean|            35.542|
| stddev|11.352670131696735|
|    min|              19.0|
|    max|              75.0|
+-------+------------------+

Customers.select("amount").describe().show()

+-------+------------------+
|summary|            amount|
+-------+------------------+
|  count|              1000|
|   mean|          3271.248|
| stddev|2822.7517598956515|
|    min|             250.0|
|    max|           18424.0|
+-------+------------------+
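
Note that describe() reports the sample standard deviation. The same summary can be reproduced in plain Python with the standard statistics module; the ages below are toy values, not the actual 1000-row column:

```python
import statistics

# Toy ages, for illustration only.
ages = [19.0, 30.0, 42.0, 75.0]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "stddev": statistics.stdev(ages),  # sample (n-1) stddev, as in Spark's describe()
    "min": min(ages),
    "max": max(ages),
}
print(summary)
```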
Since the goal of this analysis is to determine the credibility of a customer from these attributes, let's keep our EDA creditability-oriented.
Let's find out how the savings column relates to creditability, i.e. how many customers fall into each savings bucket, and among them, how many are creditable for a loan and how many are not.

savings_cube = Customers.cube("creditability", Customers.savings).count() \
    .orderBy("creditability", "savings") \
    .withColumnRenamed("count", "cnt")
savings_cube.select("*").filter(savings_cube.creditability.isNotNull()).show()

+-------------+-------+---+
|creditability|savings|cnt|
+-------------+-------+---+
|          0.0|   null|300|
|          0.0|    1.0|217|
|          0.0|    2.0| 34|
|          0.0|    3.0| 11|
|          0.0|    4.0|  6|
|          0.0|    5.0| 32|
|          1.0|   null|700|
|          1.0|    1.0|386|
|          1.0|    2.0| 69|
|          1.0|    3.0| 52|
|          1.0|    4.0| 42|
|          1.0|    5.0|151|
+-------------+-------+---+
As per the above table, the insights are as follows:

  • The total number of people who are not creditable for a loan is 300, split across the savings buckets as:
    • 1 : 217
    • 2 : 34
    • 3 : 11
    • 4 : 6
    • 5 : 32
  • The total number of people who are creditable for a loan is 700, split as:
    • 1 : 386
    • 2 : 69
    • 3 : 52
    • 4 : 42
    • 5 : 151

People who are not creditable for a loan are mostly those who do not have any savings (savings bucket 1).
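
To quantify that observation, we can compute the share of no-savings customers (savings bucket 1) in each class directly from the cube counts above:

```python
# Counts taken from the savings cube result above.
no_savings_bad, total_bad = 217, 300    # creditability 0.0
no_savings_good, total_good = 386, 700  # creditability 1.0

bad_share = no_savings_bad / total_bad
good_share = no_savings_good / total_good
print(round(bad_share, 2), round(good_share, 2))
```

Roughly 72% of the non-creditable customers have no savings, against about 55% of the creditable ones.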
The same information can be produced with Spark's pivot functionality in more concise code, and the result is more presentable.

Customers.groupBy("creditability").pivot("savings").count().show()

+-------------+---+---+---+---+---+
|creditability|1.0|2.0|3.0|4.0|5.0|
+-------------+---+---+---+---+---+
|          0.0|217| 34| 11|  6| 32|
|          1.0|386| 69| 52| 42|151|
+-------------+---+---+---+---+---+
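
For intuition, the groupBy-plus-pivot count can be mimicked in plain Python with a Counter. The (creditability, savings) pairs below are hypothetical, for illustration only:

```python
from collections import Counter

# Hypothetical (creditability, savings) pairs, for illustration only.
pairs = [(0.0, 1.0), (0.0, 1.0), (0.0, 2.0), (1.0, 1.0), (1.0, 5.0), (1.0, 5.0)]

# Count each (row-key, pivot-key) combination, as groupBy + pivot + count does.
counts = Counter(pairs)

# Reshape into one dict per creditability class, keyed by savings bucket.
pivot = {}
for (cred, savings), n in counts.items():
    pivot.setdefault(cred, {})[savings] = n
print(pivot)
```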
Similarly, we can see the summary of creditable people according to their account balance.

Customers.groupBy("creditability").pivot("balance").count().show()

+-------------+---+---+---+---+
|creditability|1.0|2.0|3.0|4.0|
+-------------+---+---+---+---+
|          0.0|135|105| 14| 46|
|          1.0|139|164| 49|348|
+-------------+---+---+---+---+
We can check the summary of people as per their payment history as well.

Customers.groupBy("creditability").pivot("history").count().show()

+-------------+---+---+---+---+---+
|creditability|0.0|1.0|2.0|3.0|4.0|
+-------------+---+---+---+---+---+
|          0.0| 25| 28|169| 28| 50|
|          1.0| 15| 21|361| 60|243|
+-------------+---+---+---+---+---+
We can also check for the summary of the purpose of loans, as shown below.

Customers.groupBy("creditability").pivot("purpose").count().show()

+-------------+---+---+---+---+---+---+---+---+---+----+
|creditability|0.0|1.0|2.0|3.0|4.0|5.0|6.0|8.0|9.0|10.0|
+-------------+---+---+---+---+---+---+---+---+---+----+
|          0.0| 89| 17| 58| 62|  4|  8| 22|  1| 34|   5|
|          1.0|145| 86|123|218|  8| 14| 28|  8| 63|   7|
+-------------+---+---+---+---+---+---+---+---+---+----+
We can see the same stats based on occupation and apartment type.

Customers.groupBy("creditability").pivot("occupation").count().show()

+-------------+---+---+---+---+
|creditability|1.0|2.0|3.0|4.0|
+-------------+---+---+---+---+
|          0.0|  7| 56|186| 51|
|          1.0| 15|144|444| 97|
+-------------+---+---+---+---+

Customers.groupBy("creditability").pivot("apartment").count().show()

+-------------+---+---+---+
|creditability|1.0|2.0|3.0|
+-------------+---+---+---+
|          0.0| 70|186| 44|
|          1.0|109|528| 63|
+-------------+---+---+---+
We can carry out numerous statistical reviews of the data, as shown above. In our next post, we will look at how to summarize the data with visualizations.

Hope this post has been helpful in understanding the various attributes of the data. In case of any queries, feel free to comment below and we will get back to you at the earliest.
Stay tuned to our blog for more posts on Big Data and other technologies.


Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.
