
Machine Learning with Spark – Part 5 : Determining Credibility of a Customer

It is said that a picture is worth a thousand words. With that in mind, let’s see how visualization can reveal patterns in the data.
We will start by checking the distributions of the continuous variables, i.e. Age and Amount.
Let’s use the Matplotlib library for the visualization.

import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  


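# Collect the age column to the driver as a flat Python list
ages = Customers.select('age').rdd.flatMap(list).collect()
# density=True normalizes the histogram (it replaces the older 'normed' argument)
plt.hist(ages, bins=20, color='lightblue', density=True)
fig = plt.gcf()
fig.set_size_inches(10, 6)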

Similarly, for the amount values, we can check their distribution using a histogram.

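# Collect the amount column to the driver as a flat Python list
amounts = Customers.select('amount').rdd.flatMap(list).collect()
# density=True normalizes the histogram (it replaces the older 'normed' argument)
plt.hist(amounts, bins=20, color='lightblue', density=True)
fig = plt.gcf()
fig.set_size_inches(10, 6)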
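
To make the effect of normalizing a histogram concrete, here is a minimal pure-Python sketch of how densities for equal-width bins can be computed: each bin count is divided by the total count times the bin width, so the bar areas add up to 1. The function name `histogram_density` and the sample values are hypothetical and are not part of the original analysis.

```python
def histogram_density(values, bins):
    """Return (edges, densities) for equal-width bins over the data range.

    Assumes the values are not all identical (otherwise the bin width is 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(bins)
    counts = [0] * bins
    for v in values:
        # The maximum value goes into the last bin (its right edge is inclusive).
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = float(len(values))
    densities = [c / (total * width) for c in counts]
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, densities


# Hypothetical ages, used only to illustrate the calculation.
sample_ages = [19, 23, 25, 31, 35, 35, 42, 47, 58, 75]
edges, densities = histogram_density(sample_ages, bins=4)
bin_width = edges[1] - edges[0]
area = sum(d * bin_width for d in densities)  # total bar area
```

Because of this scaling, the y-axis of a normalized histogram shows densities rather than raw customer counts.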

Check Frequencies for Each Category of Balance Column:
Here we look at how customers are distributed across the different Account Balance categories.

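# Count customers per balance category and collect the result as a dict
balance = Customers.groupBy('balance').count().rdd.collectAsMap()
plt.figure()
plt.bar(range(len(balance)), list(balance.values()), align='center')
plt.xticks(range(len(balance)), list(balance.keys()))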

According to the bar plot, around 500 customers have no balance in their account (categories 1 and 2), around 50 have less than 200 DM, and the majority have more than 200 DM in their account.
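
A dictionary built with `collectAsMap()` does not promise that its keys come back in category order, so the bars may appear shuffled from one run to the next. The following is a small sketch of sorting the categories before plotting. The helper name `sorted_bars` and the counts are hypothetical.

```python
def sorted_bars(counts):
    """Return (positions, labels, heights) with categories sorted by key."""
    labels = sorted(counts)
    heights = [counts[k] for k in labels]
    positions = list(range(len(labels)))
    return positions, labels, heights


# Hypothetical counts keyed by category code, deliberately out of order.
example_counts = {3: 5, 1: 20, 4: 40, 2: 10}
positions, labels, heights = sorted_bars(example_counts)
# With matplotlib these could be passed on as:
#   plt.bar(positions, heights, align='center')
#   plt.xticks(positions, labels)
```

Sorting on the driver is cheap here because only one row per category is collected.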

Check Frequencies for Each Category of History Column:
This section tells us more about the Payment Status of Previous Credit of each customer.

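# Count customers per payment-history category and collect the result as a dict
history = Customers.groupBy('history').count().rdd.collectAsMap()
plt.figure()
plt.bar(range(len(history)), list(history.values()), align='center')
plt.xticks(range(len(history)), list(history.keys()))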

Very few people make delayed payments or hold other credits. The majority of customers have paid back their credits. The next largest group, around 100 people, falls under the no-problem category of payment history.

Check Frequencies for Each Category of Purpose Column:
This section tells us the purpose for which each customer has taken a loan. This density is plotted as a percentage, unlike the previous sections, where we were dealing with frequencies.

purpose = Customers.groupBy('purpose').count().rdd.collectAsMap()
# Convert each count to a percentage of all customers
d = [(c / float(sum(purpose.values()))) * 100 for c in purpose.values()]
print(list(zip(purpose.keys(), d)))
plt.figure()
plt.bar(range(len(purpose)), d, align='center')
plt.xticks(range(len(purpose)), list(purpose.keys()))

Around 22% of customers take a loan for other things and 26% for a car (new or used).
The largest share, around 27%, takes a loan for furniture, whereas 2% take one for Radio/TV, 4% for Appliances, 6% for Repairs, 2% for vacation, 10% for retraining and 3% for Business.
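
The count-to-percentage step is repeated for several columns, so it can be wrapped in a small helper. This is a minimal sketch rather than the code used to produce the figures here. The name `to_percentages` and the sample counts are hypothetical.

```python
def to_percentages(counts):
    """Map each category to its share of the total, in percent."""
    total = float(sum(counts.values()))
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    return {key: value * 100 / total for key, value in counts.items()}


# Hypothetical counts for three loan purposes.
example = {"car": 50, "furniture": 30, "other": 20}
shares = to_percentages(example)
```

Returning a dict keeps each percentage attached to its category, which avoids relying on `keys()` and `values()` being iterated in the same order.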

Check Frequencies for Each Category of Savings Column:
This section tells us about the savings status of the customer.

savings = Customers.groupBy('savings').count().rdd.collectAsMap()
# Convert each count to a percentage of all customers
d = [(c / float(sum(savings.values()))) * 100 for c in savings.values()]
print(list(zip(savings.keys(), d)))
plt.figure()
plt.bar(range(len(savings)), d, align='center')
plt.xticks(range(len(savings)), list(savings.keys()))

The majority of customers have no savings. 10% have less than 100 DM in savings, 17% have between 100 and 1000 DM, and around 20% have more than 200 DM.

Check Frequencies for Each Category of Employment Column:
This tells us about the employment status of the customers.

employment = Customers.groupBy('employment').count().rdd.collectAsMap()
# Convert each count to a percentage of all customers
d = [(c / float(sum(employment.values()))) * 100 for c in employment.values()]
print(list(zip(employment.keys(), d)))
plt.figure()
plt.bar(range(len(employment)), d, align='center')
plt.xticks(range(len(employment)), list(employment.keys()))

5% are unemployed and around 16% have been employed for less than a year. The majority have work experience of around 1-4 years, 16% have between 4 and 7 years, and 26% have more than 7 years.
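
The x-axis of these plots shows whatever keys the column holds, while the discussion uses descriptive group names. If the keys are numeric codes, one way to bridge the two is to relabel the counts before plotting. The sketch below assumes that employment codes 1 to 5 correspond to the five groups just described, in that order. This mapping, the helper `relabel`, and the counts are assumptions for illustration and should be checked against the dataset's documentation.

```python
# Assumed mapping from employment code to a readable label (not confirmed by the data).
EMPLOYMENT_LABELS = {
    1: "Unemployed",
    2: "< 1 year",
    3: "1-4 years",
    4: "4-7 years",
    5: "> 7 years",
}


def relabel(counts, labels):
    """Replace category codes with labels; unknown codes keep their code as text."""
    return {labels.get(code, str(code)): n for code, n in counts.items()}


# Hypothetical counts keyed by category code; 9 is an unmapped code.
example_counts = {1: 2, 2: 4, 9: 1}
readable = relabel(example_counts, EMPLOYMENT_LABELS)
```

The relabelled keys can then be used as tick labels in place of the raw codes.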

Check Frequencies for Each Category of instPercent Column:
This section gives information about the installment percentage for each category.

instPercent = Customers.groupBy('instPercent').count().rdd.collectAsMap()
# Convert each count to a percentage of all customers
d = [(c / float(sum(instPercent.values()))) * 100 for c in instPercent.values()]
print(list(zip(instPercent.keys(), d)))
plt.figure()
plt.bar(range(len(instPercent)), d, align='center')
plt.xticks(range(len(instPercent)), list(instPercent.keys()))

The majority have an installment percentage of less than 20%. Around 12% have more than 35%, 22% have between 25% and 35%, and 13% have between 20% and 25%.

Check Frequencies for Each Category of ‘sexMarried’ Column:
This tells us about the various categories of customers by marital status and gender.

sexMarried = Customers.groupBy('sexMarried').count().rdd.collectAsMap()
# Convert each count to a percentage of all customers
d = [(c / float(sum(sexMarried.values()))) * 100 for c in sexMarried.values()]
print(list(zip(sexMarried.keys(), d)))
plt.figure()
plt.bar(range(len(sexMarried)), d, align='center')
plt.xticks(range(len(sexMarried)), list(sexMarried.keys()))

The majority are male (married/widowed), 30% are single men, and 9% are women.

Check Frequencies for Each Category of ‘guarantors’ Column:
This gives the count for each ‘guarantors’ category.

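# Count customers per guarantors category and collect the result as a dict
guarantors = Customers.groupBy('guarantors').count().rdd.collectAsMap()
plt.figure()
plt.bar(range(len(guarantors)), list(guarantors.values()), align='center')
plt.xticks(range(len(guarantors)), list(guarantors.keys()))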

The majority of customers have no guarantors.

Check Frequencies for Each Category of ‘residenceDuration’ Column:
This tells us about the duration a customer has been residing at an address.

residenceDuration = Customers.groupBy('residenceDuration').count().rdd.collectAsMap()
# Convert each count to a percentage of all customers
d = [(c / float(sum(residenceDuration.values()))) * 100 for c in residenceDuration.values()]
print(list(zip(residenceDuration.keys(), d)))
plt.figure()
plt.bar(range(len(residenceDuration)), d, align='center')
plt.xticks(range(len(residenceDuration)), list(residenceDuration.keys()))

The majority of customers have been staying at the given address for more than 7 years. Most probably this is their permanent address.

Check Frequencies for Each Category of ‘assets’ Column:
This gives us information about the most valuable asset that a customer has.

assets = Customers.groupBy('assets').count().rdd.collectAsMap()
# Convert each count to a percentage of all customers
d = [(c / float(sum(assets.values()))) * 100 for c in assets.values()]
print(list(zip(assets.keys(), d)))
plt.figure()
plt.bar(range(len(assets)), d, align='center')
plt.xticks(range(len(assets)), list(assets.keys()))

The majority have life insurance, 23% have a car and 27% have nothing. A minority have real estate.
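
Statements about the largest and smallest groups can also be read directly from the counts instead of from the chart. Below is a minimal sketch, assuming the counts are available as a dict like the ones collected in this post. The helper `extremes` and the sample data are hypothetical.

```python
def extremes(counts):
    """Return ((largest_key, share), (smallest_key, share)), shares in percent."""
    total = float(sum(counts.values()))
    largest = max(counts, key=counts.get)
    smallest = min(counts, key=counts.get)
    return ((largest, counts[largest] * 100 / total),
            (smallest, counts[smallest] * 100 / total))


# Hypothetical asset counts, used only to illustrate the lookup.
example = {"car": 25, "life insurance": 40, "real estate": 10, "nothing": 25}
(top, top_share), (bottom, bottom_share) = extremes(example)
```

Note that the largest group is not necessarily a strict majority: in the sample data it holds 40% of the customers, which is the biggest share but still less than half.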
We hope this post has been clear in explaining how to use visualization to find patterns in the data. If you have any queries, feel free to comment below and we will get back to you at the earliest.
In our next post, we will learn about various Machine Learning classification algorithms and how to develop models using them.

The upcoming topics will complete the data science project lifecycle. Keep visiting our website Acadgild for more updates on Spark and other technologies.



Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.
