
Machine Learning with Spark – Part 3

In this post, we will be working on a dataset from a bank and try to find some patterns using Exploratory Data Analysis.
Before we go ahead let’s get acquainted with the data set.

Dataset:

Source Information
Professor Dr. Hans Hofmann
Institut für Statistik und Ökonometrie
Universität Hamburg
FB Wirtschaftswissenschaften
Von-Melle-Park 5
2000 Hamburg 13


Use Case:

When a bank receives a loan application, it has to decide, based on the applicant’s profile, whether or not to approve the loan. Two types of risk are associated with the bank’s decision:

  • If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank.
  • If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank.

Objective of Analysis:

Minimization of Risk and Maximization of Profit on Behalf of the Bank.
To minimize loss from the bank’s perspective, the bank needs a decision rule for whom to approve a loan and whom to refuse. Loan managers consider an applicant’s demographic and socio-economic profile before deciding on his or her application.
The German Credit Data contains records for 1,000 loan applicants, described by 20 variables, along with a classification of whether each applicant is considered a good or a bad credit risk.
A predictive model developed on this data is expected to guide the bank manager in deciding whether to approve a loan for a prospective applicant, based on his or her profile.
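To make the two risks concrete: misclassifying a bad applicant as good (a financial loss) is usually costlier than turning away a good one (lost business). As a minimal sketch, decision rules can be compared by their total misclassification cost; the 5-vs-1 weighting below is the one commonly distributed with the German Credit data, and the labels and helper are illustrative, not part of the dataset itself:

```python
# Illustrative cost of each (actual, predicted) pair.
# Misjudging a bad risk as good is weighted 5x worse than the reverse,
# the weighting commonly distributed with the German Credit data.
COST = {("bad", "good"): 5.0,   # approved, but defaults: financial loss
        ("good", "bad"): 1.0,   # rejected, but would have repaid: lost business
        ("good", "good"): 0.0,
        ("bad", "bad"): 0.0}

def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))

actual    = ["good", "bad", "good", "bad"]
predicted = ["good", "good", "bad", "bad"]
total_cost(actual, predicted)   # one bad->good (5) + one good->bad (1) = 6.0
```

A rule that minimizes this weighted cost, rather than the raw error rate, reflects the bank’s asymmetric risks.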
Attributes:

Creditability,
Account Balance,
Duration of Credit (month),
Payment Status of Previous Credit,
Purpose,
Credit Amount,
Value Savings/Stocks,
Length of current employment,
Instalment per cent,
Sex & Marital Status,
Guarantors,
Duration in Current address,
Most valuable available asset,
Age (years),
Concurrent Credits,
Type of apartment,
No of Credits at this Bank,
Occupation,
No of dependents,
Telephone,
Foreign Worker

 
Let’s begin by loading the data into Spark. This is a structured dataset, so as per our initial discussion, a data frame is the best choice for it.
We can load the data into a data frame in two ways:

  1. Load it through an RDD, i.e. construct an RDD, define the schema and then convert it into a data frame.

from pyspark.sql import Row
from pyspark.sql.types import *

raw_data_file = hdfs_project_root + 'german_credit.csv'

# Read the file from HDFS
raw_data = sc.textFile(raw_data_file)

# Read the first line as the header and replace its spaces with underscores
header = raw_data.first().replace(' ', '_')

# Remove the header row from the actual data
only_data = raw_data.filter(lambda r: r.replace(' ', '_') != header)

# By default everything is read as a string; convert each field to float
def to_type(data):
    return [float(i) for i in data]

parsed_data = only_data.map(lambda line: to_type(line.split(',')))

# Use these two lines instead if you want the default headers in the schema:
# fields = [StructField(field_name, DoubleType(), True) for field_name in header.split(',')]
# schema = StructType(fields)

# Shorter column names, as the default headers are lengthy
schema = StructType([
    StructField('creditability', DoubleType(), True),
    StructField('balance', DoubleType(), True),
    StructField('duration', DoubleType(), True),
    StructField('history', DoubleType(), True),
    StructField('purpose', DoubleType(), True),
    StructField('amount', DoubleType(), True),
    StructField('savings', DoubleType(), True),
    StructField('employment', DoubleType(), True),
    StructField('instPercent', DoubleType(), True),
    StructField('sexMarried', DoubleType(), True),
    StructField('guarantors', DoubleType(), True),
    StructField('residenceDuration', DoubleType(), True),
    StructField('assets', DoubleType(), True),
    StructField('age', DoubleType(), True),
    StructField('concCredit', DoubleType(), True),
    StructField('apartment', DoubleType(), True),
    StructField('credits', DoubleType(), True),
    StructField('occupation', DoubleType(), True),
    StructField('dependents', DoubleType(), True),
    StructField('hasPhone', DoubleType(), True),
    StructField('foreign', DoubleType(), True)
])

# Map the schema to the parsed data and create the data frame
Customers = sqlContext.createDataFrame(parsed_data, schema)

 

  2. Load the data through the spark-csv module.

Customers = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("./german_credit.csv")

 
We can choose either of the above procedures, but the spark-csv module is the better option, as it lets us skip all the boilerplate code for parsing the columns and separating out the header.
As of now, both the schema and the headers are inferred from the data. This is a very powerful feature of data frames. In our data, however, the headers are too long to be convenient.
We can change the headers and the data types of the columns explicitly by defining each column’s name and data type in a StructField, enclosing them in a StructType, and passing it to the CSV loader.

from pyspark.sql.types import *

schema = StructType([
    StructField('creditability', DoubleType(), True),
    StructField('balance', DoubleType(), True),
    StructField('duration', DoubleType(), True),
    StructField('history', DoubleType(), True),
    StructField('purpose', DoubleType(), True),
    StructField('amount', DoubleType(), True),
    StructField('savings', DoubleType(), True),
    StructField('employment', DoubleType(), True),
    StructField('instPercent', DoubleType(), True),
    StructField('sexMarried', DoubleType(), True),
    StructField('guarantors', DoubleType(), True),
    StructField('residenceDuration', DoubleType(), True),
    StructField('assets', DoubleType(), True),
    StructField('age', DoubleType(), True),
    StructField('concCredit', DoubleType(), True),
    StructField('apartment', DoubleType(), True),
    StructField('credits', DoubleType(), True),
    StructField('occupation', DoubleType(), True),
    StructField('dependents', DoubleType(), True),
    StructField('hasPhone', DoubleType(), True),
    StructField('foreign', DoubleType(), True)
])

# The schema is supplied explicitly, so it no longer needs to be inferred
Customers = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("./german_credit.csv", schema=schema)

 
Once the data has been loaded as a data frame, we can check the schema as shown below.

Customers.printSchema()

 

root
 |-- creditability: double (nullable = true)
 |-- balance: double (nullable = true)
 |-- duration: double (nullable = true)
 |-- history: double (nullable = true)
 |-- purpose: double (nullable = true)
 |-- amount: double (nullable = true)
 |-- savings: double (nullable = true)
 |-- employment: double (nullable = true)
 |-- instPercent: double (nullable = true)
 |-- sexMarried: double (nullable = true)
 |-- guarantors: double (nullable = true)
 |-- residenceDuration: double (nullable = true)
 |-- assets: double (nullable = true)
 |-- age: double (nullable = true)
 |-- concCredit: double (nullable = true)
 |-- apartment: double (nullable = true)
 |-- credits: double (nullable = true)
 |-- occupation: double (nullable = true)
 |-- dependents: double (nullable = true)
 |-- hasPhone: double (nullable = true)
 |-- foreign: double (nullable = true)
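As an aside: if the inferred types are acceptable and only the long headers are the problem, the columns can also be renamed after loading with toDF, instead of writing out a full StructType. A minimal sketch, assuming the inferred-schema load from option 2 above; the helper function is illustrative, not part of the Spark API:

```python
# The 21 short column names used throughout this post, in file order
short_names = ["creditability", "balance", "duration", "history", "purpose",
               "amount", "savings", "employment", "instPercent", "sexMarried",
               "guarantors", "residenceDuration", "assets", "age", "concCredit",
               "apartment", "credits", "occupation", "dependents", "hasPhone",
               "foreign"]

def rename_all(df, names):
    """Return a copy of `df` with every column renamed, in order."""
    assert len(names) == len(df.columns), "need exactly one name per column"
    return df.toDF(*names)

# On the inferred-schema frame from option 2:
# Customers = rename_all(Customers, short_names)
```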

 
Now, let’s do a quick check on the average account balance, credit amount and loan duration for each creditability class.

# Register the Customers frame as a temporary table
Customers.registerTempTable("credit")

# Query the credit table for the average balance, average loan amount and
# average duration for each class of customer, i.e. 1 and 0
results = sqlContext.sql("SELECT creditability, avg(balance) as avgbalance, avg(amount) as avgamt, \
    avg(duration) as avgdur FROM credit GROUP BY creditability")

# Check the result of the query
results.show()

 
+-------------+------------------+------------------+------------------+
|creditability|        avgbalance|            avgamt|            avgdur|
+-------------+------------------+------------------+------------------+
|          1.0|2.8657142857142857| 2985.442857142857|19.207142857142856|
|          0.0|1.9033333333333333|3938.1266666666666|             24.86|
+-------------+------------------+------------------+------------------+
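The same aggregation can also be expressed with the DataFrame API instead of SQL. A sketch, assuming the Customers frame loaded earlier; the helper simply wraps the equivalent groupBy/agg call:

```python
def avg_by_creditability(df):
    """DataFrame API equivalent of the SQL query above: group by the
    creditability class and average balance, amount and duration."""
    from pyspark.sql import functions as F  # imported here to keep the sketch self-contained
    return (df.groupBy("creditability")
              .agg(F.avg("balance").alias("avgbalance"),
                   F.avg("amount").alias("avgamt"),
                   F.avg("duration").alias("avgdur")))

# avg_by_creditability(Customers).show()  # same two-row result as the SQL version
```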
 
Similarly, we can do a quick check on the statistical summary of all numerical columns in the data frame.

Customers.describe().show()


With all the columns in a single table, it becomes very difficult to read the summary for each column. We can view the summary by separating the columns into smaller groups.

Customers.describe("creditability", "balance", "duration", "history", "purpose").show()
Customers.describe("amount", "savings", "employment", "instPercent", "sexMarried").show()
Customers.describe("guarantors", "residenceDuration", "assets", "age", "concCredit").show()
Customers.describe("apartment", "credits", "occupation", "dependents", "hasPhone").show()
Customers.describe("foreign").show()
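Rather than typing the five describe() calls by hand, the columns can be split into groups programmatically. A small sketch; the column list is hard-coded here for illustration, but on the live frame Customers.columns yields the same names:

```python
def chunked(cols, size=5):
    """Split a list of column names into consecutive groups of `size`."""
    return [cols[i:i + size] for i in range(0, len(cols), size)]

cols = ["creditability", "balance", "duration", "history", "purpose",
        "amount", "savings", "employment", "instPercent", "sexMarried",
        "guarantors", "residenceDuration", "assets", "age", "concCredit",
        "apartment", "credits", "occupation", "dependents", "hasPhone",
        "foreign"]

groups = chunked(cols)   # 21 columns -> groups of 5, 5, 5, 5 and 1

# On the live DataFrame this becomes:
# for group in chunked(Customers.columns):
#     Customers.describe(*group).show()
```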






Now we have the data loaded into Spark as a data frame. In the next blog, we will employ various EDA techniques to take a closer look at the data and also define its data dictionary. Keep visiting Acadgild for more updates on Big Data and other technologies. Click here to learn Apache Spark.

Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.


