
XGBoost Algorithm

This entry is part 10 of 17 in the series Machine Learning Algorithms

This post looks at XGBoost, one of the most popular boosting ensemble algorithms. Whether the task is regression or classification, it is known for often outperforming other ML algorithms.

Extreme Gradient Boosting (xgboost) is similar to the gradient boosting framework but more efficient. It has both a linear model solver and tree learning algorithms. What makes it fast is its capacity to do parallel computation on a single machine.

This can make xgboost more than 10 times faster than existing gradient boosting implementations. It supports various objective functions, including regression, classification, and ranking.

Because it pairs high predictive power with fast training, xgboost has become a popular choice in many machine learning competitions. It also has additional features for doing cross-validation and finding important variables.

Idea of boosting

Let’s start with an intuitive definition of the concept:

Boosting (Freund and Schapire, 1996): an algorithm that fits many weak classifiers to reweighted versions of the training data.

When using boosting, every instance in the dataset is assigned a weight that reflects how difficult it is to classify. In each subsequent iteration, the algorithm pays more attention (assigns larger weights) to instances that were previously misclassified.

In the first iteration, all instance weights are equal.

Ensemble parameters are optimized in a stagewise manner, which means that we calculate the optimal parameters for the next classifier while holding fixed what was already calculated. This might sound like a limitation, but it turns out to be a very reasonable way of regularizing the model.
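To make the reweighting idea concrete, here is a minimal sketch of a single boosting round. It assumes the classic AdaBoost update for labels in {-1, +1}; the function name adaboost_round and the tiny label lists are made up for illustration, and the sketch only works when the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost-style reweighting step for labels in {-1, +1}."""
    # Weighted error of the current weak classifier (assumed to be in (0, 1)).
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    # Classifier weight: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a wrong one, so
    # misclassified instances are scaled up and correct ones scaled down.
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]                 # hypothetical weak classifier, wrong on index 1
weights = [1 / len(y_true)] * len(y_true)   # first iteration: all weights are equal
alpha, weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in weights])
# 0.693 [0.125, 0.5, 0.125, 0.125, 0.125]
```

After one round the single misclassified instance carries half of the total weight, so the next weak classifier is pushed to get it right.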

Pros of decision trees as weak classifiers

  • Computational scalability,
  • Handling missing values,
  • Robust to outliers,
  • Does not require feature scaling,
  • Can deal with irrelevant inputs,
  • Interpretable (if small),
  • Can handle mixed predictors (quantitative and qualitative)

Cons

  • Can’t extract the linear combination of features
  • Small predictive power (high variance)

Boosting compensates for this weakness by combining many different trees (each one solving the same problem) into a single, stronger predictor.

How XGBoost helps

The problem with most tree packages is that they don’t take regularization very seriously: they allow many very similar trees to be grown, and those trees can sometimes be quite bushy.

The gradient boosted trees (GBT) approach to this problem is to add some regularization parameters. We can:

  • control tree structure (maximum depth, minimum samples per leaf),
  • control learning rate (shrinkage),
  • reduce variance by introducing randomness (stochastic gradient boosting – using random subsamples of instances and features)
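The three knobs above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any real package implements it: the trees are depth-1 stumps on a single feature (so tree structure is controlled by the fixed depth plus min_samples_leaf), the loss is squared error, and the names fit_stump and boost and the step-shaped data are invented for the example. Feature subsampling is not shown because there is only one feature.

```python
import random

def fit_stump(xs, residuals, min_samples_leaf):
    """Best depth-1 regression tree under squared error, or None if no split fits."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for cut in range(min_samples_leaf, len(xs) - min_samples_leaf + 1):
        left, right = order[:cut], order[cut:]
        if xs[left[-1]] == xs[right[0]]:
            continue  # cannot split between equal feature values
        lv = sum(residuals[i] for i in left) / len(left)
        rv = sum(residuals[i] for i in right) / len(right)
        sse = (sum((residuals[i] - lv) ** 2 for i in left)
               + sum((residuals[i] - rv) ** 2 for i in right))
        if best is None or sse < best[0]:
            best = (sse, (xs[left[-1]] + xs[right[0]]) / 2, lv, rv)
    return best[1:] if best else None

def boost(xs, ys, n_trees=50, learning_rate=0.1, subsample=0.7,
          min_samples_leaf=3, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # Randomness: each tree only sees a random subsample of the instances.
        idx = rng.sample(range(len(xs)), int(subsample * len(xs)))
        # For squared error, the negative gradient is simply the residual.
        stump = fit_stump([xs[i] for i in idx],
                          [ys[i] - preds[i] for i in idx], min_samples_leaf)
        if stump is None:
            continue
        thr, lv, rv = stump
        stumps.append(stump)
        # Shrinkage: only a fraction of each tree's correction is applied.
        preds = [p + learning_rate * (lv if x <= thr else rv)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [i / 10 for i in range(40)]
ys = [1.0 if x > 2.0 else 0.0 for x in xs]   # hypothetical step-shaped target
base, stumps, preds = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, preds))
```

Lowering learning_rate or subsample makes each tree contribute less, which usually means more trees are needed but the fit is less prone to chasing noise.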

But it could be improved even further using XGBoost.

XGBoost (extreme gradient boosting) is a more regularized version of Gradient Boosted Trees.

It was developed by Tianqi Chen in C++ and also provides interfaces for Python, R, and Julia.

The main advantages:

  • good bias-variance (simple-predictive) trade-off “out of the box”,
  • great computation speed,
  • the package is evolving (the author is open to accepting pull requests from the community)

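XGBoost’s objective function is the sum of a specific loss function evaluated over all predictions and a regularization term summed over all predictors (K trees).

Mathematically, it can be represented as:

$$\text{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where l is the loss on the i-th prediction and Ω(f_k) is the regularization term of the k-th tree.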
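As a rough numerical sketch of this objective, the snippet below adds a loss term to a per-tree penalty. Two assumptions are made for illustration: the loss is squared error, and the penalty has the form Ω(f) = γT + ½λΣw² (T leaves with weights w) given in the XGBoost documentation. The predictions and leaf weights are made-up numbers and are not derived from one another.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum(w_j ** 2) for one tree."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(y_true, y_pred, trees, gamma=1.0, lam=1.0):
    # First term: loss summed over all predictions (squared error assumed here).
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Second term: regularization summed over all K trees.
    penalty = sum(tree_penalty(leaves, gamma, lam) for leaves in trees)
    return loss + penalty

y_true = [1.0, 0.0, 1.0]
y_pred = [0.5, 0.5, 1.0]                  # hypothetical ensemble predictions
trees = [[0.5, -0.5], [0.5, 0.0, -0.5]]   # leaf weights of K = 2 hypothetical trees
print(objective(y_true, y_pred, trees))   # 6.0
```

Here the loss contributes 0.5 and the two trees contribute 2.25 and 3.25, so most of the objective comes from model complexity. Raising gamma or lam pushes the optimizer toward fewer leaves and smaller leaf weights.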

XGBoost expects numeric input, so categorical variables must be encoded as numbers before training.

Problem Statement:

To build a simple boosting classification model with XGBoost that predicts the quality of a car from a few of its other attributes.

Data details

==========================================
1. Title: Car Evaluation Database
==========================================

The dataset is available at http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

2. Sources:
   (a) Creator: Marko Bohanec
   (b) Donors: Marko Bohanec
               Blaz Zupan
   (c) Date: June, 1997

3. Past Usage:

   The hierarchical decision model, from which this dataset is
   derived, was first presented in 

   M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for
   multi-attribute decision making. In 8th Intl Workshop on Expert
   Systems and their Applications, Avignon, France. pages 59-78, 1988.

   Within machine-learning, this dataset was used for the evaluation
   of HINT (Hierarchy INduction Tool), which was proved to be able to
   completely reconstruct the original hierarchical model. This,
   together with a comparison with C4.5, is presented in

   B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by
   function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

4. Relevant Information Paragraph:

   Car Evaluation Database was derived from a simple hierarchical
   decision model originally developed for the demonstration of DEX
   (M. Bohanec, V. Rajkovic: Expert system for decision
   making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates
   cars according to the following concept structure:

   CAR                      car acceptability
   . PRICE                  overall price
   . . buying               buying price
   . . maint                price of the maintenance
   . TECH                   technical characteristics
   . . COMFORT              comfort
   . . . doors              number of doors
   . . . persons            capacity in terms of persons to carry
   . . . lug_boot           the size of luggage boot
   . . safety               estimated safety of the car

   Input attributes are printed in lowercase. Besides the target
   concept (CAR), the model includes three intermediate concepts:
   PRICE, TECH, COMFORT.

5. Number of Instances: 1728 (instances completely cover the attribute space)

6. Number of Attributes: 6

7. Attribute Values:

   buying       v-high, high, med, low
   maint        v-high, high, med, low
   doors        2, 3, 4, 5-more
   persons      2, 4, more
   lug_boot     small, med, big
   safety       low, med, high

8. Missing Attribute Values: none

9. Class Distribution (number of instances per class)

   class      N N[%]
   -----------------------------
   unacc     1210 (70.023 %) 
   acc        384 (22.222 %) 
   good        69 ( 3.993 %) 
   v-good      65 ( 3.762 %)

Tools to be used:

NumPy, pandas, scikit-learn, xgboost

Python Implementation with code:

0. Import necessary libraries

Import the necessary modules from specific libraries.

import numpy as np
import pandas as pd
from sklearn import metrics, model_selection
from xgboost import XGBClassifier

1. Load the data set

Use the pandas module to read the car data from the file system, then check the first few records of the dataset.

# car.data has no header row, so the column names are supplied explicitly
data = pd.read_csv(
    'data/car_quality/car.data',
    names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'],
)
data.head()

  buying maint doors persons lug_boot safety class
0 vhigh  vhigh 2     2       small    low    unacc
1 vhigh  vhigh 2     2       small    med    unacc
2 vhigh  vhigh 2     2       small    high   unacc
3 vhigh  vhigh 2     2       med      low    unacc
4 vhigh  vhigh 2     2       med      med    unacc

2. Check some information about the data set

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB

The dataset has 1728 rows and 7 columns.

There are no missing values in the dataset.

3. Identify the target variable

data['class'], class_names = pd.factorize(data['class'])

The target variable is the class column of the data frame. Its values are stored as strings, but the algorithm requires integer codes. We can convert the string categories into integer codes using the factorize method of the pandas library, which numbers the categories in order of first appearance.

Let’s check the encoded values now.

print(class_names)
print(data['class'].unique())

Index([u'unacc', u'acc', u'vgood', u'good'], dtype='object')
[0 1 2 3]

As we can see, the values have been encoded into 4 different numeric labels.
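To see what this encoding does, here is a small pure-Python sketch that mimics the default behaviour of pd.factorize. It is not the pandas implementation, it ignores missing-value handling, and the short label list is made up for the example.

```python
def factorize(values):
    """Return (codes, uniques), numbering categories by first appearance."""
    uniques = []
    index = {}
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(uniques)   # next unused integer code
            uniques.append(v)
        codes.append(index[v])
    return codes, uniques

codes, uniques = factorize(['unacc', 'acc', 'unacc', 'vgood', 'good', 'acc'])
print(codes)       # [0, 1, 0, 2, 3, 1]
print(uniques)     # ['unacc', 'acc', 'vgood', 'good']
print(uniques[2])  # vgood  (decoding a predicted code back to its label)
```

Keeping the list of unique labels around (class_names in the tutorial code) is what lets us translate predicted integer codes back into readable class names later.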

4. Identify the predictor variables and encode any string variables to equivalent integer codes

# factorize numbers the categories by order of first appearance,
# not by their natural order (e.g. low < med < high)
for column in ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']:
    data[column], _ = pd.factorize(data[column])
data.head()

   buying maint doors persons lug_boot safety class
0  0      0     0     0       0        0      0
1  0      0     0     0       0        1      0
2  0      0     0     0       0        2      0
3  0      0     0     0       1        0      0
4  0      0     0     0       1        1      0

Check the data types now :

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null int64
maint       1728 non-null int64
doors       1728 non-null int64
persons     1728 non-null int64
lug_boot    1728 non-null int64
safety      1728 non-null int64
class       1728 non-null int64
dtypes: int64(7)
memory usage: 94.6 KB

Everything is now converted to integer form.

5. Select the predictor features and the target variable

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

6. Train/test split

# split data randomly into 70% training and 30% test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=123)
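Because the good and v-good classes each make up less than 4% of the data, a purely random split can leave very few of them in the test set. One common remedy, which the code above does not use, is a stratified split; scikit-learn's train_test_split accepts a stratify argument for this. The sketch below shows the idea in plain Python, with a hypothetical stratified_split helper and made-up labels.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=123):
    """Return (train_idx, test_idx), keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(test_size * len(indices))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: class 0 dominates, as unacc does in the car data.
labels = [0] * 70 + [1] * 20 + [2] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))                 # 70 30
print(sorted(set(labels[i] for i in test_idx)))      # [0, 1, 2]
```

Every class is guaranteed to appear in the test set in about the same proportion as in the full data, which makes per-class evaluation more trustworthy.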

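7. Training/model fitting

params = {
    'objective': 'multi:softprob',  # the target has 4 classes, so a multi-class objective is needed
    'max_depth': 2,
    'learning_rate': 1.0,
    'verbosity': 0,                 # replaces the deprecated 'silent' flag
    'n_estimators': 5
}

model = XGBClassifier(**params).fit(X_train, y_train)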

8. Prediction and evaluation

# use the model to make predictions with the test data
y_pred = model.predict(X_test)
# how did our model perform?
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 58
Accuracy: 0.89

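The model reaches 89% accuracy on the test set, which is a reasonable result for only five shallow trees. Keep in mind that always predicting the majority class (unacc, about 70% of the instances) would already score around 70%, so accuracy alone can flatter a model on imbalanced data. That’s how to implement your first xgboost model with the scikit-learn interface. Load your favorite dataset and give it a try!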
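Since the classes are heavily imbalanced, it can help to look at recall per class in addition to overall accuracy. The sketch below uses a hypothetical per_class_recall helper and made-up labels, not the tutorial's actual predictions. In practice, scikit-learn's metrics.classification_report and metrics.confusion_matrix give the same kind of breakdown.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Hypothetical labels: class 0 dominates and the rare classes are often missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.8
print(per_class_recall(y_true, y_pred))  # {0: 1.0, 1: 0.5, 2: 0.0}
```

An accuracy of 0.8 looks fine here, yet the rarest class is never predicted correctly, which is exactly what a single accuracy number hides.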

Algorithm Advantages:

Parallel Computing: It is enabled with parallel processing (using OpenMP); i.e., when you run xgboost, by default, it would use all the cores of your laptop/machine.

Regularization: I believe this is the biggest advantage of xgboost. Standard GBM has no explicit regularization term in its objective, whereas xgboost penalizes model complexity directly. Regularization is a technique used to avoid overfitting in linear and tree-based models.

Built-in Cross Validation: In R, we usually rely on external packages such as caret and mlr to obtain CV results. xgboost, however, comes with its own internal CV function.

Missing Values: XGBoost is designed to handle missing values internally. The missing values are treated in such a manner that if there exists any trend in missing values, it is captured by the model.

Flexibility: In addition to regression, classification, and ranking problems, it also supports user-defined objective functions. An objective function measures the performance of the model given a certain set of parameters. Furthermore, it supports user-defined evaluation metrics as well.
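A user-defined objective for xgboost boils down to supplying the first and second derivatives (gradient and Hessian) of the loss with respect to the raw prediction for every instance. As a minimal sketch, here are those two quantities for the logistic loss in plain Python. The function name is made up, and the exact callable signature xgboost expects differs between the native and scikit-learn interfaces, so check the documentation for the version in use.

```python
import math

def logistic_grad_hess(y_true, raw_scores):
    """Gradient and Hessian of the log loss with respect to the raw scores."""
    grad, hess = [], []
    for y, s in zip(y_true, raw_scores):
        p = 1.0 / (1.0 + math.exp(-s))   # sigmoid turns a raw score into a probability
        grad.append(p - y)               # first derivative of the loss
        hess.append(p * (1.0 - p))       # second derivative of the loss
    return grad, hess

grad, hess = logistic_grad_hess([1, 0], [0.0, 0.0])
print(grad, hess)   # [-0.5, 0.5] [0.25, 0.25]
```

The sign of the gradient tells the next tree which way to move each prediction, and the Hessian scales how large that move should be.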

Availability: Currently, it is available for programming languages such as R, Python, Java, Julia, and Scala.

Save and Reload: XGBoost lets us save the data matrix and the model and reload them later. If we have a large dataset, we can simply save the trained model and use it in the future instead of wasting time redoing the computation.

Tree Pruning: Unlike GBM, where tree growth stops once a split with negative loss reduction is encountered, XGBoost grows the tree up to the maximum depth (max_depth) and then prunes backward, removing splits whose improvement in the loss function is below a threshold.
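The difference between stopping early and pruning backward is easiest to see on a toy tree. The sketch below is an illustration under simplifying assumptions, not xgboost's actual pruner: the tree is a nested dict, each split stores a precomputed gain, leaves carry no weights, and gamma plays the role of the threshold.

```python
def prune(node, gamma):
    """Post-prune a fully grown tree bottom-up and return the pruned tree."""
    if 'gain' not in node:   # a leaf: nothing to prune
        return node
    # Prune the children first, so decisions propagate from the bottom up.
    node = dict(node, left=prune(node['left'], gamma),
                right=prune(node['right'], gamma))
    children_are_leaves = 'gain' not in node['left'] and 'gain' not in node['right']
    if children_are_leaves and node['gain'] < gamma:
        return {'leaf': True}   # collapse the weak split into a single leaf
    return node

tree = {
    'gain': -0.5,            # weak split at the root ...
    'left': {'gain': 4.0,    # ... that enables a strong split below it
             'left': {'leaf': True}, 'right': {'leaf': True}},
    'right': {'gain': 0.2,   # weak split with nothing useful below it
              'left': {'leaf': True}, 'right': {'leaf': True}},
}
pruned = prune(tree, gamma=1.0)
print('gain' in pruned, 'gain' in pruned['left'], 'gain' in pruned['right'])
# True True False
```

The root split has a negative gain, so a grower that stops at the first bad split would never discover the strong split underneath it. Pruning backward keeps the root because something valuable survives below, and removes only the weak right-hand split.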


Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.
