Data Science and Artificial Intelligence

XGBoost Case Study 2

Case Study on Human Activity Recognition Using an XGBoost Model

Human Activity Recognition, or HAR for short, is the problem of predicting what a person is doing from a trace of their movement recorded by sensors.

The movements are often ordinary indoor activities such as standing, sitting, jumping, and going up stairs. The sensors are usually carried or worn by the subject, for example in a smartphone or a vest, and often record accelerometer data in three dimensions (x, y, z).

The idea is that once the subject’s activity is recognized and known, an intelligent computer system can then offer assistance.

It is a challenging problem because there is no clear analytical way to relate sensor data to specific actions in a general way. It is also technically challenging because of the large volume of sensor data collected (e.g. tens or hundreds of observations per second) and the classical reliance on hand-crafted features and heuristics derived from this data when developing predictive models.
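To make "hand-crafted features" concrete, here is a minimal sketch of summarising a raw accelerometer trace into fixed-length windows. It is not the approach used later in this case study, which feeds raw x, y, z values to the model. The one-second window (52 samples at 52 Hz), the choice of mean and standard deviation, and the function name `window_features` are assumptions made for illustration, and the signal is synthetic.

```python
import numpy as np

def window_features(signal, window=52):
    """Split a (n_samples, 3) trace into non-overlapping windows and
    return the per-axis mean and standard deviation of each window."""
    n_windows = len(signal) // window            # drop the incomplete tail
    trimmed = signal[:n_windows * window]
    windows = trimmed.reshape(n_windows, window, signal.shape[1])
    means = windows.mean(axis=1)                 # shape (n_windows, 3)
    stds = windows.std(axis=1)                   # shape (n_windows, 3)
    return np.hstack([means, stds])              # shape (n_windows, 6)

# synthetic stand-in for 10.5 seconds of x, y, z readings
rng = np.random.default_rng(0)
trace = rng.normal(loc=2000, scale=50, size=(546, 3))
features = window_features(trace)
print(features.shape)
```

Each row of `features` then describes one second of movement instead of a single reading, which is the kind of summary classical HAR pipelines tend to build on.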

Problem Description

The dataset “Activity Recognition from Single Chest-Mounted Accelerometer Data Set” was collected and made available by Casale, Pujol et al. from the University of Barcelona in Spain. It is freely available from the UCI Machine Learning repository:

https://archive.ics.uci.edu/ml/datasets/Activity+Recognition+from+Single+Chest-Mounted+Accelerometer

The dataset consists of uncalibrated accelerometer data from 15 different subjects, each performing 7 activities. Each subject wore a custom-developed chest-mounted accelerometer, and data was collected at 52 Hz (52 observations per second).

Data Description

Un-calibrated Accelerometer Data are collected from 15 participants performing 7 activities. The dataset provides challenges for identification and authentication of people using motion patterns.

Data Set Information:

- The dataset collects data from a wearable accelerometer mounted on the chest
- Sampling frequency of the accelerometer: 52 Hz
- Accelerometer data are uncalibrated
- Number of participants: 15
- Number of activities: 7
- Data format: CSV

Attribute Information:

- Data are separated by participant
- Each file contains the following information:
  - Sequential number, x acceleration, y acceleration, z acceleration, label
- Labels are codified by numbers:
  - 1: Working at Computer
  - 2: Standing Up, Walking and Going Up/Down Stairs
  - 3: Standing
  - 4: Walking
  - 5: Going Up/Down Stairs
  - 6: Walking and Talking with Someone
  - 7: Talking while Standing
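The file layout described above can be sketched as follows. The three rows are made-up values in the documented column order, not real readings, and the names `seq`, `ACTIVITY_NAMES` and `activity` are chosen here for illustration only.

```python
import io
import pandas as pd

# label codes as listed in the attribute information
ACTIVITY_NAMES = {
    1: 'Working at Computer',
    2: 'Standing Up, Walking and Going Up/Down Stairs',
    3: 'Standing',
    4: 'Walking',
    5: 'Going Up/Down Stairs',
    6: 'Walking and Talking with Someone',
    7: 'Talking while Standing',
}

# made-up rows: sequential number, x, y, z, label (the files have no header row)
sample_csv = io.StringIO(
    "0,1500,2200,2100,1\n"
    "1,1510,2190,2110,1\n"
    "2,1900,2400,2000,4\n"
)
columns = ['seq', 'x', 'y', 'z', 'label']
df = pd.read_csv(sample_csv, header=None, names=columns)

# attach a readable activity name to every row
df['activity'] = df['label'].map(ACTIVITY_NAMES)
print(df)
```

Any label code missing from the mapping comes out as NaN in the `activity` column, which makes undocumented codes easy to spot.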

 

Data Reading

data_dir = '/home/ubuntu/Documents/LearnersHeaven/content_preparation/course_content/development/case_studies/ML/data/HumanActivityRecogmnition/Activity Recognition from Single Chest-Mounted Accelerometer_'

 

Import libraries and tools

from glob import glob
import pandas as pd
%matplotlib inline
all_data = glob(data_dir+"/*.csv")
all_data[:3]

Output:

['/home/ubuntu/Documents/LearnersHeaven/content_preparation/course_content/development/case_studies/ML/data/HumanActivityRecogmnition/Activity Recognition from Single Chest-Mounted Accelerometer_/4.csv',
 '/home/ubuntu/Documents/LearnersHeaven/content_preparation/course_content/development/case_studies/ML/data/HumanActivityRecogmnition/Activity Recognition from Single Chest-Mounted Accelerometer_/2.csv',
 '/home/ubuntu/Documents/LearnersHeaven/content_preparation/course_content/development/case_studies/ML/data/HumanActivityRecogmnition/Activity Recognition from Single Chest-Mounted Accelerometer_/8.csv']

 

Load data set

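def load_dataset(all_data):
    # read every per-subject CSV and tag its rows with a subject id
    frames = []
    for i, name in enumerate(all_data):
        df = pd.read_csv(name, header=None)
        df['subject_id'] = i + 1  # load order, not the number in the file name
        frames.append(df.iloc[:, 1:])  # drop the sequential-number column
    # DataFrame.append was removed in pandas 2.0, so concatenate once instead
    return pd.concat(frames, ignore_index=True)

subjects_df = load_dataset(all_data)
subjects_df.columns = ['x', 'y', 'z', 'label', 'subject_id']
subjects_df.head()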

Output: [table showing the first five rows of subjects_df, with columns x, y, z, label and subject_id]

print('Loaded %d subjects' % len(subjects_df.subject_id.unique()))

 

Output: Loaded 15 subjects

 

Plot a subject

from matplotlib import pyplot

# plot the x, y, z acceleration and activities for a single subject
def plot_subject(subject):
    pyplot.figure()
    # one stacked subplot per column: x, y, z, then the activity label
    for col in range(subject.shape[1]):
        pyplot.subplot(subject.shape[1], 1, col+1)
        pyplot.plot(subject[:,col])
    pyplot.show()

# plot activities for a single subject
plot_subject(subjects_df[subjects_df.subject_id==1].iloc[:,:4].values)

Running the example creates a line plot for each variable for the first loaded subject. We can see some very large movement at the beginning of the sequence that may be an outlier or unusual behaviour that could be removed. We can also see that the subject performed some actions multiple times. For example, a closer look at the plot of the class variable (bottom plot) suggests the subject performed activities in the following order: 1, 2, 0, 3, 0, 4, 3, 5, 3, 6, 7. Note that activity 3 appears three times in this sequence and activity 0 twice.
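Reading the order of activities off a plot is error-prone, so it can help to derive it from the label column directly. The sketch below collapses consecutive repeats of a label into a single entry; the helper name `activity_order` and the short label list are illustrative and not part of the original code.

```python
from itertools import groupby

def activity_order(labels):
    """Collapse consecutive repeats so each run of a label appears once."""
    return [label for label, _ in groupby(labels)]

# toy label sequence standing in for one subject's label column
labels = [1, 1, 1, 2, 2, 0, 3, 3, 0, 4, 4, 4]
print(activity_order(labels))
```

With the data loaded above, the same helper could be applied to `subjects_df[subjects_df.subject_id == 1]['label']`, assuming the rows are still in their recorded order.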

Plot Total Activity Durations

# collect one array of x, y, z, label per subject
subjects = []
for k, values in subjects_df.groupby('subject_id'):
    subjects.append(values.iloc[:,:4].values)

# returns a list of dict, where each dict has one sequence per activity
def group_by_activity(subjects, activities):
    grouped = [{a:s[s[:,-1]==a] for a in activities} for s in subjects]
    return grouped

# calculate total duration in sec for each activity per subject
def calculate_durations(grouped, activities):
    # samples per activity divided by the 52 Hz sampling rate gives seconds
    freq = 52
    durations = [[len(s[a])/freq for s in grouped] for a in activities]
    return durations

# box plot of the per-subject durations, one box per activity
def plot_durations(grouped, activities):
    durations = calculate_durations(grouped, activities)
    pyplot.boxplot(durations, labels=activities)
    pyplot.show()

# group the data by activity (labels 0 to 7)
activities = [i for i in range(0,8)]
grouped = group_by_activity(subjects, activities)
# plot durations
plot_durations(grouped, activities)

We can see that there are relatively few observations for activities 0 (no activity), 2 (standing up, walking and going up/down stairs), 5 (going up/down stairs) and 6 (walking and talking). We can also see that each subject spent a lot of time on activity 1 (working at computer) and activity 7 (talking while standing).
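The durations behind the box plot are simply row counts divided by the sampling rate. As a compact alternative to the list-based helpers, the same arithmetic can be sketched with a pandas group-by; the toy frame below is synthetic, and only its `subject_id` and `label` columns mirror the real data.

```python
import pandas as pd

SAMPLING_HZ = 52  # from the dataset description

# synthetic frame with the same two columns used for grouping in subjects_df
toy = pd.DataFrame({
    'subject_id': [1] * 156 + [2] * 104,
    'label':      [1] * 104 + [7] * 52 + [1] * 52 + [4] * 52,
})

# rows per (subject, activity) divided by samples per second gives seconds
seconds = toy.groupby(['subject_id', 'label']).size() / SAMPLING_HZ
print(seconds)
```

Here subject 1 has 104 rows of activity 1, which at 52 rows per second corresponds to 2 seconds.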

 

calculate_durations(grouped, activities)

# plot the x, y, z acceleration for each subject
def plot_subjects(subjects):
    pyplot.figure()
    # create a plot for each subject, all sharing the first plot's x-axis
    xaxis = None
    for i in range(len(subjects)):
        ax = pyplot.subplot(len(subjects), 1, i+1, sharex=xaxis)
        if i == 0:
            xaxis = ax
        # plot a histogram of the x, y and z data (every column except the label)
        for j in range(subjects[i].shape[1]-1):
            pyplot.hist(subjects[i][:,j], bins=100)
    pyplot.show()

plot_subjects(subjects)

Running the example creates a single figure with 15 plots, one for each subject, with 3 histograms on each plot, one for each of the 3 axes of accelerometer data. The three colors blue, orange and green represent the x, y and z axes. This plot suggests that the distribution of each accelerometer axis is Gaussian or very close to Gaussian. This may help with simple outlier detection and removal along each axis of the accelerometer data. The plot also helps to show both the relationship between the distributions within a subject and the differences in the distributions between subjects.

Within each subject, a common pattern is for the x (blue) and z (green) data to be grouped together on the left while the y data (orange) sits separately on the right. The distribution of y is often sharper, whereas the distributions of x and z are flatter.

Across subjects, we can see a general clustering of values around 2,000 (whatever the units are), although with a lot of spread. This marked difference in distributions does suggest the need to at least standardize (shift to zero mean and unit variance) the data per axis and per subject before any cross-subject modelling is performed.
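Per-axis, per-subject standardization can be sketched as below. This step is not applied in the modelling that follows, so treat it as an optional illustration; the toy frame is synthetic, with two subjects whose axes are deliberately centred at different values, and the names `toy`, `by_subject` and `standardized` are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'x': np.concatenate([rng.normal(1900, 40, 200), rng.normal(2050, 60, 200)]),
    'y': np.concatenate([rng.normal(2400, 20, 200), rng.normal(2300, 30, 200)]),
    'z': np.concatenate([rng.normal(2000, 50, 200), rng.normal(2100, 45, 200)]),
    'subject_id': [1] * 200 + [2] * 200,
})

axes = ['x', 'y', 'z']
by_subject = toy.groupby('subject_id')[axes]
# subtract each subject's own mean and divide by that subject's own std
standardized = (toy[axes] - by_subject.transform('mean')) / by_subject.transform('std')
standardized['subject_id'] = toy['subject_id']

# after the transform every subject has mean ~0 and std ~1 on every axis
print(standardized.groupby('subject_id')[axes].mean().round(6))
print(standardized.groupby('subject_id')[axes].std().round(6))
```

Grouping by subject before computing the statistics is what removes the between-subject offsets; standardizing the whole frame at once would leave them in place.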

 

 

Defining the variables

X = subjects_df[['x','y','z']]
y = subjects_df['label']

Train data split

#evaluate the model by splitting into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=12)

 

Modeling

# fit a multi-class XGBoost classifier, monitoring the train and test sets
from xgboost.sklearn import XGBClassifier
model = XGBClassifier(objective = 'multi:softprob',
                      colsample_bylevel = 0.7,
                      colsample_bytree = 0.8,
                      gamma = 1,
                      learning_rate = 0.15,
                      max_delta_step = 3,
                      max_depth = 4,
                      min_child_weight = 1,
                      n_estimators = 50,
                      reg_lambda = 10,
                      scale_pos_weight = 1.5,  # aimed at binary problems; kept from the original settings
                      subsample = 0.9,
                      verbosity = 1,  # replaces the older `silent` flag
                      n_jobs = 4,
                      # XGBoost 1.6+ takes early stopping here; 2.0 removed it from fit()
                      early_stopping_rounds = 20
                      )
model.fit(X_train, y_train, eval_set = [(X_train, y_train), (X_test, y_test)])
# use the model to make predictions with the test data
predicted = model.predict(X_test)
# clear early stopping so the estimator can be refitted later without an eval_set
model.set_params(early_stopping_rounds = None)
# import metrics
from sklearn import metrics
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))

Output:

0.6754176404546862

#Print out the confusion matrix
print(metrics.confusion_matrix(y_test, predicted))

Output:

[[     0    602      0     40    177      0      0    277]
 [     0 162305      1    577   4268      1     18  15650]
 [     0   7388      1    267   2739      0      4   4067]
 [     0   9127      0  16721  13064      0    134  25734]
 [     0  14552      3   3686  66171      3     45  22566]
 [     0   2484      1   2345   5955      0    101   4571]
 [     0   1007      0   2225   2550      0    566   8078]
 [     0  17018      0   5018  11042      0    246 144674]]

 

#Print out the classification report, and check the f1 score
print(metrics.classification_report(y_test, predicted))

Output:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1096
           1       0.76      0.89      0.82    182820
           2       0.17      0.00      0.00     14466
           3       0.54      0.26      0.35     64780
           4       0.62      0.62      0.62    107026
           5       0.00      0.00      0.00     15457
           6       0.51      0.04      0.07     14426
           7       0.64      0.81      0.72    177998

   micro avg       0.68      0.68      0.68    578069
   macro avg       0.40      0.33      0.32    578069
weighted avg       0.63      0.68      0.64    578069
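The summary numbers in the report can be re-derived from the confusion matrix printed earlier, which is a useful sanity check. The sketch below copies that matrix by hand and recomputes accuracy, per-class precision and per-class recall with NumPy. The variable names are illustrative, and treating a class that is never predicted as having precision 0 is an assumption chosen to mirror the 0.00 entries in the report.

```python
import numpy as np

# confusion matrix from the output above (rows: true labels 0-7, columns: predictions)
cm = np.array([
    [0,    602, 0,    40,   177, 0,   0,    277],
    [0, 162305, 1,   577,  4268, 1,  18,  15650],
    [0,   7388, 1,   267,  2739, 0,   4,   4067],
    [0,   9127, 0, 16721, 13064, 0, 134,  25734],
    [0,  14552, 3,  3686, 66171, 3,  45,  22566],
    [0,   2484, 1,  2345,  5955, 0, 101,   4571],
    [0,   1007, 0,  2225,  2550, 0, 566,   8078],
    [0,  17018, 0,  5018, 11042, 0, 246, 144674],
])

true_pos = np.diag(cm)            # correctly classified samples per class
support = cm.sum(axis=1)          # samples per true class
predicted_n = cm.sum(axis=0)      # samples per predicted class

accuracy = true_pos.sum() / cm.sum()
recall = true_pos / support
# classes that are never predicted get precision 0 instead of a division error
precision = np.divide(true_pos, predicted_n,
                      out=np.zeros(len(cm)), where=predicted_n > 0)

print(round(accuracy, 4))
print(np.round(precision, 2))
print(np.round(recall, 2))
```

The recomputed values line up with the report: the model does well on the two largest classes (1 and 7) and almost never predicts classes 0, 2 and 5, which is why the macro average sits far below the weighted average.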

 

 

Visualisation

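# plot_tree relies on the graphviz package (and the Graphviz binaries) being installed
from xgboost import plot_tree
import graphviz
from matplotlib import pyplot as plt
# draw the first tree of the fitted model
plot_tree(model, num_trees = 0)
# enlarge the figure so the nodes stay readable, then save it to disk
fig = plt.gcf()
fig.set_size_inches(300, 100)
fig.savefig('tree.png')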

#Find out the mean cross validation score/accuracy of the fitted model, use 5 cv steps
from sklearn.model_selection import cross_val_score
num_cv = 5
print("cross validated accuracy: ",cross_val_score(model, X, y, cv=num_cv, scoring='accuracy').mean())

 

 

prateek

An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a Subject-matter expert in the field of Big Data, Hadoop ecosystem, and Spark.
