Case Study on Donut Data Set using Linear Regression
The Donut data set was provided during an interview round, and candidates were asked to solve the problem statement hands-on. The data set can be accessed through the following link.
Reading in the data
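The loading code is not shown in this write-up, so the following is only a minimal sketch of how the data might be read with pandas. The helper name load_donut and the tiny in-memory sample are assumptions for illustration; the real file has the columns listed in the output below.

```python
import io

import pandas as pd


def load_donut(source):
    """Read the donut CSV from a path or file-like object (hypothetical helper)."""
    df = pd.read_csv(source)
    # Strip stray whitespace from the header so column lookups are predictable.
    df.columns = df.columns.str.strip()
    return df


# Made-up stand-in for the real file, using a few of its column names.
sample = io.StringIO(
    "Donut ID,Donut Estimator 1,Location,Donut volume\n"
    "1,10.5,Texas,80.1\n"
    "2,11.0,Other,85.3\n"
)
df = load_donut(sample)
print(df.columns)
```

With the real data, the same function would be called with the path to the downloaded CSV instead of the in-memory sample.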
Output: Index(['Donut ID', 'Donut Estimator 1', 'Donut Area of cross section',
       'Donut Area of circumference circle',
       'Donut area of central hole / Donut Area of circumscribed circle',
       'Donut Estimator 2', 'Donut Estimator 3', 'Donut Estimator 4',
       'Donut Estimator 5', 'Donut volume Estimator 6', 'Location',
       'Donut Density', 'Donut volume'],
      dtype='object')
Handling categorical features
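The feature list further down contains a Location_Texas column, which suggests that Location was one-hot encoded with one level dropped. A minimal sketch of that step, assuming pandas get_dummies was used; the second location name "Other" is a made-up placeholder.

```python
import pandas as pd

# Toy frame; "Other" is a placeholder for the unknown second location.
df = pd.DataFrame(
    {
        "Donut Estimator 1": [1.0, 2.0, 3.0],
        "Location": ["Texas", "Other", "Texas"],
    }
)

# drop_first=True removes the first level alphabetically ("Other" here),
# leaving a single 0/1 indicator column named Location_Texas.
encoded = pd.get_dummies(df, columns=["Location"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids a redundant indicator column, which would otherwise be perfectly collinear with the intercept in a linear regression.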
Visualizing the data
Output: ['Donut Estimator 1',
 'Donut Area of cross section',
 'Donut Area of circumference circle',
 'Donut area of central hole / Donut Area of circumscribed circle',
 'Donut Estimator 2',
 'Donut Estimator 3',
 'Donut Estimator 4',
 'Donut Estimator 5',
 'Donut volume Estimator 6',
 'Location_Texas']
Check for multi-collinearity
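The exact check used here is not shown. One common way to look for multicollinearity is to inspect the pairwise correlation matrix together with variance inflation factors (VIF). The sketch below is an assumption, not the original code; it uses the fact that the VIFs are the diagonal of the inverse correlation matrix, and it runs on synthetic columns.

```python
import numpy as np
import pandas as pd


def vif_table(X):
    """VIF per column, computed as the diagonal of the inverse correlation matrix."""
    corr = X.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=X.columns).sort_values(ascending=False)


# Synthetic features: "a_plus_noise" is almost a copy of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.05 * rng.normal(size=200)})

print(X.corr().round(2))
vif = vif_table(X)
print(vif)
```

A VIF well above roughly 5 to 10 is usually read as a sign that a feature is largely explained by the others.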
Creating baseline with null RMSE
Null RMSE is the RMSE that could be achieved by always predicting the mean response value. It is a benchmark against which you may want to measure your regression model.
Output: The baseline guess for Donut volume is a score of 83.40
Baseline Performance on the test set for Donut volume: RMSE = 10.5856
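The baseline described above can be sketched as follows: take the mean of the training target as the constant prediction and compute the RMSE of that constant on the test target. The function name null_rmse and the numbers are illustrative only, not the values from the data set.

```python
import numpy as np


def null_rmse(y_train, y_test):
    """Return the mean-of-train baseline and its RMSE on the test target."""
    baseline = float(np.mean(y_train))
    errors = np.asarray(y_test, dtype=float) - baseline
    return baseline, float(np.sqrt(np.mean(errors ** 2)))


# Made-up target values for illustration.
y_train = [80.0, 85.0, 90.0]
y_test = [82.0, 88.0]
baseline, rmse = null_rmse(y_train, y_test)
print(f"baseline = {baseline:.2f}, null RMSE = {rmse:.4f}")
```

Any regression model worth keeping should achieve a test RMSE below this value.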
Fitting on entire features
# define a function that accepts a list of features and returns the fitted model and testing RMSE
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

def train_test_rmse(train_df, test_df, feature_cols, target):
    # the train and test frames are already split, so only select the columns
    X_train, X_test = train_df[feature_cols], test_df[feature_cols]
    y_train, y_test = train_df[target], test_df[target]
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    # plot_line is a plotting helper that is not shown in this write-up
    plot_line(y_test, y_pred, target)
    return linreg, np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# compare different sets of features
# targets: 'Donut Density', 'Donut volume'
# Donut Density
linreg, rmse_ = train_test_rmse(train_df, test_df, feature_cols, 'Donut Density')
print(rmse_)
#print(linreg.score(X_test, y_test))
# pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))
Output: 4.205028275576086
[('Donut Estimator 1', 0.019969286018910806),
 ('Donut Area of cross section', 0.056824378116820666),
 ('Donut Area of circumference circle', -0.011141124820625037),
 ('Donut area of central hole / Donut Area of circumscribed circle',
  18.37532899880506),
 ('Donut Estimator 2', -0.013662198041061776),
 ('Donut Estimator 3', 0.03211097048326648),
 ('Donut Estimator 4', 0.0005844272728200844),
 ('Donut Estimator 5', 0.001613921718232509),
 ('Donut volume Estimator 6', -0.05474623086568598),
 ('Location_Texas', -2.3923430721543717)]
Output: 9.625589368125372
[('Donut Estimator 1', 0.015167756318128774),
 ('Donut Area of cross section', 0.12148190231269254),
 ('Donut Area of circumference circle', -0.06206356694138626),
 ('Donut area of central hole / Donut Area of circumscribed circle',
  49.63679964426164),
 ('Donut Estimator 2', -0.016590844171362112),
 ('Donut Estimator 3', 0.1104276320893024),
 ('Donut Estimator 4', 0.004140928646805438),
 ('Donut Estimator 5', 0.00040124955448831573),
 ('Donut volume Estimator 6', -0.1388884821714789),
 ('Location_Texas', -5.674542346646663)]
Check for missing values
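The missing-value check itself is not shown. A minimal sketch, assuming a per-column count with pandas, could look like this. The toy frame is made up, and median imputation is only one possible treatment, not necessarily what was done here.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, for illustration only.
df = pd.DataFrame(
    {
        "Donut Estimator 1": [1.0, np.nan, 3.0],
        "Donut volume": [80.0, 85.0, 90.0],
    }
)

# Count missing values per column and show only the affected columns.
missing = df.isnull().sum()
print(missing[missing > 0])

# One possible treatment (an assumption): fill numeric gaps with the column median.
df_filled = df.fillna(df.median(numeric_only=True))
```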
Check and treat outliers
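The outlier treatment used for the results below is not shown. As an illustration only, the sketch assumes the common 1.5 × IQR rule and caps values at the fences instead of dropping rows. The function name clip_outliers_iqr and the sample values are hypothetical.

```python
import pandas as pd


def clip_outliers_iqr(df, cols, k=1.5):
    """Cap each listed column at [Q1 - k*IQR, Q3 + k*IQR]; returns a copy."""
    out = df.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out


# Made-up values with one obvious outlier.
df = pd.DataFrame({"Donut Estimator 1": [10.0, 11.0, 12.0, 13.0, 100.0]})
treated = clip_outliers_iqr(df, ["Donut Estimator 1"])
print(treated)
```

Capping keeps the row count unchanged, which matters when the train and test frames are already fixed. Removing the offending rows is the usual alternative.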
Output: 3.746191877394081
[('Donut Estimator 1', 0.0288295162848659),
 ('Donut Area of cross section', 0.03642288126891477),
 ('Donut Area of circumference circle', -0.023677186635323165),
 ('Donut area of central hole / Donut Area of circumscribed circle',
  15.21302451820518),
 ('Donut Estimator 2', 0.013458144257911404),
 ('Donut Estimator 3', 0.016877294651321272),
 ('Donut Estimator 4', -0.08072455883436068),
 ('Donut Estimator 5', 0.08147306364736875),
 ('Donut volume Estimator 6', -0.03985676958030972),
 ('Location_Texas', -0.7118962569491505)]
Output: 9.36845277397271
[('Donut Estimator 1', 0.02302115009651563),
 ('Donut Area of cross section', 0.07538537877128039),
 ('Donut Area of circumference circle', -0.07290185928673625),
 ('Donut area of central hole / Donut Area of circumscribed circle',
  38.60569659761433),
 ('Donut Estimator 2', 0.0479577123217777),
 ('Donut Estimator 3', 0.05132482421369051),
 ('Donut Estimator 4', -0.14289021476649924),
 ('Donut Estimator 5', 0.15684599831219803),
 ('Donut volume Estimator 6', -0.0934921576017296),
 ('Location_Texas', -1.7729050771242068)]
Treat multi-collinearity
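How the reduced feature set in the outputs below was chosen is not stated. One common approach, sketched here as an assumption, is to repeatedly drop the feature with the highest variance inflation factor until every remaining VIF falls under a threshold. The function name drop_high_vif, the threshold of 10 and the synthetic data are illustrative.

```python
import numpy as np
import pandas as pd


def drop_high_vif(X, threshold=10.0):
    """Iteratively remove the column with the largest VIF above the threshold."""
    cols = list(X.columns)
    while len(cols) > 1:
        # VIFs are the diagonal of the inverse correlation matrix.
        # Exactly duplicated columns would make this matrix singular.
        corr = X[cols].corr().to_numpy()
        vif = np.diag(np.linalg.inv(corr))
        worst = int(np.argmax(vif))
        if vif[worst] <= threshold:
            break
        cols.pop(worst)
    return cols


# Synthetic features: "a_plus_noise" nearly duplicates "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.05 * rng.normal(size=200)})

kept = drop_high_vif(X)
print(kept)
```

The returned list can then be passed as feature_cols to the training function.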
Output: 6.069680214474494e-15
[('Donut Estimator 1', -3.286933510054274e-17),
 ('Donut Area of cross section', -7.979727989493313e-17),
 ('Donut Area of circumference circle', 1.0581813203458523e-16),
 ('Donut area of central hole / Donut Area of circumscribed circle',
  -7.361732751176575e-17),
 ('Donut Estimator 2', 7.178841397205427e-16),
 ('Donut Estimator 3', 0.9999999999999999)]
Output: 9.255956005260917e-15
[('Donut Estimator 1', 2.39049709822129e-17),
 ('Donut Area of cross section', -1.249000902703301e-16),
 ('Donut Area of circumference circle', 1.5612511283791264e-16),
 ('Donut area of central hole / Donut Area of circumscribed circle',
  1.0000000000000004),
 ('Donut Estimator 2', -9.774082584956822e-16),
 ('Donut Estimator 3', 3.870601755773251e-17)]
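Conclusion: After treating outliers and multicollinearity, the test RMSE drops to the order of 1e-15 in both fits, an almost perfect fit on the held-out test set. In each reduced model a single feature carries a coefficient of about 1 while the others are close to zero, so the target is effectively reproduced by that one feature.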