View code at: https://github.com/apang782/vroom2

View my other projects at: https://apang782.github.io/


Introduction

See the first part of this project at https://apang782.github.io/vroom1/

Here, scraped data from vroom.com is used to build regression models that predict used car prices. A large number of candidate predictor variables are tested and examined for inclusion in our models.

A good understanding of how horsepower, fuel economy, and other features influence prices is invaluable to both buyers and sellers of used cars. We aim to choose the best model/methodology for predicting car prices and to determine the relative importance of our variables.

The dataset contains dummy variables, which indicate different types of vehicles by encoding each level of a categorical variable as its own column. The baselines of these dummy variables have been taken as the following:

To clarify, we indicate an electric vehicle, for example, by setting the electric column to 1 while the flex fuel and diesel columns remain 0. Likewise, to indicate that a vehicle is an SUV, all columns dealing with body type are zero, because our ‘base’ vehicle is understood to be an SUV. The categorical columns are therefore meant to capture the effect of differing from this baseline.
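As a hypothetical illustration (the column names follow the dataset’s convention, but the rows themselves are made up), a baseline gasoline SUV and an electric coupe would be encoded along these lines:

# Made-up rows illustrating the dummy encoding described above
example <- data.frame(
  body_Coupe     = c(0, 1),
  body_Sedan     = c(0, 0),
  fuel_Electric  = c(0, 1),
  fuel_Flex.Fuel = c(0, 0),
  row.names      = c("gasoline SUV (baseline)", "electric coupe")
)
example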


Conclusions

This project demonstrated the differences in both purpose and performance of various regression methods. We first see that the removal of outliers/high influence points does not necessarily improve model performance. From the stepwise and ridge regression models, we see that having more predictors does lead to lower training error but can cause overfitting. Furthermore, the similarities between our OLS model and the LASSO regression model are noted, as well as LASSO’s improved performance.

At the same time, the power of ensemble (aggregate) techniques is seen in their smaller MSEs, which come at the cost of a simple, explainable structure.

From multiple simple-structure models, we see torque, mileage, and horsepower emerge as the most important factors in determining car price.

Random forests provided the best predictive power in this analysis – giving predictions that are off by around $2400.

For future consideration, new samples or bootstrapping (taking multiple samples with replacement during modeling) can yield more accurate and reliable results for our nonensemble methods. This can overcome the possible internal validation bias present in our models and help prevent overfitting.


Reading in Data

library(car)
# Read in the scraped data exported by the earlier Python code
# (the CSV file name here is assumed for illustration)
cars <- read.csv("cars.csv")
set.seed(1)
train.row <- sample(1:nrow(cars),nrow(cars)*(3/4),replace=FALSE)
train <- cars[train.row,]
test <- cars[-train.row,]
#dim(train) # 651 35
#dim(test) # 218 35

The data is read in from a CSV file exported by the earlier Python code. A 75/25 split is made, leaving three quarters of the observations for model training and a quarter for testing. The dimensions of each set are noted.


Linear Regression Modeling – Variable Selection

## 
## Call:
## lm(formula = price ~ ., data = train20)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11843.8  -2793.0   -357.1   2023.6  20650.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.187e+04  6.182e+03  -5.156 3.37e-07 ***
## miles          -1.917e-01  1.431e-02 -13.390  < 2e-16 ***
## torque          6.891e+01  4.200e+00  16.405  < 2e-16 ***
## width           8.383e+01  4.040e+01   2.075 0.038390 *  
## gclear          3.608e+02  1.733e+02   2.082 0.037702 *  
## wheelb         -2.323e+02  5.263e+01  -4.415 1.19e-05 ***
## rtrackw         9.554e+02  1.339e+02   7.136 2.63e-12 ***
## body_Coupe      1.156e+04  4.459e+03   2.592 0.009752 ** 
## body_Wagon     -2.805e+03  6.207e+02  -4.520 7.39e-06 ***
## interior_Brown  4.055e+03  1.090e+03   3.720 0.000217 ***
## drive_4X4       3.703e+03  8.047e+02   4.602 5.05e-06 ***
## drive_AWD       2.719e+03  4.329e+02   6.280 6.28e-10 ***
## drive_RWD       2.689e+03  7.522e+02   3.574 0.000378 ***
## fuel_Flex.Fuel -6.997e+03  1.211e+03  -5.775 1.20e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4377 on 637 degrees of freedom
## Multiple R-squared:  0.7268, Adjusted R-squared:  0.7212 
## F-statistic: 130.4 on 13 and 637 DF,  p-value: < 2.2e-16
##          miles         torque          width         gclear         wheelb 
##       1.113863       3.032960       2.142638       1.633543       4.771267 
##        rtrackw     body_Coupe     body_Wagon interior_Brown      drive_4X4 
##       3.343053       1.036174       1.111891       1.026759       1.415737 
##      drive_AWD      drive_RWD fuel_Flex.Fuel 
##       1.273944       1.914480       1.268408

To avoid excessively long output, I have condensed the variable selection process in the code above. I first modeled using all 34 predictors and found that fuel_Diesel and body_Pickup.Truck are aliased (linearly dependent on other predictors). I removed these variables and refit the model.

With no aliased variables remaining, I inspected the VIFs (variance inflation factors) of each variable. Predictors with high VIFs exhibit high multicollinearity, that is, strong correlation with other predictors. The predictor with the highest VIF is removed and the remaining variables are used to refit the model. This process is repeated until no variable has a VIF greater than 10.

Following this, the summary() output of each model is checked. The Pr(>|t|) column in particular is used to judge whether the relationship between each predictor and the response could be due to chance. This column gives the p-value; for our chosen alpha of 0.05, a p-value smaller than 0.05 suggests the relationship is not due to chance.

Using this, the variable with the greatest p-value is dropped and the model refit, repeatedly, until every remaining predictor is significant at the 0.05 level.

Finally, the VIFs are inspected one last time. I decided to drop one more variable in the interest of a more parsimonious model. The resulting model has predictors with VIFs less than 5. A sketch of this selection loop is shown below.
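The condensed selection code was suppressed above; the following is a minimal sketch of the procedure just described, assuming the full training data in train. The intermediate object name work is illustrative, and the final manual drop mentioned above is omitted.

# Drop the two aliased predictors found in the initial fit
work <- train[, setdiff(names(train), c("fuel_Diesel", "body_Pickup.Truck"))]

# Step 1: repeatedly drop the predictor with the largest VIF until all VIFs < 10
repeat {
  fit  <- lm(price ~ ., data = work)
  vifs <- vif(fit)
  if (max(vifs) < 10) break
  work <- work[, names(work) != names(which.max(vifs))]
}

# Step 2: repeatedly drop the least significant predictor until all p-values < 0.05
repeat {
  fit   <- lm(price ~ ., data = work)
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # exclude the intercept
  if (max(pvals) < 0.05) break
  work <- work[, names(work) != names(which.max(pvals))]
}

train20 <- work                           # reduced training data
model20 <- lm(price ~ ., data = train20)  # final OLS model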

I now inspect whether this model satisfies the conditions for a linear regression model:


Diagnostic Plots of Linear Model

par(mfrow = c(2,2))
plot(model20)

par(mfrow = c(1,1))
hist(resid(model20))

shapiro.test(resid(model20))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model20)
## W = 0.96761, p-value = 8.383e-11

The above diagnostic plots give insight to the model’s behavior.

We can see from the residuals vs fitted plot that the model is satisfactory for the linearity and equal variance conditions: there is no clear pattern, and the residuals appear randomly scattered around the zero line. There is some suggestion of outliers, as well as slight clustering to the left.

Independence of the errors is assumed in this model, as the dates and locations of the observations are unknown; we cannot establish whether there is a temporal or spatial relation in our data.

The normal QQ plot shows that our errors are not completely normally distributed: the right tail deviates significantly from the diagonal line. This is further supported by the histogram of residuals and the result of the Shapiro-Wilk test. The histogram does not show a normal distribution, and the very small p-value leads us to reject the null hypothesis of normally distributed residuals.

To remedy these problems, we will use the Box Cox method to appropriately transform the response variable.


Transforming Linear Model & Inspection

library(MASS)
boxcox(model20)

tmodel <- lm(log(price)~.,train20)
par(mfrow = c(2,2))
plot(tmodel)

par(mfrow = c(1,1))
hist(resid(tmodel))

shapiro.test(resid(tmodel)) #actually normal!
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(tmodel)
## W = 0.99552, p-value = 0.05717
summary(tmodel)
## 
## Call:
## lm(formula = log(price) ~ ., data = train20)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48696 -0.10913 -0.00343  0.09448  0.60214 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.649e+00  2.262e-01  33.813  < 2e-16 ***
## miles          -7.602e-06  5.238e-07 -14.514  < 2e-16 ***
## torque          2.587e-03  1.537e-04  16.833  < 2e-16 ***
## width           3.498e-03  1.478e-03   2.366 0.018261 *  
## gclear          1.812e-02  6.341e-03   2.857 0.004415 ** 
## wheelb         -5.937e-03  1.926e-03  -3.082 0.002142 ** 
## rtrackw         3.497e-02  4.900e-03   7.138 2.60e-12 ***
## body_Coupe      4.366e-01  1.632e-01   2.676 0.007642 ** 
## body_Wagon     -9.484e-02  2.272e-02  -4.175 3.39e-05 ***
## interior_Brown  1.475e-01  3.989e-02   3.699 0.000235 ***
## drive_4X4       9.801e-02  2.945e-02   3.329 0.000923 ***
## drive_AWD       9.570e-02  1.584e-02   6.041 2.61e-09 ***
## drive_RWD       6.527e-02  2.753e-02   2.371 0.018029 *  
## fuel_Flex.Fuel -3.165e-01  4.433e-02  -7.139 2.57e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1602 on 637 degrees of freedom
## Multiple R-squared:  0.7471, Adjusted R-squared:  0.742 
## F-statistic: 144.8 on 13 and 637 DF,  p-value: < 2.2e-16

Based on the output of the Box Cox plot, we take lambda to be 0 – meaning a log transformation of the price response variable.

Using the same diagnostic methods, we see that the conditions for linear regression are now reasonably met, and the residuals appear much better behaved. Although the QQ plot and histogram still hint at mild departures from normality, the Shapiro-Wilk test (p = 0.057) fails to reject the null hypothesis of normally distributed residuals.


Outliers and Influential Points

#sum(hatvalues(tmodel)) # should match p = 14
#hatids <- which(abs(hatvalues(tmodel))>0.0645) #13 obs
#hatvalues(tmodel)[hatids]

#dfids <- which(dffits(tmodel)>0.30715) #11 obs
#dffits(tmodel)[dfids]

#sdrids <- which(abs(rstudent(tmodel))>3) #3 obs
#rstudent(tmodel)[sdrids]
#rstudent(tmodel)

#influenceIndexPlot(tmodel,id=TRUE)

cookids <- which(cooks.distance(tmodel)>1)
cookids1 <- which(cooks.distance(tmodel)>.5)
cooks.distance(tmodel)[cookids] #none!
## named numeric(0)
cooks.distance(tmodel)[cookids1] #also none
## named numeric(0)

This model contains 13 predictors, so p = 14 (counting the intercept). There are 651 observations in our training set, so n = 651. The cutoff for high leverage values is therefore 3p/n = 3(14)/651 = 0.0645, or, as a stricter alternative, 2p/n = 2(14)/651 = 0.0430.

The cutoff for DFFITS is calculated as 2*sqrt((p+1)/(n-p-1)) = 0.30715. DFFITS is another measure of influence, quantifying how strongly each observation affects its own fitted value.

The cutoffs used for Cook’s distance are 0.5 and 1, along with any values that clearly stand out. Cook’s distance is yet another method for identifying high influence observations. No observations are flagged as influential by this measure in our model.

The cutoff for studentized deleted residuals is 3. Observations 442, 500, and 544 have |t_i| > 3 and are considered outliers (observations outlying in the response).

Excessively long output has been shortened here. Observations 442, 500, and 544 will be removed due to their repeated appearance in these diagnostics, and the model’s performance will be tested again. It is important to note that observations should not be removed simply because they do not fit our preconceived notions of how the data should be modeled.


Testing Model with Removed Points

removed <- train20[-c(which(abs(rstudent(tmodel))>3)),]

rmodel <- lm(log(price)~.,removed)
summary(rmodel)
## 
## Call:
## lm(formula = log(price) ~ ., data = removed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48185 -0.10800 -0.00164  0.09662  0.43312 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.645e+00  2.232e-01  34.254  < 2e-16 ***
## miles          -7.958e-06  5.215e-07 -15.261  < 2e-16 ***
## torque          2.636e-03  1.514e-04  17.415  < 2e-16 ***
## width           4.131e-03  1.458e-03   2.834 0.004747 ** 
## gclear          1.927e-02  6.234e-03   3.091 0.002080 ** 
## wheelb         -6.476e-03  1.895e-03  -3.417 0.000673 ***
## rtrackw         3.505e-02  4.830e-03   7.256 1.17e-12 ***
## body_Coupe      4.354e-01  1.603e-01   2.716 0.006780 ** 
## body_Wagon     -9.485e-02  2.232e-02  -4.249 2.47e-05 ***
## interior_Brown  1.448e-01  3.919e-02   3.694 0.000240 ***
## drive_4X4       9.227e-02  2.896e-02   3.186 0.001515 ** 
## drive_AWD       9.295e-02  1.558e-02   5.965 4.06e-09 ***
## drive_RWD       6.281e-02  2.706e-02   2.321 0.020617 *  
## fuel_Flex.Fuel -3.198e-01  4.356e-02  -7.343 6.44e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1573 on 635 degrees of freedom
## Multiple R-squared:  0.7556, Adjusted R-squared:  0.7506 
## F-statistic:   151 on 13 and 635 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(rmodel)

par(mfrow = c(1,1))
hist(resid(rmodel))

shapiro.test(resid(rmodel))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(rmodel)
## W = 0.99612, p-value = 0.1106

We note that the p-value of the Shapiro-Wilk test and the adjusted R-squared both increased, but other aspects of the model remain largely the same or are even slightly worse. The variable drive_4X4 is now deemed less significant than before, and new influential points/outliers appear in the diagnostics. The performances of the models will be compared later.

We currently have two ‘complete’ models: the base transformed model without removed points, and the same model fitted with points removed.


Forwards and Backwards Stepwise Regression Using AIC

#forwards
mod0 = lm(price~1,train)
modup = lm(price~.,train)
#step(mod0,scope=list(lower=mod0,upper=modup))
fmod = lm(price ~ torque + miles + fuel_Flex.Fuel + hp + drive_AWD + 
    ftrackw + wheelb + rtrackw + drive_RWD + drive_4X4 + interior_Brown + 
    body_Coupe + width + body_Wagon + fuel_Gas.Electric.Hybrid + 
    tm_Manual + hmpg + fuel_Electric + cmpg + gclear + height + 
    body_Van.Minivan, data = train)

#backwards
#stepAIC(modup,direction="backward")
bmod = lm(price ~ miles + hp + torque + height + width + gclear + 
    wheelb + ftrackw + rtrackw + cmpg + hmpg + body_Coupe + body_Van.Minivan + 
    body_Wagon + interior_Brown + tm_Manual + drive_4X4 + drive_AWD + 
    drive_RWD + fuel_Electric + fuel_Flex.Fuel + fuel_Gas.Electric.Hybrid, 
    data = train)

summary(fmod)
## 
## Call:
## lm(formula = price ~ torque + miles + fuel_Flex.Fuel + hp + drive_AWD + 
##     ftrackw + wheelb + rtrackw + drive_RWD + drive_4X4 + interior_Brown + 
##     body_Coupe + width + body_Wagon + fuel_Gas.Electric.Hybrid + 
##     tm_Manual + hmpg + fuel_Electric + cmpg + gclear + height + 
##     body_Van.Minivan, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11712.4  -2579.1   -226.6   1924.8  18547.3 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -4.667e+04  8.062e+03  -5.789 1.12e-08 ***
## torque                    3.367e+01  6.189e+00   5.441 7.61e-08 ***
## miles                    -2.055e-01  1.348e-02 -15.244  < 2e-16 ***
## fuel_Flex.Fuel           -7.627e+03  1.555e+03  -4.905 1.19e-06 ***
## hp                        5.124e+01  7.155e+00   7.161 2.24e-12 ***
## drive_AWD                 4.002e+03  4.428e+02   9.036  < 2e-16 ***
## ftrackw                   3.382e+03  3.917e+02   8.634  < 2e-16 ***
## wheelb                   -2.804e+02  5.200e+01  -5.391 9.92e-08 ***
## rtrackw                  -1.928e+03  3.603e+02  -5.353 1.22e-07 ***
## drive_RWD                 4.235e+03  7.309e+02   5.794 1.09e-08 ***
## drive_4X4                 4.388e+03  7.627e+02   5.754 1.36e-08 ***
## interior_Brown            3.955e+03  9.944e+02   3.977 7.78e-05 ***
## body_Coupe                8.684e+03  4.074e+03   2.132 0.033405 *  
## width                     7.219e+01  3.950e+01   1.828 0.068096 .  
## body_Wagon               -1.696e+03  5.880e+02  -2.885 0.004051 ** 
## fuel_Gas.Electric.Hybrid  7.888e+03  4.539e+03   1.738 0.082696 .  
## tm_Manual                        NA         NA      NA       NA    
## hmpg                     -3.242e+02  1.203e+02  -2.695 0.007217 ** 
## fuel_Electric            -3.682e+04  7.729e+03  -4.764 2.36e-06 ***
## cmpg                      5.863e+02  1.346e+02   4.357 1.54e-05 ***
## gclear                    8.397e+02  2.460e+02   3.413 0.000684 ***
## height                   -3.210e+02  8.376e+01  -3.832 0.000140 ***
## body_Van.Minivan          4.226e+02  1.264e+03   0.334 0.738248    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3978 on 629 degrees of freedom
## Multiple R-squared:  0.7772, Adjusted R-squared:  0.7697 
## F-statistic: 104.5 on 21 and 629 DF,  p-value: < 2.2e-16

An alternative method of linearly modeling the used car data is applied here. Again, due to excessively long outputs, some output has been suppressed.

Here, I have used stepwise regression in both the forward and backward directions to select variables for modeling. Despite the suppressed output, both procedures arrive at the same model. These methods choose the model with the lowest AIC.

It should be noted that all of the variables from our first model are contained here.


Inspecting Stepwise Model

par(mfrow = c(2,2))
plot(fmod)

par(mfrow = c(1,1))
hist(resid(fmod))

shapiro.test(resid(fmod))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fmod)
## W = 0.97185, p-value = 7.378e-10
boxcox(fmod)

We note that this model does not satisfy the conditions of linear regression. The proper transformation of the response variable is determined to be logarithmic.

tfmod = lm(log(price) ~ torque + miles + fuel_Flex.Fuel + hp + drive_AWD + 
    ftrackw + wheelb + rtrackw + drive_RWD + drive_4X4 + interior_Brown + 
    body_Coupe + width + body_Wagon + fuel_Gas.Electric.Hybrid + 
    tm_Manual + hmpg + fuel_Electric + cmpg + gclear + height + 
    body_Van.Minivan, data = train)

par(mfrow = c(2,2))
plot(tfmod)

par(mfrow = c(1,1))
hist(resid(tfmod))

shapiro.test(resid(tfmod))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(tfmod)
## W = 0.99562, p-value = 0.06428

The model is viable after transformation and will be compared to the others later. We now have three models for comparison: the transformed base model, the transformed model with outliers removed, and the transformed stepwise model.


Ridge & LASSO Regression

Ridge and lasso regression are shrinkage methods: ridge shrinks coefficients towards zero, while lasso can shrink them exactly to zero. This can improve prediction accuracy by decreasing variance, while maintaining or improving how interpretable the model is.

Both use a tuning parameter – lambda. This is essentially a mathematical penalty on the size of the coefficients: larger lambdas shrink the coefficients more strongly, trading a small increase in bias for a reduction in variance, and in lasso regression a large enough lambda can eliminate variables outright.
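For reference, these are the standard penalized least squares criteria the two methods minimize (textbook definitions, not taken from this project’s code):

$$\hat{\beta}^{\,ridge} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2}$$

$$\hat{\beta}^{\,lasso} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

The absolute value penalty is what allows lasso to set coefficients exactly to zero, while the squared penalty only shrinks them toward zero.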


LASSO Regression

library(glmnet)
trainx <- data.matrix(subset(train,select=-price))
trainy <- train$price

testx <- data.matrix(subset(test,select=-price))
testy <- test$price


lambs <- seq(10,100000,10)
lasso_cv <- cv.glmnet(trainx,trainy,alpha=1,lambda=lambs,nfolds=10)
plot(lasso_cv)

#lasso_cv$lambda.min
#lasso_cv$lambda.1se
#choose lambda ~150

Ridge and lasso regression share the same glmnet function, with alpha = 1 indicating lasso regression is being performed.

The above plot displays the use of cross-validation, a resampling technique for estimating model performance. A range of potential lambda values is evaluated and plotted, giving a range of combinations of lambda, number of variables, and MSE to choose from. We use 10-fold CV to select our lambda.

Though the cross-validation error is lowest for lambda values of roughly 10 to 40, we opt for a lambda of around 150 in pursuit of a more streamlined model, at the expense of slightly higher MSE.

# Fit model on training set
lassomod <- glmnet(trainx,trainy,alpha=1,lambda=150)
coefficients(lassomod)
## 35 x 1 sparse Matrix of class "dgCMatrix"
##                                     s0
## (Intercept)              -3.382839e+04
## year                      .           
## miles                    -1.934094e-01
## owners                    1.971561e+02
## hp                        2.535760e+01
## torque                    4.530229e+01
## height                   -8.220024e+00
## length                    .           
## width                     1.437925e+01
## gclear                    2.612911e+02
## wheelb                   -8.778639e+01
## ftrackw                   8.333317e+02
## rtrackw                   .           
## cmpg                      .           
## hmpg                      4.114346e+00
## body_Coupe                7.311090e+03
## body_Hatchback            .           
## body_Pickup.Truck         .           
## body_Sedan                .           
## body_Van.Minivan         -2.665587e+03
## body_Wagon               -1.662853e+03
## interior_Brown            3.652900e+03
## interior_Gray            -5.467646e+02
## interior_Red              4.282052e+03
## interior_Tan              1.918726e+02
## interior_White            .           
## tm_Manual                 .           
## tm_Other                  .           
## drive_4X4                 2.839562e+03
## drive_AWD                 2.604422e+03
## drive_RWD                 1.637347e+03
## fuel_Diesel               6.382511e+02
## fuel_Electric             .           
## fuel_Flex.Fuel           -5.715313e+03
## fuel_Gas.Electric.Hybrid  1.194834e+04

We see that the lasso regression has zeroed out some variables, effectively performing variable selection as intended. Its performance will be compared to the other models later. The LASSO model is expected to behave similarly to our ordinary least squares model while offering better predictive performance. It is interesting to note that with lambda set to zero, the result is essentially our OLS model; the larger lambda we have selected, on the other hand, introduces enough shrinkage to mimic a best subset selection of variables.


Ridge Regression

ridge_cv <- cv.glmnet(trainx,trainy,alpha=0,lambda=lambs,nfolds=10)
plot(ridge_cv)

#ridge_cv$lambda.min
#ridge_cv$lambda.1se
#choose lambda ~400

Alpha is set to zero in glmnet() here, indicating ridge regression. The same 10-fold cross-validation method is used to select the lambda parameter. Lambda is taken to be 400 to balance performance against the amount of shrinkage applied.

# Fit model on training set
ridgemod <- glmnet(trainx,trainy,alpha=0,lambda=400)
coefficients(ridgemod)
## 35 x 1 sparse Matrix of class "dgCMatrix"
##                                     s0
## (Intercept)              -3.785116e+05
## year                      1.667145e+02
## miles                    -1.936290e-01
## owners                    6.824892e+02
## hp                        3.417351e+01
## torque                    3.573632e+01
## height                   -1.648215e+02
## length                   -5.634409e+00
## width                     6.195586e+01
## gclear                    5.669821e+02
## wheelb                   -1.709349e+02
## ftrackw                   1.148981e+03
## rtrackw                   8.341717e+00
## cmpg                      2.952788e+01
## hmpg                      4.083597e+01
## body_Coupe                8.975116e+03
## body_Hatchback           -1.100707e+03
## body_Pickup.Truck         4.326550e+03
## body_Sedan               -7.997460e+02
## body_Van.Minivan         -2.008922e+03
## body_Wagon               -2.179818e+03
## interior_Brown            4.047337e+03
## interior_Gray            -8.818264e+02
## interior_Red              5.420834e+03
## interior_Tan              6.537399e+02
## interior_White            6.357403e+01
## tm_Manual                 .           
## tm_Other                  2.063045e+02
## drive_4X4                 3.808631e+03
## drive_AWD                 2.939703e+03
## drive_RWD                 3.286589e+03
## fuel_Diesel               4.401350e+03
## fuel_Electric            -4.333819e+03
## fuel_Flex.Fuel           -5.639641e+03
## fuel_Gas.Electric.Hybrid  1.627840e+04

We note that many more variables remain in the model here than in lasso regression. Apart from tm_Manual, no variables have been dropped, which is expected: ridge regression shrinks coefficients toward zero but does not eliminate them.

Now, there are a total of five models for comparison: the base model, the model with points removed, the stepwise model, the LASSO model, and the ridge model.


Evaluation and Comparison of Linear Models

##      model train.rmse test.rmse
## 1     base    4432.15   4787.44
## 2  removed    4432.47   4795.76
## 3 stepwise    3845.22   4709.17
## 4    lasso     4119.7    4750.1
## 5    ridge    3979.86   4990.03

The RMSE is used as our metric, as it gives the dollar amount by which our models’ predictions are off. The train column is the error in predicting our training observations; likewise, the test column is for the testing observations. The test column is of particular interest because the model has never ‘seen’ these observations, giving a good measure of real-world performance. More importantly, the difference between the two columns gives insight into the bias-variance tradeoff in our models.
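The code producing this table was not shown; below is a minimal sketch of how these values could be computed, assuming predictions from the log-price models are back-transformed with exp() before comparison against dollar prices (the rmse helper is illustrative):

# Sketch: dollar-scale RMSE for the transformed base model
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(train20$price, exp(predict(tmodel, newdata = train20)))  # train RMSE
rmse(test$price,    exp(predict(tmodel, newdata = test)))     # test RMSE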

In our base model, both the training and test RMSEs are high compared to the other models. Their similar values suggest that this model leans towards high bias and low variance, which is expected for an OLS linear model.

We also note that by removing observations, our base model’s test performance actually decreased slightly. This indicates that for future modeling and consideration, those observations should not be removed.

The stepwise model contains many more predictors (22) than the 13 used in our base model. From the larger gap between its train and test RMSEs, we can assume this model leans towards lower bias but higher variance, as expected of a more complex model.

The LASSO model also retains a large number of variables (23 with nonzero coefficients in the output above); this could easily be reduced by using a higher lambda value. Its train-test RMSE gap places it in a middle ground of model complexity, bias, and variance. In our case, however, LASSO performs slightly worse than the stepwise model on the test set.

The ridge regression model shows the same trend seen in the stepwise model, with a larger gap between train and test RMSEs. It also has the worst test performance, suggesting overfitting given the model’s complexity.

It should be noted that real relationships are rarely strictly linear, so there is almost certainly some underlying bias in all of these models. Overfitting may also be present, particularly in the models with many predictors, reinforcing internal validation bias.


Regression Tree

Decision trees are not as powerful as some of the other techniques used in this project, but they provide some of the best insight and interpretability available for our data. By their nature, fully grown trees have high variance and low bias.

#reg tree
set.seed(1)
library(rpart)
library(rpart.plot)
treefit <- rpart(price~.,train,method="anova")
rpart.plot(treefit)

printcp(treefit)
## 
## Regression tree:
## rpart(formula = price ~ ., data = train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] cmpg   gclear hp     miles  torque year  
## 
## Root node error: 4.4667e+10/651 = 68613217
## 
## n= 651 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.396966      0   1.00000 1.00528 0.060677
## 2  0.099293      1   0.60303 0.61783 0.040979
## 3  0.083448      2   0.50374 0.51378 0.034927
## 4  0.045916      3   0.42029 0.43338 0.031047
## 5  0.034816      4   0.37438 0.41175 0.030818
## 6  0.024121      5   0.33956 0.39731 0.029282
## 7  0.022902      6   0.31544 0.36725 0.027451
## 8  0.021085      7   0.29254 0.35308 0.027160
## 9  0.014736      8   0.27145 0.32813 0.023685
## 10 0.014483      9   0.25672 0.33537 0.023580
## 11 0.010000     10   0.24223 0.32346 0.022660
treefit$variable.importance
##           torque               hp             cmpg           wheelb 
##      25647379930      20380913309      15389528603      13098605849 
##          ftrackw          rtrackw             hmpg            miles 
##      12905608096      11461875961       5357679299       3928377954 
##           height            width           gclear             year 
##       3694248795       2695328655       2265313036       1264272899 
##           length   fuel_Flex.Fuel body_Van.Minivan        drive_4X4 
##        628634558        590840609        382308629         44432916
plotcp(treefit)

We see from the first split that the tree considers torque as the most important variable in determining car price. From the variable importance output, the most important variables are torque, followed by horsepower, city mpg, etc. The variables actually used are cmpg, gclear, hp, miles, torque, and year.

From the complexity parameter plot, there appear to be diminishing returns beyond roughly 8 splits, given the relative error and complexity parameter values from that point on.

We choose a complexity parameter of 0.021085 to prune the tree.

treefit2 <- rpart(price~.,train,method="anova", cp=0.021085)
rpart.plot(treefit2)

printcp(treefit2)
## 
## Regression tree:
## rpart(formula = price ~ ., data = train, method = "anova", cp = 0.021085)
## 
## Variables actually used in tree construction:
## [1] cmpg   gclear hp     miles  torque year  
## 
## Root node error: 4.4667e+10/651 = 68613217
## 
## n= 651 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.396966      0   1.00000 1.00319 0.060572
## 2 0.099293      1   0.60303 0.63264 0.042046
## 3 0.083448      2   0.50374 0.54396 0.036581
## 4 0.045916      3   0.42029 0.44682 0.032621
## 5 0.034816      4   0.37438 0.42565 0.030559
## 6 0.024121      5   0.33956 0.40461 0.028071
## 7 0.022902      6   0.31544 0.40425 0.027944
## 8 0.021085      7   0.29254 0.38580 0.026812
## 9 0.021085      8   0.27145 0.37972 0.026845
treefit2$variable.importance
##           torque               hp             cmpg           wheelb 
##      25647379930      19741508457      14742606348      12856010003 
##          ftrackw          rtrackw             hmpg           height 
##      12451441652      10845978953       5357679299       3320947633 
##            miles            width           gclear             year 
##       3270158315       2695328655       2265313036       1264272899 
##   fuel_Flex.Fuel           length body_Van.Minivan        drive_4X4 
##        590840609        417063959        382308629         44432916
plotcp(treefit2)

The same variables used in the full tree are used here. We will compare these two trees later and determine whether pruning was beneficial.


Ensemble Methods

The following techniques combine multiple trees, similar to the one created above, to improve predictive performance.


Bagging

set.seed(1)
library(randomForest)
bagfit <- randomForest(price~.,data=train,mtry=34,importance=TRUE,ntree=500)
bagfit
## 
## Call:
##  randomForest(formula = price ~ ., data = train, mtry = 34, importance = TRUE,      ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 34
## 
##           Mean of squared residuals: 8016201
##                     % Var explained: 88.32
plot(bagfit)

importance(bagfit)
##                             %IncMSE IncNodePurity
## year                     17.9123464  6.756059e+08
## miles                    55.3559449  5.021713e+09
## owners                    4.2057567  7.040139e+07
## hp                       33.9238187  7.280087e+09
## torque                   47.7777638  1.980769e+10
## height                   16.8068401  7.139838e+08
## length                   18.8506606  7.637608e+08
## width                    23.2088188  1.135206e+09
## gclear                   36.1575120  2.347894e+09
## wheelb                   16.2294797  9.634534e+08
## ftrackw                  19.2507752  8.704711e+08
## rtrackw                  16.4799852  7.434917e+08
## cmpg                     18.6235640  1.057118e+09
## hmpg                     23.1393083  1.013470e+09
## body_Coupe                0.0000000  1.759771e+06
## body_Hatchback           -0.3258812  1.561620e+06
## body_Pickup.Truck         0.0000000  6.810292e+05
## body_Sedan                3.4988328  2.298938e+07
## body_Van.Minivan          2.6642150  5.910040e+06
## body_Wagon               -5.7019351  1.787759e+07
## interior_Brown            1.2025917  1.179719e+08
## interior_Gray             4.4562188  3.318091e+07
## interior_Red              0.4416363  7.683404e+07
## interior_Tan              1.3967274  5.157930e+07
## interior_White           -1.0010015  5.368745e+06
## tm_Manual                 0.0000000  0.000000e+00
## tm_Other                 12.6652126  5.373878e+08
## drive_4X4                 6.0575585  5.031143e+07
## drive_AWD                16.2863412  5.762523e+08
## drive_RWD                12.0861327  1.498167e+08
## fuel_Diesel               0.0000000  1.988815e+06
## fuel_Electric             2.5198618  7.085499e+06
## fuel_Flex.Fuel            6.4441273  6.758545e+07
## fuel_Gas.Electric.Hybrid  0.0000000  1.358304e+05

Decision trees by themselves suffer from high variance. Bagging (bootstrap aggregation) serves to lower this variance by repeatedly sampling with replacement from our training dataset, fitting a tree to each sample, and averaging the predictions.

In bagging, m = p: all p predictors are considered as candidates at each split.

The %IncMSE column in the above output gives insight into the importance of each predictor variable, with higher values indicating greater importance.


Random Forests

## 
## Call:
##  randomForest(formula = price ~ ., data = train, mtry = 34/3,      ntree = 500, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##           Mean of squared residuals: 7860924
##                     % Var explained: 88.54

##                             %IncMSE IncNodePurity
## year                     17.5326657  6.225325e+08
## miles                    40.1619497  3.527936e+09
## owners                    0.5872402  1.152989e+08
## hp                       26.1096867  8.228528e+09
## torque                   28.5347483  1.002491e+10
## height                   19.3707837  1.805935e+09
## length                   19.3317572  1.594883e+09
## width                    22.5085915  3.435809e+09
## gclear                   21.1417276  2.212356e+09
## wheelb                   16.2334243  2.294654e+09
## ftrackw                  16.6132525  2.425421e+09
## rtrackw                  14.4728485  2.932286e+09
## cmpg                     17.5083242  1.066403e+09
## hmpg                     19.3927308  1.689799e+09
## body_Coupe                0.0000000  6.098754e+06
## body_Hatchback            1.3639453  2.732913e+06
## body_Pickup.Truck         0.0000000  2.146812e+06
## body_Sedan                6.8404915  6.279676e+07
## body_Van.Minivan          4.7990773  1.658776e+08
## body_Wagon               -1.8548244  2.073265e+07
## interior_Brown            0.8077082  1.485756e+08
## interior_Gray             4.1622726  5.228649e+07
## interior_Red              2.2674901  1.275485e+08
## interior_Tan              1.5268487  5.518285e+07
## interior_White            1.0010015  5.134376e+06
## tm_Manual                 0.0000000  0.000000e+00
## tm_Other                  7.4752071  2.269132e+08
## drive_4X4                 5.0726324  8.271994e+07
## drive_AWD                13.3909365  5.067958e+08
## drive_RWD                 8.1246764  1.817168e+08
## fuel_Diesel               0.0000000  5.279934e+06
## fuel_Electric             3.3521605  1.185738e+07
## fuel_Flex.Fuel            8.4721014  1.568712e+08
## fuel_Gas.Electric.Hybrid  0.0000000  2.146776e+05

Random forests are very similar to bagged trees in methodology. The difference is that the trees are decorrelated by randomly sampling a subset of the predictors for consideration at each split, whereas bagging considers all predictors as candidates at every split. This differentiates the trees in the forest and gives other predictors a larger ‘voice’.

In random forests, m = p/3 is the usual choice for regression.
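Only the fitted forest’s output is shown above; a sketch consistent with the printed Call is given below (randomForest treats mtry = 34/3 as 11), together with the test RMSE calculation used in the later comparison. The object names are illustrative.

# Sketch matching the printed Call above
set.seed(1)
rffit  <- randomForest(price ~ ., data = train, mtry = 34/3, ntree = 500,
                       importance = TRUE)
rfpred <- predict(rffit, newdata = test)
sqrt(mean((test$price - rfpred)^2))  # test RMSE in dollars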


Boosting

Boosting also grows many trees, but it grows them in sequence: each new tree is fit using information from the trees grown before it (here, the residual errors of the current ensemble). Boosting also involves a shrinkage parameter that slows the rate at which the model learns.

library(gbm)
set.seed(1)

trees <- seq(500,10000,500)
shrinks <- seq(0.05,0.5,0.01)
depths <- seq(1,10,1)
trainerrors <- c()
testerrors <- c()

#for (t in trees) {
#  tempboost <- gbm(price~.,data=train,distribution="gaussian",n.trees=t)
#  boostftrp <- predict(tempboost, newdata = train, n.trees=t)
#  boostftrmse <- mean((train$price-boostftrp)^2) # test mse
#  boostftep <- predict(tempboost, newdata = test, n.trees=t)
#  boostftemse <- mean((test$price-boostftep)^2) # test mse
#  trainerrors <- rbind(trainerrors,boostftrmse)
#  testerrors <- rbind(testerrors,boostftemse)
#} #4000 trees 

#for (s in shrinks) {
#  tempboost <- gbm(price~.,data=train,distribution="gaussian",n.trees=4000,shrinkage = s)
#  boostftrp <- predict(tempboost, newdata = train, n.trees=4000)
#  boostftrmse <- mean((train$price-boostftrp)^2) # test mse
#  boostftep <- predict(tempboost, newdata = test, n.trees=4000)
#  boostftemse <- mean((test$price-boostftep)^2) # test mse
#  trainerrors <- rbind(trainerrors,boostftrmse)
#  testerrors <- rbind(testerrors,boostftemse)
#} #best shrinkage is 0.08

#for (d in depths) {
#  tempboost <- gbm(price~.,data=train,distribution="gaussian",n.trees=4000,shrinkage = 0.08,interaction.depth=d)
#  boostftrp <- predict(tempboost, newdata = train, n.trees=4000)
#  boostftrmse <- mean((train$price-boostftrp)^2) # test mse
#  boostftep <- predict(tempboost, newdata = test, n.trees=4000)
#  boostftemse <- mean((test$price-boostftep)^2) # test mse
#  trainerrors <- rbind(trainerrors,boostftrmse)
#  testerrors <- rbind(testerrors,boostftemse)
#} #best depth is 4

#testerrors
#max(depths[testerrors == min(testerrors)])
#trainerrors

boostfit <- gbm(price~.,data=train,distribution="gaussian",n.trees=4000,shrinkage=0.08,interaction.depth=4)

A grid search (the loops above, with output suppressed) is used to tune the boosting parameters against train and test error. We first select the number of trees, to prevent overfitting from growing too many. The shrinkage, which controls the learning rate, is selected next. Finally, the interaction depth is chosen, controlling the depth of each tree and hence the order of variable interactions.

From this selection, we find 4000 trees, a shrinkage of 0.08, and an interaction depth of 4 to be the best settings for our data.

The distribution is set to Gaussian here because we are performing regression; it would be set to Bernoulli for binary classification.
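As a possible refinement not used in the tuning loops above, gbm can also cross-validate the number of boosting iterations internally through its cv.folds argument; a brief sketch:

# Sketch: 5-fold CV inside gbm to pick the number of boosting iterations
cvboost   <- gbm(price ~ ., data = train, distribution = "gaussian",
                 n.trees = 4000, shrinkage = 0.08, interaction.depth = 4,
                 cv.folds = 5)
best.iter <- gbm.perf(cvboost, method = "cv")  # CV-optimal tree count
best.iter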


Evaluation and Comparison of Tree Based Models

##         model train.rmse test.rmse
## 1   reg. tree    4076.82   4699.55
## 2 pruned tree     4315.7   4627.74
## 3     bagging     1198.8   2661.51
## 4 rand forest    1354.62   2401.95
## 5    boosting      406.7   2538.01

The RMSEs of both the train and test sets are again used to measure model performance, giving the dollar amount by which our models’ predictions deviate.

Both decision trees offer better test performance than our linear models, suggesting that a nonlinear relationship is present. We note that the pruned tree has slightly better test performance, implying a better model despite its higher train RMSE.

The ensemble methods show greatly improved performance, with all of them achieving roughly half the test error of the nonensemble models. They have lower variance, with bagging and random forests accepting slightly higher bias as a tradeoff. These methods have the strongest predictive power, but low explainability due to their inherent complexity.

Bagged trees and our random forest perform similarly, owing to the similarities in their methodology. The random forest has the best test performance of all models here, implying that the importance of some features may be overstated in the other models.

The boosting model performed second best. Some overfitting seems to be present, as its train RMSE is by far the lowest of all models; more dedicated hyperparameter tuning could reduce this. Boosting is able to reduce both bias and variance, whereas bagging and random forests largely retain the bias of a single fully grown tree.

It should be noted that the nonensemble methods provide better explanatory power at the cost of predictive power, and the ensemble methods provide the reverse. Ensemble methods also require much more computation time and power.


All Model Results

The RMSEs of all models used in this analysis are shown here for comparison.

##          model train.rmse test.rmse
## 1         base    4432.15   4787.44
## 2      removed    4432.47   4795.76
## 3     stepwise    3845.22   4709.17
## 4        lasso     4119.7    4750.1
## 5        ridge    3979.86   4990.03
## 6    reg. tree    4076.82   4699.55
## 7  pruned tree     4315.7   4627.74
## 8      bagging     1198.8   2661.51
## 9  rand forest    1354.62   2401.95
## 10    boosting      406.7   2538.01