Automated Variable Selection

Automated Variable Selection

The Amex, Iowa housing data set build has been utilized to develop various iterative regression models to determine the mean sales price of a house based on numerous variables. The variables range correlated, continuous variables to categorical variables. In this installment, we continue building the model using raw categories and later, the dummy coding.  Automated variable selection procedures using various selection procedures will be conducted where we will view some of the categorical variables being incorporated into the model including the entire category full variable and not just a few of the automated selected categories.  Finally, a validation framework will be conducted whereby the automated variable selection will again be applied to compare the results of the training sample versus the test data.  Operational validation will also be carried out with the resulting models being compared to the full test data.  Finally, the best model will be provided along with the results and some potential next steps for consideration.

Part A: Dummy Coding of Categorical Variables

Part 1:  Categorical Automated Variable Selection

The categorical value, BldgType has been chosen for further analysis.  BldgType consists of five different types:

  • 1Fam: Single-family Detached
  • 2FmCon: Two-family Conversion; originally built as one-family dwelling
  • Duplx: Duplex
  • TwnhsE: Townhouse End Unit
  • TwnhsI: Townhouse Inside Unit

The frequency of each BldgType is as follows:

BldgType Frequency Percent Cumulative
Frequency
Cumulative
Percent
1Fam 2425 82.76 2425 82.76
2fmCon 62 2.12 2487 84.88
Duplex 109 3.72 2596 88.60
Twnhs 101 3.45 2697 92.05
TwnhsE 233 7.95 2930 100.00

As we can see from above, there are a number of observations in each category with the 2 Family Complex having the least with just 62 observations.  Due to this, all categories have been utilized.  If a category had few observations they could be combined into one but as described is not the case.  Based on the above, the means of each category are as follows:

Automated Variable Selection - Category Means

Automated Variable Selection – Category Means

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t|
Intercept 1 180677 2571.22135 70.27 <.0001
BldgCategory 1 91.31542 1619.89273 0.06 0.9550

 

0 1   1Fam 2   2fmCon 3  Duplex
Equation  $    180,677.00  $  180,677.00  $  180,677.00  $  180,677.00
2825.27 x Cat#                          –                 91.32              182.63              273.95
Predicted Value   $    180,677.00  $   180,768.32   $  180,859.63  $   180,950.95
Mean Value  $    135,934.06  $  184,812.04  $  125,581.71  $  139,808.94
Variance  $       44,742.94  $    (4,043.72)  $    55,277.92  $    41,142.01

As shown above, for category 1 – 1Fam, the predicted value is $180,677 + (1 x $91.32) = $180,768.32 however, as shown earlier the mean value is $184,812.04 for a difference of -$4,043.72.  Therefore, we know the model is wrong.

Similarly, for a 2famCon the equation from is $180,677 + (2 x $182.63) = $180,859.63 but the mean value is $125,581.71.  The model is supposed to predict the mean values which clearly, it is not so again, the model is wrong.

Throughout this iteration of the regression models, transformation of values using log have not been utilized so the results can be easily understood by all potential home buyers.  Using log can assist in dealing with skewed data but can also be confusing to understand.

A simple linear regression model would be:

 

 

 

 

Automated Variable Selection - Linear Regression SalePrice

Automated Variable Selection – Linear Regression SalePrice

Categorical or qualitative variables between the response variable and the regressor variable donot go through the mean of Y in each category as shown above.   As we can see above with the Predicted Value plot, the observations do go through the intercept of $180,677 but do not go through the rest of the values as outlined in the variance chart above.

Categorical Variable Conclusion

We have chosen the BldgType categorical variable as the variable to predict SalePrice. BldgType has five different types. Based on the mean values, a simple linear regression model was provided to determine SalePrice based on BldType.  Based on a comparison between the mean value and the expected value and the resulting variances, we know the model is incorrect.  However, due to the type of variable – BlgType categorical, we need to assign dummy codes to the variables in order to be able to interpret the variables.

Step 2:  Dummy Code

As qualitative variables have no natural scale of measurement, we need to assign a set of levels to them to account for the effect that the variable may have on the response through the use of indicator variables or often referred to as dummy variables (text, p. 260).

Using SAS, several dummy codes are assigned to the categorical variables in BldgType as follows:

If BldgType = ‘1Fam’ then BldgCategory1 = 1; else BldgCategory1 = 0;

If BldgType = ‘2FmCon’ then BldgCategory2 = 2; else BldgCategory2 = 0;

If BldgType = ‘Duplex’ then BldgCategory3 = 3; else BldgCategory3 = 0;

If BldgType = ‘TwnhsE’ then BldgCategory4 = 4; else BldgCategory4 = 0;

If BldgType = ‘TwnhsI’ then BldgCategory = 5; else BldgCategory5 = 0;

 While there are five BldType’s one variable is used as the basis for interpretation.  Thus, while we five variables, we base our model on four variables and have used BldgCategory5 as the basis of our interpretation.

The summary table is as follows:

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 3 6.412939E11 2.137646E11 34.65 <.0001
Error 2926 1.805124E13 6169256061
Corrected Total 2929 1.869254E13
Root MSE 78545 R-Square 0.0343
Dependent Mean 180796 Adj R-Sq 0.0333
Coeff Var 43.44376  

 

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 131996 6152.08881 21.46 <.0001 0
BldgCategory1 1 52816 6355.48738 8.31 <.0001 2.73653
BldgCategory2 0 0
BldgCategory3 1 2604.19332 3239.45942 0.80 0.4215 1.60663
BldgCategory4 1 15079 2005.08246 7.52 <.0001 2.23625

SalePrice = .  Therefore, the average sale price by BldgCategroy is as follows:

  • BldgCategory1: SalePrice = $131,996 + ($52,816 x 1) for Catgory1 = $184,812
  • BldgCategory2: SalePrice = $131,996 + $0 for Category2 = $131,996**
  • BldgCategory3: SalePrice = $131,996 + ($2,604.19 x 3) for Category3 = $139,808.57
  • BldgCategory4: SalePrice = $131,996 + ($15,079 x 4) for Category 4 = $192,312

**However, in BldgCategory2 SAS alerts us to an error that the combination of variables results in perfect multicollinearity where the variance is infinite and therefore, cannot solve for   As such, the variable is removed from the model which should have been done in the first place.  Based on this, BldgCategory2 will be eliminated in the analysis.

It should be noted, that a BldgCategory can only have one value and thus, we can determine the SalePrice and cannot have two BldgCategory for example, a house cannot be categorized as a 1Fam and a Duplex.

Evaluating the above model for each BldgCategory:

Bldg Category Equation Predicted Mean
1  1Fam 131,996 + 52,816  $  184,812  $  184,812
2  2fmCon Eliminated – multicollinearity         N/A  $  125,581
3  Duplex 131,996 + 0 + (3*2604.19)  $  139,808  $  139,808
4  TwnhsE 131,996 + 0 + 0 + (4*15,079)  $  192,312  $  192,312

The above reflection of the model if say, the BldgCategory is 1 (1Fam) the equation and the predicted value which is then compared to the mean value as detailed in Step 1.  As the summary table reflects, the predicted and mean values match this, each of them matches unlike in step 1, the line goes through the mean sales price.

As the model is predicting the mean values, we know that the model is correct.

Looking at the ODS generated results, we also gain insight into how the model is performing.

Automated Variable Selection - Building Category

Automated Variable Selection – Building Category

Reviewing the above plots, we can see that the residuals do not form a straight line but if a curve was drawn, would curve upwards to the third line of values.  The Q-Q Plot does not fit the line with heavy and light tails reflecting that there are more extreme values than would be expected if they were a normal distribution.  A few outliers are also seen.  The predicted values versus the sale price line do not fit the 45-degree line but Cook’s D scale looks good as the values are low.   However, we also have a low Adjusted R-Square value of 0.0333.

Task 3:  Hypothesis Test for each beta

Testing the null hypothesis that all of the slopes are zero, the null hypothesis is:

Ho:  ß1 =  ß3 =  ß4 = 0

Note that the above does not include as  ß2 it has been eliminated from the model as described above.

H1:    ß1 ≠ 0 or  ß3≠ 0  0  or  ß≠ 0 at least one mean differs from zero.

The p value for ß1, ß3, ß4   are all <.0001

Rejection of H0 implies that at least one of the regressors contributes significantly to the model.

As we can see in the summary table, the p value is <.0001 in each category except BuildingCategory1, BldgCategory3 which exceeds the .0001 threshold.

Task 4:  Dummy Coded Variables v2

A second categorical variable, ‘RoofStyle’ was selected.  Below are a list of the categories and frequencies of each:

RoofStyle Frequency Percent Cumulative
Frequency
Cumulative
Percent
Flat 20 0.68 20 0.68
Gable 2321 79.22 2341 79.90
Gambre 22 0.75 2363 80.65
Hip 551 18.81 2914 99.45
Mansar 11 0.38 2925 99.83
Shed 5 0.17 2930 100.00

With the frequencies of Flat, Gambre, Mansar and Shed being very low, only two categories were assigned values in SAS as follows:

  • If RoofStyle = ‘Gable’ then RoofStyleCategory1 = 1; else RoofStyleCategory1 = 0;
  • If RoofStyle = ‘Hip’ then RoofStyleCategory2 = 2; else RoofStyleCategory2 = 0;

Based on these two categories, the related statistical information is as follows:

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 1 1.227429E12 1.227429E12 205.78 <.0001
Error 2928 1.746511E13 5964859178
Corrected Total 2929 1.869254E13
Root MSE 77233 R-Square 0.0657
Dependent Mean 180796 Adj R-Sq 0.0653
Coeff Var 42.71802  

 

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 220753 3129.61887 70.54 <.0001 0
RoofStyleCategory1 1 -50441 3516.31434 -14.34 <.0001 1.00000

 

Automated Variable Selection - Roof style category

Automated Variable Selection – Roof style category

Residuals - Roof Style Category

Residuals – Roof style category

Fit Plot for Roof Style Category

Fit Plt for Roof Style Category

Dummy Code Conclusion

In summary, an analysis on the categorical value, BldgType was conducted using the SAS Proc Sort and Means.  However, because the predicted values did not go through the mean values, we know that dummy variables are required.  As such, a transformation of the BldgType was dummy coded so that the statistical information could be interpreted.  Additionally, a comparison of the dummy coded variables was compared to the predicted mean reflecting that the model was correct.  Hypothesis tests were provided for each of the betas reflecting the threshold of when to accept or reject the null hypothesis.  Finally, a second categorical variable, RoofStyle was selected and also dummy coded with two main categories being utilized ‘Gable’ and ‘Hip’ as they accounted for 98% of the observation on the types of roofs.

Part B:  Automated Variable Selection Procedures

Part 5:  Automated Variable selection

Utilizing a multitude of continuous predictor categorical variables that have been dummy coded, several automated variable selection criteria were utilized to various models as follows:

proc reg data = temp outest = stepwise_summary;

  • adjrsq: model SalePrice = BldgCategory1 BldgCategory2 BldgCategory3 BldgCategory4 BldgCategory5 RoofStyleCategory1 RoofStyleCategory2 LotFrontage LotArea YearBuilt YearRemodel BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF FirstFlrSF SecondFlrSF GrLivArea MasVnrArea LowQualFinSF BedroomAbvGr TotRmsAbvGrd WoodDeckSF GarageArea TotalPorchSF TotalFullBath TotalBath TotalHalfBath PoolArea MiscVal MoSold YrSold TotalFlrSF HouseAge / selection = adjrsq slentry=0.01 slstay=0.01 AIC VIF BIC MSE; output out=stepwise_out pred=yhat residual = resid ucl=ucl lcl=lcl cookd=cook  covratio=cov dffits=dfits press=prss;
  • cp: model SalePrice = BldgCategory1 …selection = cp slentry=0.01 slstay=0.01 …
  • rsquare: model SalePrice = BldgCategory1 … selection = rsquare slentry=0.01 …
  • Forward: model SalePrice = BldgCategory1 …selection = forward slentry=0.01 slstay=0.01 …
  • Backward: model SalePrice = BldgCategory1 … selection = backward slentry=0.01 slstay=0.01 …
  • stepwise: model SalePrice = BldgCategory1 … selection = stepwise slentry=0.01 slstay=0.01 …

2,930 observations were read, 2,420 were used and 510 had missing values.

Adjusted R-Square

The results for Adjusted R-Square:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 -861952 95368 -9.04 <.0001 0
BldgCategory1 1 26367 3400.45040 7.75 <.0001 3.02363
BldgCategory2 1 -5582.03220 2617.46948 -2.13 0.0331 1.77428
BldgCategory3 1 3773.79916 1402.94914 2.69 0.0072 2.27981
RoofStyleCategory1 1 -6051.65245 5661.14478 -1.07 0.2852 9.36908
RoofStyleCategory2 1 4053.24568 2971.01764 1.36 0.1726 9.62305
LotFrontage 1 -119.37021 44.25236 -2.70 0.0070 1.88815
LotArea 1 0.48446 0.14046 3.45 0.0006 1.43016
YearRemodel 1 445.58812 47.89292 9.30 <.0001 1.85349
BsmtFinSF1 1 13.18428 2.23974 5.89 <.0001 1.90176
TotalBsmtSF 1 22.86654 2.45547 9.31 <.0001 2.16019
GrLivArea 1 68.42510 3.22248 21.23 <.0001 4.68770
MasVnrArea 1 46.83335 5.16944 9.06 <.0001 1.53406
LowQualFinSF 1 -40.55798 15.95519 -2.54 0.0111 1.06490
BedroomAbvGr 1 -12010 1330.85568 -9.02 <.0001 2.11672
TotRmsAbvGrd 1 2881.62377 973.01912 2.96 0.0031 4.19309
WoodDeckSF 1 28.06468 6.80774 4.12 <.0001 1.19834
GarageArea 1 43.63291 4.65153 9.38 <.0001 1.88837
TotalPorchSF 1 27.03325 7.57427 3.57 0.0004 1.20222
TotalFullBath 1 7400.21598 1499.09748 4.94 <.0001 2.23151
PoolArea 1 -80.30235 21.58185 -3.72 0.0002 1.07930
MiscVal 1 -16.47373 1.53016 -10.77 <.0001 1.04987
HouseAge 1 -458.08384 36.28123 -12.63 <.0001 2.31941

Sale Price = $-861,952 + 26367 (BldgCategory1) – 5582.03 (BldgCategory2) + 3773.80 (BldgCategory3) -6051.65 (RoofStyleCategory1) + 4053.25 (RoofStyleCategory2) – 119.37 (LotFrontage) + .48446 (LotArea) + 445.58 (YearRemodel) + 13.18 (BsmtFinSF1) + 22.86 (BsmtFinSF1) + 68.42 (GrLivArea) + 46.83 (MasVnrArea) – 40.56 (LowQualFinSF) – 12010 (BedroomAbvGr) + 2881.62 (TotRmsAbvGrd) + 28.06 (WoodDeckSF) + 43.69 (GarageArea) + 27.03 (TotalPorchSF) + 7400.21 (TotalFullBath) – 80.30 (PoolArea) – 16.47 (MiscVal) – 458.08 (HouseAge

The summary tables is as follows:

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 22 1.352421E13 6.147368E11 449.71 <.0001
Error 2397 3.276607E12 1366961673
Corrected Total 2419 1.680082E13
Root MSE 36972 R-Square 0.8050
Dependent Mean 179662 Adj R-Sq 0.8032
Coeff Var 20.57885  
Automated Variable Selection - Adjusted R Square

Automated Variable Selection – Adjusted R Square

Mallow’s Cp

The results for Mallow’s Cp:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 -872421 94866 -9.20 <.0001 0
BldgCategory1 1 26470 3399.20678 7.79 <.0001 3.02124
BldgCategory2 1 -5447.83069 2614.53487 -2.08 0.0373 1.77020
BldgCategory3 1 3815.00460 1402.46123 2.72 0.0066 2.27809
RoofStyleCategory2 1 7020.32653 1059.69735 6.62 <.0001 1.22417
LotFrontage 1 -117.69740 44.22599 -2.66 0.0078 1.88579
LotArea 1 0.48918 0.14040 3.48 0.0005 1.42875
YearRemodel 1 447.74919 47.85166 9.36 <.0001 1.85019
BsmtFinSF1 1 13.18016 2.23981 5.88 <.0001 1.90175
TotalBsmtSF 1 22.80852 2.45494 9.29 <.0001 2.15913
GrLivArea 1 68.62406 3.21719 21.33 <.0001 4.67206
MasVnrArea 1 46.84375 5.16959 9.06 <.0001 1.53406
LowQualFinSF 1 -40.96187 15.95119 -2.57 0.0103 1.06430
BedroomAbvGr 1 -12037 1330.65771 -9.05 <.0001 2.11596
TotRmsAbvGrd 1 2862.99096 972.89193 2.94 0.0033 4.19174
WoodDeckSF 1 28.06673 6.80794 4.12 <.0001 1.19834
GarageArea 1 43.37188 4.64525 9.34 <.0001 1.88316
TotalPorchSF 1 26.96148 7.57420 3.56 0.0004 1.20213
TotalFullBath 1 7379.01541 1499.01089 4.92 <.0001 2.23112
PoolArea 1 -80.82871 21.57688 -3.75 0.0002 1.07874
MiscVal 1 -16.49217 1.53010 -10.78 <.0001 1.04974
HouseAge 1 -454.46869 36.12434 -12.58 <.0001 2.29926

Sale Price = $-872421 + 26470 (BldgCategory1) – 5447.83 (BldgCategory2) + 3815.00 (BldgCategory3) -7020.33 (RoofStyleCategory2) – 117.70 (LotFrontage) + .48918 (LotArea) + 447.75 (YearRemodel) + 13.18 (BsmtFinSF1) + 22.80 (TotalBsmtSF) + 68.62 (GrLivArea) + 46.84 (MasVnrArea) – 40.96 (LowQualFinSF) – 12037 (BedroomAbvGr) + 2862.99 (TotRmsAbvGrd) + 28.01 (WoodDeckSF) + 43.37 (GarageArea) + 26.97 (TotalPorchSF) + 7379.01 (TotalFullBath) – 80.83 (PoolArea) – 16.49 (MiscVal) – 454.47 (HouseAge)

The summary tables is as follows:

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 21 1.352265E13 6.439356E11 471.04 <.0001
Error 2398 3.278169E12 1367043028
Corrected Total 2419 1.680082E13
Root MSE 36974 R-Square 0.8049
Dependent Mean 179662 Adj R-Sq 0.8032
Coeff Var 20.57946  
Automated Variable Selection - Mallow's Cp

Automated Variable Selection – Mallow’s Cp

R-Square

The results for R-Square:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept B -219912 1167254 -0.19 0.8506 0
BldgCategory1 1 26253 3408.45137 7.70 <.0001 3.03341
BldgCategory2 1 -6028.14184 2695.47601 -2.24 0.0254 1.87884
BldgCategory3 1 3696.74638 1411.21543 2.62 0.0089 2.30336
BldgCategory4 0 0 . . . .
BldgCategory5 0 0 . . . .
RoofStyleCategory1 1 -5911.87612 5677.28515 -1.04 0.2978 9.40871
RoofStyleCategory2 1 4046.55598 2978.40550 1.36 0.1744 9.65673
LotFrontage 1 -122.91538 45.12727 -2.72 0.0065 1.96066
LotArea 1 0.47268 0.14148 3.34 0.0008 1.44882
YearBuilt 1 445.49962 38.75422 11.50 <.0001 2.63604
YearRemodel 1 447.66425 48.24002 9.28 <.0001 1.87769
BsmtFinSF1 B 34.38828 3.53030 9.74 <.0001 4.71783
BsmtFinSF2 B 21.21424 5.65386 3.75 0.0002 1.48947
BsmtUnfSF B 21.72594 3.28476 6.61 <.0001 3.75777
TotalBsmtSF 0 0 . . . .
FirstFlrSF B 42.46746 16.37491 2.59 0.0096 73.54948
SecondFlrSF B 38.81104 16.05077 2.42 0.0157 80.87806
GrLivArea B 27.69946 16.00018 1.73 0.0835 115.39580
MasVnrArea 1 46.56235 5.18135 8.99 <.0001 1.53887
LowQualFinSF 0 0 . . . .
BedroomAbvGr 1 -11894 1348.83867 -8.82 <.0001 2.17111
TotRmsAbvGrd 1 2896.26872 976.52027 2.97 0.0030 4.21710
WoodDeckSF 1 27.67704 6.86484 4.03 <.0001 1.21673
GarageArea 1 43.68793 4.66553 9.36 <.0001 1.89696
TotalPorchSF 1 26.79651 7.60402 3.52 0.0004 1.20990
TotalFullBath B 6282.23495 1970.20252 3.19 0.0014 3.84876
TotalBath B 1715.20474 1857.44817 0.92 0.3559 5.20762
TotalHalfBath 0 0 . . . .
PoolArea 1 -81.57447 21.71996 -3.76 0.0002 1.09155
MiscVal 1 -16.53651 1.53421 -10.78 <.0001 1.05389
MoSold 1 -35.18158 282.09733 -0.12 0.9008 1.04584
YrSold 1 -768.34423 580.40773 -1.32 0.1857 1.04540

Sale Price = $-219,912 + 26253 (BldgCategory1) – 6028.14 (BldgCategory2) + 3696.75 (BldgCategory3) -5911.87 (RoofStyleCategory1) + 4046.56 (RoofStyleCategory2) – 122.91 (LotFrontage) + .47268 (LotArea) + 445.50 (Year Built) + 447.66 (YearRemodel) + 34.39 (BsmtFinSF1) + 21.21 (BsmtFinSF2) + 21.72 (BsmtUnfSF) + 42.47 (FirstFlrSF) + 38.81 (SecondFlrSF) + 27.70 (GrLivArea) + 46.56 (MasVnrArea) – 11894 (BedroomAbvGr) + 2896.27 (TotRmsAbvGrd) + 27.68 (WoodDeckSF) + 43.69 (GarageArea) + 26.80 (TotalPorchSF) + 6282.23 (TotalFullBath) + 1715.20 (TotalBath) – 81.57 (PoolArea) – 16.54 (MiscVal) – 35.18 (MoSold) – 768.34 (YrSold)

The summary tables is as follows:

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 27 1.352622E13 5.009712E11 365.95 <.0001
Error 2392 3.274594E12 1368977299
Corrected Total 2419 1.680082E13
Root MSE 37000 R-Square 0.8051
Dependent Mean 179662 Adj R-Sq 0.8029
Coeff Var 20.59402  
Automated Variable Selection - R Square

Automated Variable Selection – R Square

Forward

The results for forward:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 -1782301 89312 -19.96 <.0001 0
BldgCategory1 1 26463 3400.23662 7.78 <.0001 3.02124
BldgCategory3 1 -3667.85178 1743.56514 -2.10 0.0355 1.77022
BldgCategory4 1 2868.62553 1052.15634 2.73 0.0064 2.27804
RoofStyleCategory2 1 7013.65762 1060.00829 6.62 <.0001 1.22414
LotFrontage 1 -118.22437 44.24125 -2.67 0.0076 1.88594
LotArea 1 0.49109 0.14045 3.50 0.0005 1.42891
YearBuilt 1 454.58072 36.31198 12.52 <.0001 2.31614
YearRemodel 1 446.29592 47.95821 9.31 <.0001 1.85731
BsmtFinSF1 1 13.14202 2.24037 5.87 <.0001 1.90156
TotalBsmtSF 1 22.82616 2.45576 9.29 <.0001 2.15925
GrLivArea 1 27.46345 15.98052 1.72 0.0858 115.20517
MasVnrArea 1 46.95673 5.17037 9.08 <.0001 1.53359
BedroomAbvGr 1 -12030 1331.04986 -9.04 <.0001 2.11592
TotRmsAbvGrd 1 2875.92816 973.25129 2.95 0.0032 4.19229
WoodDeckSF 1 28.07452 6.81001 4.12 <.0001 1.19834
GarageArea 1 43.41491 4.64724 9.34 <.0001 1.88363
TotalPorchSF 1 27.03065 7.57905 3.57 0.0004 1.20293
TotalFullBath 1 7347.02074 1500.33565 4.90 <.0001 2.23370
PoolArea 1 -80.00000 21.58215 -3.71 0.0002 1.07861
MiscVal 1 -16.51693 1.53057 -10.79 <.0001 1.04974
TotalFlrSF 1 41.17028 15.95513 2.58 0.0099 113.41659

SalePrice = $-1782301 + 26463 (BldgCategory1) – 3667.85 (BldgCategory3) + 2868 (BldgCategory4) + 7013.66 (RoofStyleCategory2) – 118.22 (LotFrontage) + .49 (LotArea) + 454.58 (YearBuilt) + 446.29 (YearRemodel) + 13.14 (BsmtFinSF1) + 22.83 (TotalBsmtSF) + 27.46 (GrLivArea) + 46.96 (MasVnrArea) – 12030 (BedroomAbvGr) + 2875.93 (TotRmsAbvGrd) + 28.07 (WoodDeckSF) + 43.41 (GarageArea) + 27.03 (TotalPorchSF) + 7347.02 (TotalFullBath) – 80 (PoolArea) – 16.52 (MiscVal) + 41.17 (TotalFlrSF)

The summary tables is as follows:

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 21 1.352065E13 6.438406E11 470.69 <.0001
Error 2398 3.280164E12 1367874731
Corrected Total 2419 1.680082E13
Root MSE 36985 R-Square 0.8048
Dependent Mean 179662 Adj R-Sq 0.8031
Coeff Var 20.58572  
Automated Variable Selection - Forward

Automated Variable Selection – Forward

Backward

The results for backward:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 -898361 94705 -9.49 <.0001 0
BldgCategory1 1 30720 2800.77084 10.97 <.0001 2.04445
BldgCategory3 1 4968.19798 1312.78379 3.78 0.0002 1.98960
RoofStyleCategory2 1 7004.04276 1072.60999 6.53 <.0001 1.25012
LotFrontage 1 -136.20619 44.75800 -3.04 0.0024 1.92517
LotArea 1 0.48205 0.14148 3.41 0.0007 1.44606
YearRemodel 1 459.55343 47.79496 9.62 <.0001 1.83983
BsmtFinSF1 1 35.81479 3.46619 10.33 <.0001 4.53972
BsmtFinSF2 1 23.04218 5.59400 4.12 <.0001 1.45543
BsmtUnfSF 1 22.49539 3.25792 6.90 <.0001 3.68985
FirstFlrSF 1 69.32416 4.49442 15.42 <.0001 5.53062
SecondFlrSF 1 68.76492 3.26578 21.06 <.0001 3.34208
MasVnrArea 1 47.14530 5.17415 9.11 <.0001 1.53178
BedroomAbvGr 1 -12245 1340.48877 -9.14 <.0001 2.14039
TotRmsAbvGrd 1 2839.94970 956.51194 2.97 0.0030 4.03865
WoodDeckSF 1 28.84852 6.85068 4.21 <.0001 1.20950
GarageArea 1 43.27370 4.65750 9.29 <.0001 1.88697
TotalPorchSF 1 28.71710 7.57459 3.79 0.0002 1.19835
TotalFullBath 1 7117.04288 1512.48962 4.71 <.0001 2.26406
PoolArea 1 -78.44319 21.62607 -3.63 0.0003 1.08015
MiscVal 1 -16.57944 1.53399 -10.81 <.0001 1.05167
HouseAge 1 -446.97428 36.03808 -12.40 <.0001 2.28088

SalePrice = $-898,361 + 30720 BldgCategory1 + 4968.2 BldgCategory3 + 7004.04 RoofStyleCategory2 – 136.21 LotFrontage + .48205 LotArea + 459.55 YearRemodel + 35.81 BsmtFinSF1 + 23.04218 BsmtFinSF2 + 22.49 BsmtUnfSF + 69.32 FirstFlrSF + 68.76 SecondFlrsf + 47.14 MasVnrArea -12245 BedroomAbvGr + 2839.95 TotRmsAbvGrd + 28.85 WoodDeckSF + 43.273 GarageArea + 28.72 TotalPorchSF + 7117.04 TotalFullBath – 78.44 PoolArea -16.58 MiscVal – 446.97 HouseAge.

The summary tables is as follows:

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 21 1.351199E13 6.434281E11 469.15 <.0001
Error 2398 3.288825E12 1371486672
Corrected Total 2419 1.680082E13
Root MSE 37034 R-Square 0.8042
Dependent Mean 179662 Adj R-Sq 0.8025
Coeff Var 20.61288  
Automated Variable Selection - Backward

Automated Variable Selection – Backward

Stepwise

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 -1791258 88579 -20.22 <.0001 0
BldgCategory1 1 30720 2788.97315 11.01 <.0001 2.02766
BldgCategory3 1 4993.99705 1312.10462 3.81 0.0001 1.98793
RoofStyleCategory2 1 7025.30365 1060.92716 6.62 <.0001 1.22327
LotFrontage 1 -135.37629 43.61445 -3.10 0.0019 1.82841
LotArea 1 0.48669 0.14061 3.46 0.0005 1.42864
YearBuilt 1 446.51498 36.11812 12.36 <.0001 2.28588
YearRemodel 1 457.74856 47.74653 9.59 <.0001 1.83646
BsmtFinSF1 1 13.24029 2.23817 5.92 <.0001 1.89319
TotalBsmtSF 1 22.84988 2.45267 9.32 <.0001 2.14857
MasVnrArea 1 47.24907 5.17090 9.14 <.0001 1.53016
BedroomAbvGr 1 -12264 1328.27884 -9.23 <.0001 2.10198
TotRmsAbvGrd 1 2841.91110 953.94485 2.98 0.0029 4.01779
WoodDeckSF 1 28.94598 6.80815 4.25 <.0001 1.19476
GarageArea 1 43.34671 4.65270 9.32 <.0001 1.88345
TotalPorchSF 1 28.74800 7.56187 3.80 0.0001 1.19457
TotalFullBath 1 7112.83010 1493.88128 4.76 <.0001 2.20913
PoolArea 1 -77.40687 21.58338 -3.59 0.0003 1.07610
MiscVal 1 -16.59398 1.53219 -10.83 <.0001 1.04940
TotalFlrSF 1 68.87176 3.18078 21.65 <.0001 4.49660

The results for Stepwise:

SalePrice = $-1,791,258 + 30,720 (BldgCategory1) + 4993.997 (BldgCategory3)+7,025.3 (RoofStyleCategory2) – 135.38 (Lot Frontage) + .48669 (LotArea) + 446.51498 (YearBuilt) + 457.74856 (YearRemodel) + 13.24029 (BsmtFinSF1) + 22.84988 (TotalBsmtSF) + 47.24907 (MasVnrArea) -12264 (BedroomAbvGr) + 2841.91110 (TotRmsAbvGrd) + 28.94598 (WoodDeckSF) + 43.34671 (GarageArea) + 28.748 (TotalPorchSF) + 7112.83 (TotalFullBath) -77.40687 (PoolArea) – 16.59398 (MiscVal) + 68.87176 (TotalFlrSF)

The summary tables is as follows:

Root MSE 37030 R-Square 0.8041
Dependent Mean 179662 Adj R-Sq 0.8026
Coeff Var 20.61087  
Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 19 1.350989E13 7.110469E11 518.55 <.0001
Error 2400 3.290925E12 1371218818
Corrected Total 2419 1.680082E13
Automated Variable Selection - Stepwise

Automated Variable Selection – Stepwise

An overall summary of each model using the automated variable selection is as follows:

Model Root MSE Cp R-Square Adj R-Square F Value AIC BIC Parameters
Adjusted R-Square 36,972 20.57885 0.8050 0.8032 449.71 50934.936 50937.528 23
Mallow’s Cp 36,974 20.57946 0.8049 0.8032 471.04 50928.816 50931.282 22
R-Squared 37,000 20.59402 0.8051 0.8029 365.95 50938.175 50940.830 28
Forward 36,985 20.58572 0.8048 0.8031 470.69 50930.288 50932.727 22
Backward 37,034 20.61288 0.8042 0.8025 469.15 50936.669 50938.992 22
Stepwise 37,030 20.61087 0.8041 0.8026 518.55 50934.214 50936.482 20

Based on the above summary, each technique created a different model based on the variables selected.  As reflected above, the Adjusted R-Square and Mallow’s Cp are the most similar with an identical Adjusted R-Square and very similar Root MSE and Cp.  The Forward model is seemed to be in the middle of the Adjusted R-Square model and Mallow’s Cp with BIC AIC values within the range of the other two models.

The number of parameters used in each of the models also varied significantly with the minimum of 20 used in the Stepwise model and the maximum of 28 in the R-Squared model.  Choosing the best regression model often considers the number of regressions to be as accurate as possible but also needs to consider the costs of attaining the data and maintaining the model thus, balancing it with fewer regressors as possible due to the variance of the prediction  increases as regressors increase (p. 328).  Additionally, we can see in the above summary table that the F Value of the R-Squared model is lower than the Adjusted R-Square and Mallow’s Cp model.  The other consideration is that we know that R-Squared increases when additional predictors are added to a model thus, a model with more terms may appear to have a better fit because of the increased terms.

Based on the above while considering the number of parameters, I think the Adjusted R-Square is the best model.

Sale Price = $-861,952 + 26367 (BldgCategory1) – 5582.03 (BldgCategory2) + 3773.80 (BldgCategory3) -6051.65 (RoofStyleCategory1) + 4053.25 (RoofStyleCategory2) – 119.37 (LotFrontage) + .48446 (LotArea) + 445.58 (YearRemodel) + 13.18 (BsmtFinSF1) + 22.86 (BsmtFinSF1) + 68.42 (GrLivArea) + 46.83 (MasVnrArea) – 40.56 (LowQualFinSF) – 12010 (BedroomAbvGr) + 2881.62 (TotRmsAbvGrd) + 28.06 (WoodDeckSF) + 43.69 (GarageArea) + 27.03 (TotalPorchSF) + 7400.21 (TotalFullBath) – 80.30 (PoolArea) – 16.47 (MiscVal) – 458.08 (HouseAge)

Automated Variable Selection Conclusion – Full Data Set

Based on the above, each modelling technique created a different regression model all resulting in different resulting statistics.  While some of the Goodness of Fit statistics had minimal differences between the model’s others were more significant.  I believe the R-Square is the best model due to the highest R-Square adjusted value, R-Square, F Value and AIC despite having more variables over the other models.  This is based on the logic that the variables and able to be acquired easily with no a very low cost and no obstacles exist in attaining the data.

Step 6: Multicollinarity on Automated Variable Selection

Dummy coded variables were both selected in the automated selection process – BldgCategory# and RoofStyleCategory#.  While RoofStyleCategory only has two items and both have been included in the chosen model but in the BldgCategory only three of the variables have been automatically selected.  However, the other two BldgCategory4 and BldgCategory5.  This is due to perfect multicollinarity where the variance inflation factor is infinite and thus, cannot invert the matrix and solve for beta.  Thus, SAS informs us that the model is not full rank and that the estimate is biased which is reflected with a 0 or B in the Degrees of Freedom column.

Thus, despite adding this category to the model, the additional variables (BldgCategory4 and BldgCategory5) do not appear.

Evaluating the continuous variables to determine if they are statistically significant, we have several variables where the p-value is larger than 0.0001 and thus, cannot conclude that a significant difference exists. These variables are as follows:

  • LotFrontage: p=.007
  • LowQualFinSF: p = .0111

Below 0.0001, significant. Over 0.0001, not significant.

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 24 1.468993E13 6.120806E11 464.09 <.0001
Error 2879 3.797091E12 1318892283
Corrected Total 2903 1.848703E13

 

Root MSE 36317 R-Square 0.7946
Dependent Mean 180389 Adj R-Sq 0.7929
Coeff Var 20.13231  

 

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t|
Intercept B 283643 1048796 0.27 0.7868
BldgCategory1 1 23476 3055.76203 7.68 <.0001
BldgCategory2 1 -6663.69163 2446.63474 -2.72 0.0065
BldgCategory3 1 4172.51115 1303.12090 3.20 0.0014
BldgCategory4 0 0 . . .
BldgCategory5 0 0 . . .
LotArea 1 0.26521 0.09511 2.79 0.0053
YearBuilt B 438.12761 35.59630 12.31 <.0001
YearRemodel 1 477.79154 44.05346 10.85 <.0001
BsmtFinSF1 B 38.52489 3.13014 12.31 <.0001
BsmtFinSF2 B 26.20953 4.89464 5.35 <.0001
BsmtUnfSF B 24.25396 2.91452 8.32 <.0001
FirstFlrSF B 45.16961 15.16688 2.98 0.0029
SecondFlrSF B 42.68354 14.94773 2.86 0.0043
GrLivArea 1 22.52528 14.92543 1.51 0.1314
MasVnrArea 1 47.32331 4.45495 10.62 <.0001
BedroomAbvGr 1 -12206 1211.08737 -10.08 <.0001
TotRmsAbvGrd 1 3591.76057 869.75024 4.13 <.0001
WoodDeckSF 1 22.29958 5.85747 3.81 0.0001
GarageArea 1 40.60470 4.21044 9.64 <.0001
TotalPorchSF 1 23.28952 6.88993 3.38 0.0007
TotalFullBath B 6528.68535 1769.20532 3.69 0.0002
TotalBath B -560.77294 1664.61745 -0.34 0.7362
PoolArea 1 -56.20449 19.41995 -2.89 0.0038
MiscVal 1 -10.60894 1.20582 -8.80 <.0001
MoSold 1 95.39370 253.30278 0.38 0.7065
YrSold B -1044.79891 521.10521 -2.00 0.0451
TotalFlrSF 0 0 . . .

Sale Price = $283,643 + 26467 (BldgCategory1) – 6663.69 (BldgCategory2) + 4172.51 (BldgCategory3) +  .26521 (LotArea) + 438.12 (YearBuilt) + 477.79 (YearRemodel) + 38.52 (BsmtFinSF1) + 26.21 (BsmtFinSF2) + 24.25 (BsmtUnfSF) + 45.68 (SecondFlrSF) + 22.52 (GrLivArea) + 47.32 (MasVnrArea) – 12,206 (BedrromAbvGrd) + 3591.76 (TotRmsAbvGrd) + 22.29 (WoodDeckSF) + 40.60 (GarageArea) + 23.29 (TotalPorchSF) + 6528.68 (TotalFullBath) – 560.77 (TotalBath) – 56.20 (PoolArea) – 10.61 (MiscVal) + 95.39 (MoSold) – 1044.80 (YrSold)

The above is the final adjusted model with the three non-statistically significant variables removed but including all BldgCategories.   As shown, two of the BldgCategories are not reflected in the equation as they are perfectly multicollinarity. 

Part C:  Validation Framework

Cross validation of data is accomplished using a training/test split of the data.  Taking a step back, we will split up the data as follows:

  • Training/In-Sample: 70% of the data is used to train the model which you will see below at just over 2,000 observations
  • Test/Out-of-Sample: 30% of the data will be held back with 100% of the data being used for testing which you will see below at around 900 observations

We will then assess the differences in the goodness-of-fit statistics to see how they differ from prior models using 100% of the data and evaluate the differences.

Step 8:  Training Data

Using the automated variable selection, a summary of each of the techniques are provided below.  In each of the models, only the training data is used which consists of 1,672 observations instead of the full dataset which contained 2,420 observations but includes missing values.

Adjusted R-Square

Using the training observations on the data, the Adjusted R-Square model summary is as follows:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 -843974 120147 -7.02 <.0001 0
BldgCategory1 1 24100 4398.68558 5.48 <.0001 3.22678
BldgCategory2 1 -8479.20085 3341.40474 -2.54 0.0113 2.05397
BldgCategory3 1 2896.38863 1795.83228 1.61 0.1070 2.42093
RoofStyleCategory2 1 7129.59882 1371.01316 5.20 <.0001 1.27554
LotFrontage 1 -123.84500 56.34235 -2.20 0.0281 1.97656
LotArea 1 0.40106 0.15809 2.54 0.0113 1.40113
YearRemodel 1 430.77533 60.66339 7.10 <.0001 1.82482
BsmtFinSF1 1 8.54006 2.93508 2.91 0.0037 2.09251
TotalBsmtSF 1 20.89315 4.00546 5.22 <.0001 3.86713
FirstFlrSF 1 70.36493 5.55393 12.67 <.0001 5.61303
SecondFlrSF 1 64.55788 4.51295 14.31 <.0001 4.03614
MasVnrArea 1 47.48084 6.52991 7.27 <.0001 1.60962
LowQualFinSF 1 37.21592 20.54811 1.81 0.0703 1.06074
BedroomAbvGr 1 -12253 1711.38313 -7.16 <.0001 2.16414
TotRmsAbvGrd 1 3437.03017 1215.36878 2.83 0.0047 4.09604
WoodDeckSF 1 27.52888 8.75038 3.15 0.0017 1.21352
GarageArea 1 47.20537 5.94685 7.94 <.0001 1.93937
TotalPorchSF 1 27.81892 9.46298 2.94 0.0033 1.22996
TotalFullBath 1 10092 2034.89844 4.96 <.0001 2.65568
TotalHalfBath 1 3401.02263 2318.13769 1.47 0.1425 1.75272
PoolArea 1 -83.69329 27.47086 -3.05 0.0024 1.12358
MiscVal 1 -23.66412 1.96647 -12.03 <.0001 1.08594
HouseAge 1 -424.70078 48.81669 -8.70 <.0001 2.54090

Mallow’s Cp

Using the training observations data, the Mallow’s Cp model summary is as follows:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 -840908 120154 -7.00 <.0001 0
BldgCategory1 1 24294 4388.81865 5.54 <.0001 3.21114
BldgCategory2 1 -7687.49655 3245.97774 -2.37 0.0180 1.93762
BldgCategory3 1 2984.57200 1785.88782 1.67 0.0949 2.39332
RoofStyleCategory2 1 7211.79054 1352.13044 5.33 <.0001 1.24019
LotFrontage 1 -119.52279 55.36294 -2.16 0.0310 1.90774
LotArea 1 0.42285 0.15738 2.69 0.0073 1.38814
YearRemodel 1 431.13897 60.67388 7.11 <.0001 1.82478
BsmtFinSF1 1 9.55869 2.85710 3.35 0.0008 1.98208
TotalBsmtSF 1 22.38886 3.04707 7.35 <.0001 2.23712
GrLivArea 1 68.00277 3.96977 17.13 <.0001 4.65655
MasVnrArea 1 48.06803 6.51989 7.37 <.0001 1.60410
LowQualFinSF 1 -30.07548 20.49185 -1.47 0.1424 1.05455
BedroomAbvGr 1 -12345 1691.18699 -7.30 <.0001 2.11259
TotRmsAbvGrd 1 3366.84955 1214.59310 2.77 0.0056 4.08933
WoodDeckSF 1 28.40475 8.73491 3.25 0.0012 1.20879
GarageArea 1 47.13780 5.94162 7.93 <.0001 1.93526
TotalPorchSF 1 28.21510 9.44812 2.99 0.0029 1.22565
TotalFullBath 1 8971.49719 1892.49088 4.74 <.0001 2.29615
PoolArea 1 -84.39866 27.44621 -3.08 0.0021 1.12115
MiscVal 1 -23.49573 1.96161 -11.98 <.0001 1.08019
HouseAge 1 -445.57359 46.41973 -9.60 <.0001 2.29666
Automated Variable Selection - Validation Mallow's Cp

Automated Variable Selection – Validation Mallow’s Cp

R-Square

Using the training observations data, the R-Square model summary is as follows:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept B 271537 1484146 0.18 0.8549 0
BldgCategory1 1 23989 4406.21865 5.44 <.0001 3.23276
BldgCategory2 1 -8478.80283 3351.17430 -2.53 0.0115 2.06276
BldgCategory3 1 2859.88123 1798.57360 1.59 0.1120 2.42453
BldgCategory4 0 0 . . . .
BldgCategory5 0 0 . . . .
RoofStyleCategory1 1 -4257.98288 6825.03906 -0.62 0.5328 8.53205
RoofStyleCategory2 1 5053.32160 3602.34296 1.40 0.1609 8.79224
LotFrontage 1 -124.71032 56.41024 -2.21 0.0272 1.97822
LotArea 1 0.39253 0.15905 2.47 0.0137 1.41602
YearBuilt 1 425.87291 49.72948 8.56 <.0001 2.63086
YearRemodel 1 433.10734 61.39363 7.05 <.0001 1.86609
BsmtFinSF1 B 29.44503 4.34360 6.78 <.0001 4.57558
BsmtFinSF2 B 22.30864 7.29786 3.06 0.0023 1.42093
BsmtUnfSF B 20.74245 4.03077 5.15 <.0001 3.58348
TotalBsmtSF 0 0 . . . .
FirstFlrSF B 32.93950 20.96756 1.57 0.1164 79.87514
SecondFlrSF B 27.01162 20.57882 1.31 0.1895 83.79231
GrLivArea B 37.51693 20.58794 1.82 0.0686 125.09373
MasVnrArea 1 47.34455 6.53894 7.24 <.0001 1.61155
LowQualFinSF 0 0 . . . .
BedroomAbvGr 1 -12232 1713.67326 -7.14 <.0001 2.16654
TotRmsAbvGrd 1 3446.03263 1217.54125 2.83 0.0047 4.10426
WoodDeckSF 1 27.28586 8.80642 3.10 0.0020 1.22719
GarageArea 1 47.71718 5.97396 7.99 <.0001 1.95403
TotalPorchSF 1 27.57210 9.49428 2.90 0.0037 1.23617
TotalFullBath B 6748.88770 2463.70706 2.74 0.0062 3.88676
TotalBath B 3291.27701 2356.11755 1.40 0.1626 5.26565
TotalHalfBath 0 0 . . . .
PoolArea 1 -85.63831 27.60710 -3.10 0.0020 1.13297
MiscVal 1 -23.62538 1.96873 -12.00 <.0001 1.08673
MoSold 1 -275.81048 359.89128 -0.77 0.4436 1.04079
YrSold 1 -980.82714 739.07943 -1.33 0.1847 1.04896
Automated Variable Selection - Validation R Square

Automated Variable Selection – Validation R Square

Forward

Using the training observations data, the Forward model summary is as follows:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 -1731524 114861 -15.07 <.0001 0
BldgCategory1 1 24334 4390.34749 5.54 <.0001 3.21131
BldgCategory2 1 -7690.63757 3247.02389 -2.37 0.0180 1.93762
BldgCategory3 1 3002.36345 1786.38089 1.68 0.0930 2.39310
RoofStyleCategory2 1 7195.33895 1352.51100 5.32 <.0001 1.24009
LotFrontage 1 -120.09066 55.38180 -2.17 0.0303 1.90781
LotArea 1 0.42441 0.15744 2.70 0.0071 1.38827
YearBuilt 1 445.28878 46.67546 9.54 <.0001 2.31894
YearRemodel 1 429.39054 60.83216 7.06 <.0001 1.83314
BsmtFinSF1 1 9.50117 2.85775 3.32 0.0009 1.98170
TotalBsmtSF 1 22.37203 3.04882 7.34 <.0001 2.23825
GrLivArea 1 37.78666 20.55161 1.84 0.0662 124.72222
MasVnrArea 1 48.23994 6.52019 7.40 <.0001 1.60321
BedroomAbvGr 1 -12354 1691.80579 -7.30 <.0001 2.11277
TotRmsAbvGrd 1 3374.48124 1215.02275 2.78 0.0055 4.08958
WoodDeckSF 1 28.54544 8.73726 3.27 0.0011 1.20866
GarageArea 1 47.16336 5.94525 7.93 <.0001 1.93637
TotalPorchSF 1 28.42878 9.45734 3.01 0.0027 1.22725
TotalFullBath 1 8947.56166 1894.29789 4.72 <.0001 2.29905
PoolArea 1 -83.72517 27.45360 -3.05 0.0023 1.12104
MiscVal 1 -23.48982 1.96224 -11.97 <.0001 1.08019
TotalFlrSF 1 30.24018 20.49786 1.48 0.1403 123.30754
Automated Variable Selection - Validation Forward

Automated Variable Selection – Validation Forward

Backward

Using the training observations data, the Backward model summary is as follows:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 -869800 120710 -7.21 <.0001 0
BldgCategory1 1 18744 2991.54371 6.27 <.0001 1.47787
BldgCategory2 1 -8897.32941 2889.24938 -3.08 0.0021 1.52065
RoofStyleCategory2 1 7524.63928 1350.08448 5.57 <.0001 1.22477
YearRemodel 1 450.24030 60.83855 7.40 <.0001 1.81739
BsmtFinSF1 1 31.78586 3.34469 9.50 <.0001 2.69069
BsmtFinSF2 1 26.84265 6.69302 4.01 <.0001 1.18531
BsmtUnfSF 1 22.60594 3.04779 7.42 <.0001 2.03191
GrLivArea 1 74.71645 3.15531 23.68 <.0001 2.91406
MasVnrArea 1 47.22611 6.52973 7.23 <.0001 1.59376
BedroomAbvGr 1 -10932 1541.43217 -7.09 <.0001 1.73845
WoodDeckSF 1 29.99457 8.81104 3.40 0.0007 1.21835
GarageArea 1 46.28525 5.90316 7.84 <.0001 1.89226
TotalPorchSF 1 26.82969 9.46670 2.83 0.0047 1.21886
TotalFullBath 1 8968.61730 1926.90743 4.65 <.0001 2.35796
PoolArea 1 -98.53460 26.97263 -3.65 0.0003 1.07258
MiscVal 1 -23.02329 1.96539 -11.71 <.0001 1.07412
HouseAge 1 -450.02291 46.25954 -9.73 <.0001 2.25931
Automated Variable Selection - Validation Backward

Automated Variable Selection – Validation Backward

Stepwise

Using the training observations data, the Stepward model summary is as follows:

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 -1716883 113840 -15.08 <.0001 0
BldgCategory1 1 17936 2997.75814 5.98 <.0001 1.48649
BldgCategory2 1 -10877 2942.69344 -3.70 0.0002 1.58005
RoofStyleCategory2 1 7150.53026 1352.59512 5.29 <.0001 1.23138
YearBuilt 1 438.07997 46.34024 9.45 <.0001 2.26941
YearRemodel 1 430.51754 60.87042 7.07 <.0001 1.82232
BsmtFinSF1 1 8.89032 2.85083 3.12 0.0018 1.95801
TotalBsmtSF 1 23.22825 3.01450 7.71 <.0001 2.17251
MasVnrArea 1 46.77977 6.52778 7.17 <.0001 1.59546
BedroomAbvGr 1 -12773 1690.26994 -7.56 <.0001 2.09387
TotRmsAbvGrd 1 3298.05155 1203.11458 2.74 0.0062 3.98115
WoodDeckSF 1 30.70852 8.74739 3.51 0.0005 1.20281
GarageArea 1 46.23613 5.90182 7.83 <.0001 1.89455
TotalPorchSF 1 30.09617 9.44231 3.19 0.0015 1.21461
TotalFullBath 1 9240.18250 1897.63053 4.87 <.0001 2.29066
PoolArea 1 -86.41275 26.97373 -3.20 0.0014 1.07445
MiscVal 1 -23.36203 1.96651 -11.88 <.0001 1.07713
TotalFlrSF 1 68.21494 3.92390 17.38 <.0001 4.48636
Automated Variable Selection - Validation Stepwise

Automated Variable Selection – Validation Stepwise

 When comparing the training data set to the full data set, a number of Goodness-of-Fit statistics change including:

  • Root MSE training data increases from an average of 37,000 to 39,053 over the full data which is a 5.26% increase. While this is relatively small increase, we prefer a smaller Root MSE, using the training data decreases the fit of our model.
  • Cp also increased from an average value of 20.598 to 21.61 which is also a relatively small increase of 4.7% over the full data. Again, we prefer to have a smaller Cp value.
  • Small difference in the Adjusted R-Square moving from .802 to .79 in the training vs. full data set respectively which is a decrease of 1.07%. Unlike the Root MSE and Cp, we prefer a higher Adjusted R-Square value.
  • The F Value moved from an average of 458 in the full data to an average of 327 in the training set which is a 39.9% change. A larger F value is preferred and the change in the number of observations significantly changes the indications in the fit of our model.
  • AIC range in the full model had an average of 50,936 but declined to an average of 35,377 in the training sample which is a decline of almost 44%. This is expected due to the number of observations (n) being included in the formula and thus, as the number declines from the full data set to the training set, it is natural for the AIC volume to decline.
  • AIC equationAIC equation
  • BIC range in the full model had an average of 50,936.31 which also declined to an average of 35,380 in the training sample which is also a decline of 44% this is expected as the BIC formula is based on the number of observations (n) thus, as the number of observations decline from the training data from the full data, the BIC statistic would also decline.
  • BIC equation:

BIC equation

  • There was also a decrease in the number of parameters used with the average in the full model of 23.67 and only 21.33 in the training sample which is approximately a 10.9% decrease.

Overall, the training data versus the full dataset have some minor differences in the majority of the goodness-of-fit statistics.  This is expected as the training set data is not going to be as accurate as utilizing all of the all the data but as long as the jumps are not significantly different then we know the model is performing well.  As described above, the only significant jumps were in the AIC and BIC statistics which is a natural reflection of the number of observations.  Thus, our model is performing well.

Step 9:  Summary of Results and Comparisons

Model Adj R-Square F Value AIC BIC Parameters MSE MAE
Adjusted R-Square 0.7946 282.1          35,374.521           35,377.375 24    1,095,324,801.4    22,839.01
Mallow’s Cp 0.7946 308.73          35,372.950           35,375.580 22    1,090,304,849.9    22,757.98
R-Squared 0.7943 239.98          35,377.519           35,380.472 28    1,097,933,922.6    22,829.66
Forward 0.7929 377.41          35,378.720           35,380.870 18    1,136,636,015.7    23,222.32
Backward 0.7926 376.62          35,381.500           35,383.590 18    1,149,916,522.5    23,222.32
Stepwise 0.7929 377.41          35,378.720           35,380.870 18    1,136,636,015.7    23,079.73

Assigning The average squared error (MSE) and the average absolute error (MAE) for the test sample and the validation sample are as follows:

Model Obs train _TYPE_ _FREQ_ MSE_1 MAE_1
Adjusted R Square:  In-Sample 1 0 1 891      1,095,324,801.4            22,839.01
Adjusted R Square:  Out-Sample 2 1 1 2039      1,496,167,717.1            24,052.89
Cp:  In-Sample 1 0 1 891    1,090,304,849.9         22,757.98
Cp:  Out-Sample 2 1 1 2039    1,498,529,842.4         24,026.27
R-Square:  In-Sample 1 0 1 891    1,097,933,922.6         22,829.66
R-Square:  Out-Sample 2 1 1 2039    1,494,877,983.8         24,076.99
Forward:  In-Sample 1 0 1 891    1,136,636,015.7         23,079.73
Forward:  Out-Sample 2 1 1 2039    1,458,140,826.3         23,755.93
Backward:  In-Sample 1 0 1 891    1,149,916,522.5         23,222.32
Backward:  Out-Sample 2 1 1 2039    1,460,556,185.2         23,869.79
Stepwise:  In-Sample 1 0 1 891    1,136,636,015.7         23,079.73
Stepwise:  Out-Sample 2 1 1 2039    1,458,140,826.3         23,755.93

In all of the cases above, the in-sample data performs worse than the out-of-sample data which is no surprise as we would expect the training set to be a bit worse than the test data.  For the average squared error (MSE) the in-sample data was on average approx. 75.7% of the full data MSE.  Additionally, the average absolute error (MAE) on the in-sample data was on average 96% of the out-of-sample data.

Additionally, evaluating the Average Mean Square error the lowest in-sample model was the Cp model whereas the best MSE out of sample was the Forward model.  With the Average Absolute Error, the best in-model was again the Cp model but the best out-of-sample was again the Forward model.

Step 10:  Operational Validation

To change the format of “Prediction_Grade” in SAS we add in a format command as follows:

if yhat >= (SalePrice *0.9) and yhat <= (SalePrice *1.1) then

                Prediction_Grade=”Grade 1″;

                FORMAT Prediction_Grade $7.;

We can verify that it works in the output data variable overview:

From a business perspective, the models from both the training and test set predict house sale price within various levels of frequency and error as follows:

Train = 0 (training sample – 70%)  training set Train = 1 (All data – 100%) test set
Prediction_Grade Frequency Percent Cumulative
Frequency
Cumulative
Percent
Grade 1 430 48.26 430 48.26
Grade 2 162 18.18 592 66.44
Grade 3 299 33.56 891 100.00
Prediction_Grade Frequency Percent Cumulative
Frequency
Cumulative
Percent
Grade 1 1000 49.04 1000 49.04
Grade 2 335 16.43 1335 65.47
Grade 3 704 34.53 2039 100.00

Hence, based on the training set, 430 or 48.26% house sale price we are able to predict within plus/minus 10% error, 162 or 18.18% of the houses sale price we are able to predict between, but not equal to, 10% – 15% error and 299 houses or 33.56% of housing sale prices we are able to predict with more than 15% error.

Similarly, based on the test set, 1,000 or 49.04% house sale price we are able to predict within plus/minus 10% error, 335 or 16.43% of the houses sale price we are able to predict between 10%-15% error and 704 houses or 34.53% of housing sale prices we are able to predict with more than 15% error.

Thus, depending on the expected deliverables of the model, we will be able to determine if the model is complete or whether further reiterations are required.

Step 11: Final Model

Based on step 9, the Forward model is the best model.  However, we did incorporate categorical variables and thus, need to ensure all categorical variables including the RoofStyleCategory# are added back into the model and then applied to the test data.  The results are as follows:

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 26 1.47575E13 5.675961E11 437.85 <.0001
Error 2877 3.729527E12 1296325124
Corrected Total 2903 1.848703E13
Root MSE 36005 R-Square 0.7983
Dependent Mean 180389 Adj R-Sq 0.7964
Coeff Var 19.95933  

 

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept B 341417 1039835 0.33 0.7427 0
BldgCategory1 1 22972 3032.18797 7.58 <.0001 2.94959
BldgCategory2 0 0 . . . .
BldgCategory3 1 -3971.02347 1621.08445 -2.45 0.0144 1.91404
BldgCategory4 1 2936.69439 969.46734 3.03 0.0025 2.46653
BldgCategory5 0 0 . . . .
RoofStyleCategory1 1 -1623.80456 4884.52714 -0.33 0.7396 8.81513
RoofStyleCategory2 1 5975.39267 2560.67281 2.33 0.0197 8.98265
LotArea 1 0.27286 0.09438 2.89 0.0039 1.23545
YearBuilt 1 445.67105 35.41682 12.58 <.0001 2.56547
YearRemodel 1 486.79230 43.70459 11.14 <.0001 1.86135
BsmtFinSF1 1 37.29072 3.10915 11.99 <.0001 4.48675
BsmtFinSF2 1 24.83558 4.86118 5.11 <.0001 1.52529
BsmtUnfSF 1 24.03698 2.89111 8.31 <.0001 3.60834
FirstFlrSF B 40.82363 15.05372 2.71 0.0067 76.55732
SecondFlrSF B 42.24975 14.82388 2.85 0.0044 90.19210
GrLivArea 1 25.11865 14.80217 1.70 0.0898 124.91251
MasVnrArea 1 41.28416 4.49551 9.18 <.0001 1.45294
BedroomAbvGr 1 -12235 1201.70744 -10.18 <.0001 2.21617
TotRmsAbvGrd 1 3105.88521 865.17930 3.59 0.0003 4.14841
WoodDeckSF 1 22.67392 5.80752 3.90 <.0001 1.21089
GarageArea 1 40.36999 4.17657 9.67 <.0001 1.80983
TotalPorchSF 1 23.56863 6.83097 3.45 0.0006 1.21094
TotalFullBath 1 6582.18214 1754.82039 3.75 0.0002 3.91453
TotalBath 1 -567.87782 1650.52211 -0.34 0.7308 5.38145
PoolArea 1 -54.70574 19.26009 -2.84 0.0045 1.06203
MiscVal 1 -10.80135 1.19592 -9.03 <.0001 1.03643
MoSold 1 36.08454 251.33838 0.14 0.8859 1.04020
YrSold 1 -1087.09519 516.66708 -2.10 0.0355 1.03946
TotalFlrSF 0 0 . . . .

Sale Price = $341,417 + 22,972 (BldgCategory1) – 3,971.02 (BldgCategory3) + 2,936.69 (BldgCategory4) – 1,623.80 (RoofStyleCategory1) + 5,975.39 (RoofStyleCategory2) +.27286 (LotArea) + 445.67 (YearBuilt) + 486.79 (YearRemodel) + 37.29 (BsmtFinSF1) + 24.84 (BsmtFinSF2) + 24.04 (BsmtUnfSF) + 40.82 (FirstFlrSF) + 42.25 (SecondFlrSF) + 25.12 (GrLivArea) + 41.28 (MasVnrArea) – 12,235 (BedroomAbvGr) + 3,105.88 (TotRmsAbvGrd) + 22.67 (WoodDeckSF) + 40.37 (GarageArea) + 23.57 (TotalPorchSF) + 6,582.18 (TotalFullBath) – 567.88 (TotalBath) – 54.71 (PoolArea) – 10.80 (MiscVal) + 36.08 (MoSold) – 1,087.09 (YrSold)

Step 12:  Reflection/Conclusion

In conclusion, the housing data from Ames, has been utilized to develop numerous regression models to determine the mean sales price of a house based on numerous variables. The variables range from highly correlated, continuous variables to categorical variables with low correlations. In this installment, we continued building the model by assigning dummy codes to categorical variables and the use of automated variable selection procedures.  A validation framework was conducted and the automated variable selected variables were compared using the training sample versus the test data.  Operational validation based on the values were compared and analyzed.  The final and best model was also provided along with the results.

Understanding the goals of the model are key in determining if they should be employed and to what degree. Additionally, the use of these techniques is more of an art than a science. If the model already meets the needs of its purpose than perhaps these methods are not required. However, if the model is not appropriate than utilizing these techniques to transform the model to become more linear is appropriate.

Employing a process to help determine which observations are outliers and requiring an evaluation can to see if it is valid or if it is an error. If an observation is an error, then it should be removed. However, removing legitimate observations will distort the model and should not be removed. If done correctly, the transformation and deletion of outliers will benefit the model by creating a better, more appropriate model that is more linear. However, it does require additional processes and a certain finesse to determine the appropriate thresholds values. As easily observed from the above discussion, the choice of thresholds can drastically impact the number of observations it impacts and needs to be carefully selected and employed.

As reflected in this iteration in the development of the regression model based on the Ames housing data, we came up with the best model however, additional steps can still be taken. Outliers still exist and need to be further investigated to determine whether the observations are legitimate or not and should be discarded.  This process while potentially time consuming, could provide a lot of insight that may or may not be very valuable.  This additional insight may assist in uncovering other variables that may reduce the 20% of variability that is still not accounted for.   Violation of the assumptions of constant variance and normality as seen in the plots still exists.  Additionally, transformation on the data can be utilized but for simplicity and ease of understanding of potential home buyers has been left in its original dollar values.  However, transforming variables will assist in an increase in the goodness of fit and potentially creating a better fitting model.  Thus, if the intended audience is more sophisticated and well versed in interpreting log values, than I would recommend transforming the variables.  If not, and the current model meets the desired audience and requirements than an outlier investigation as detailed above should be undertaken.

Leave a Reply