Automated Variable Selection
Automated Variable Selection
The Amex, Iowa housing data set build has been utilized to develop various iterative regression models to determine the mean sales price of a house based on numerous variables. The variables range correlated, continuous variables to categorical variables. In this installment, we continue building the model using raw categories and later, the dummy coding. Automated variable selection procedures using various selection procedures will be conducted where we will view some of the categorical variables being incorporated into the model including the entire category full variable and not just a few of the automated selected categories. Finally, a validation framework will be conducted whereby the automated variable selection will again be applied to compare the results of the training sample versus the test data. Operational validation will also be carried out with the resulting models being compared to the full test data. Finally, the best model will be provided along with the results and some potential next steps for consideration.
Part A: Dummy Coding of Categorical Variables
Part 1: Categorical Automated Variable Selection
The categorical value, BldgType has been chosen for further analysis. BldgType consists of five different types:
 1Fam: Singlefamily Detached
 2FmCon: Twofamily Conversion; originally built as onefamily dwelling
 Duplx: Duplex
 TwnhsE: Townhouse End Unit
 TwnhsI: Townhouse Inside Unit
The frequency of each BldgType is as follows:
BldgType  Frequency  Percent  Cumulative Frequency 
Cumulative Percent 
1Fam  2425  82.76  2425  82.76 
2fmCon  62  2.12  2487  84.88 
Duplex  109  3.72  2596  88.60 
Twnhs  101  3.45  2697  92.05 
TwnhsE  233  7.95  2930  100.00 
As we can see from above, there are a number of observations in each category with the 2 Family Complex having the least with just 62 observations. Due to this, all categories have been utilized. If a category had few observations they could be combined into one but as described is not the case. Based on the above, the means of each category are as follows:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t 
Intercept  1  180677  2571.22135  70.27  <.0001 
BldgCategory  1  91.31542  1619.89273  0.06  0.9550 
0  1 1Fam  2 2fmCon  3 Duplex  
Equation  $ 180,677.00  $ 180,677.00  $ 180,677.00  $ 180,677.00 
2825.27 x Cat#  –  91.32  182.63  273.95 
Predicted Value  $ 180,677.00  $ 180,768.32  $ 180,859.63  $ 180,950.95 
Mean Value  $ 135,934.06  $ 184,812.04  $ 125,581.71  $ 139,808.94 
Variance  $ 44,742.94  $ (4,043.72)  $ 55,277.92  $ 41,142.01 
As shown above, for category 1 – 1Fam, the predicted value is $180,677 + (1 x $91.32) = $180,768.32 however, as shown earlier the mean value is $184,812.04 for a difference of $4,043.72. Therefore, we know the model is wrong.
Similarly, for a 2famCon the equation from is $180,677 + (2 x $182.63) = $180,859.63 but the mean value is $125,581.71. The model is supposed to predict the mean values which clearly, it is not so again, the model is wrong.
Throughout this iteration of the regression models, transformation of values using log have not been utilized so the results can be easily understood by all potential home buyers. Using log can assist in dealing with skewed data but can also be confusing to understand.
A simple linear regression model would be:
Categorical or qualitative variables between the response variable and the regressor variable donot go through the mean of Y in each category as shown above. As we can see above with the Predicted Value plot, the observations do go through the intercept of $180,677 but do not go through the rest of the values as outlined in the variance chart above.
Categorical Variable Conclusion
We have chosen the BldgType categorical variable as the variable to predict SalePrice. BldgType has five different types. Based on the mean values, a simple linear regression model was provided to determine SalePrice based on BldType. Based on a comparison between the mean value and the expected value and the resulting variances, we know the model is incorrect. However, due to the type of variable – BlgType categorical, we need to assign dummy codes to the variables in order to be able to interpret the variables.
Step 2: Dummy Code
As qualitative variables have no natural scale of measurement, we need to assign a set of levels to them to account for the effect that the variable may have on the response through the use of indicator variables or often referred to as dummy variables (text, p. 260).
Using SAS, several dummy codes are assigned to the categorical variables in BldgType as follows:
If BldgType = ‘1Fam’ then BldgCategory1 = 1; else BldgCategory1 = 0;
If BldgType = ‘2FmCon’ then BldgCategory2 = 2; else BldgCategory2 = 0;
If BldgType = ‘Duplex’ then BldgCategory3 = 3; else BldgCategory3 = 0;
If BldgType = ‘TwnhsE’ then BldgCategory4 = 4; else BldgCategory4 = 0;
If BldgType = ‘TwnhsI’ then BldgCategory = 5; else BldgCategory5 = 0;
While there are five BldType’s one variable is used as the basis for interpretation. Thus, while we five variables, we base our model on four variables and have used BldgCategory5 as the basis of our interpretation.
The summary table is as follows:


Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  131996  6152.08881  21.46  <.0001  0 
BldgCategory1  1  52816  6355.48738  8.31  <.0001  2.73653 
BldgCategory2  0  0  
BldgCategory3  1  2604.19332  3239.45942  0.80  0.4215  1.60663 
BldgCategory4  1  15079  2005.08246  7.52  <.0001  2.23625 
SalePrice = . Therefore, the average sale price by BldgCategroy is as follows:
 BldgCategory1: SalePrice = $131,996 + ($52,816 x 1) for Catgory1 = $184,812
 BldgCategory2: SalePrice = $131,996 + $0 for Category2 = $131,996**
 BldgCategory3: SalePrice = $131,996 + ($2,604.19 x 3) for Category3 = $139,808.57
 BldgCategory4: SalePrice = $131,996 + ($15,079 x 4) for Category 4 = $192,312
**However, in BldgCategory2 SAS alerts us to an error that the combination of variables results in perfect multicollinearity where the variance is infinite and therefore, cannot solve for As such, the variable is removed from the model which should have been done in the first place. Based on this, BldgCategory2 will be eliminated in the analysis.
It should be noted, that a BldgCategory can only have one value and thus, we can determine the SalePrice and cannot have two BldgCategory for example, a house cannot be categorized as a 1Fam and a Duplex.
Evaluating the above model for each BldgCategory:
Bldg Category  Equation  Predicted  Mean 
1 1Fam  131,996 + 52,816  $ 184,812  $ 184,812 
2 2fmCon  Eliminated – multicollinearity  N/A  $ 125,581 
3 Duplex  131,996 + 0 + (3*2604.19)  $ 139,808  $ 139,808 
4 TwnhsE  131,996 + 0 + 0 + (4*15,079)  $ 192,312  $ 192,312 
The above reflection of the model if say, the BldgCategory is 1 (1Fam) the equation and the predicted value which is then compared to the mean value as detailed in Step 1. As the summary table reflects, the predicted and mean values match this, each of them matches unlike in step 1, the line goes through the mean sales price.
As the model is predicting the mean values, we know that the model is correct.
Looking at the ODS generated results, we also gain insight into how the model is performing.
Reviewing the above plots, we can see that the residuals do not form a straight line but if a curve was drawn, would curve upwards to the third line of values. The QQ Plot does not fit the line with heavy and light tails reflecting that there are more extreme values than would be expected if they were a normal distribution. A few outliers are also seen. The predicted values versus the sale price line do not fit the 45degree line but Cook’s D scale looks good as the values are low. However, we also have a low Adjusted RSquare value of 0.0333.
Task 3: Hypothesis Test for each beta
Testing the null hypothesis that all of the slopes are zero, the null hypothesis is:
H_{o}: ß1 = ß3 = ß4 = 0
Note that the above does not include as ß2 it has been eliminated from the model as described above.
H_{1}: ß1 ≠ 0 or ß3≠ 0 0 or ß≠ 0 at least one mean differs from zero.
The p value for ß1, ß3, ß4 are all <.0001
Rejection of H_{0 }implies that at least one of the regressors contributes significantly to the model.
As we can see in the summary table, the p value is <.0001 in each category except BuildingCategory1, BldgCategory3 which exceeds the .0001 threshold.
Task 4: Dummy Coded Variables v2
A second categorical variable, ‘RoofStyle’ was selected. Below are a list of the categories and frequencies of each:
RoofStyle  Frequency  Percent  Cumulative Frequency 
Cumulative Percent 
Flat  20  0.68  20  0.68 
Gable  2321  79.22  2341  79.90 
Gambre  22  0.75  2363  80.65 
Hip  551  18.81  2914  99.45 
Mansar  11  0.38  2925  99.83 
Shed  5  0.17  2930  100.00 
With the frequencies of Flat, Gambre, Mansar and Shed being very low, only two categories were assigned values in SAS as follows:
 If RoofStyle = ‘Gable’ then RoofStyleCategory1 = 1; else RoofStyleCategory1 = 0;
 If RoofStyle = ‘Hip’ then RoofStyleCategory2 = 2; else RoofStyleCategory2 = 0;
Based on these two categories, the related statistical information is as follows:


Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  220753  3129.61887  70.54  <.0001  0 
RoofStyleCategory1  1  50441  3516.31434  14.34  <.0001  1.00000 
Dummy Code Conclusion
In summary, an analysis on the categorical value, BldgType was conducted using the SAS Proc Sort and Means. However, because the predicted values did not go through the mean values, we know that dummy variables are required. As such, a transformation of the BldgType was dummy coded so that the statistical information could be interpreted. Additionally, a comparison of the dummy coded variables was compared to the predicted mean reflecting that the model was correct. Hypothesis tests were provided for each of the betas reflecting the threshold of when to accept or reject the null hypothesis. Finally, a second categorical variable, RoofStyle was selected and also dummy coded with two main categories being utilized ‘Gable’ and ‘Hip’ as they accounted for 98% of the observation on the types of roofs.
Part B: Automated Variable Selection Procedures
Part 5: Automated Variable selection
Utilizing a multitude of continuous predictor categorical variables that have been dummy coded, several automated variable selection criteria were utilized to various models as follows:
proc reg data = temp outest = stepwise_summary;
 adjrsq: model SalePrice = BldgCategory1 BldgCategory2 BldgCategory3 BldgCategory4 BldgCategory5 RoofStyleCategory1 RoofStyleCategory2 LotFrontage LotArea YearBuilt YearRemodel BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF FirstFlrSF SecondFlrSF GrLivArea MasVnrArea LowQualFinSF BedroomAbvGr TotRmsAbvGrd WoodDeckSF GarageArea TotalPorchSF TotalFullBath TotalBath TotalHalfBath PoolArea MiscVal MoSold YrSold TotalFlrSF HouseAge / selection = adjrsq slentry=0.01 slstay=0.01 AIC VIF BIC MSE; output out=stepwise_out pred=yhat residual = resid ucl=ucl lcl=lcl cookd=cook covratio=cov dffits=dfits press=prss;
 cp: model SalePrice = BldgCategory1 …selection = cp slentry=0.01 slstay=0.01 …
 rsquare: model SalePrice = BldgCategory1 … selection = rsquare slentry=0.01 …
 Forward: model SalePrice = BldgCategory1 …selection = forward slentry=0.01 slstay=0.01 …
 Backward: model SalePrice = BldgCategory1 … selection = backward slentry=0.01 slstay=0.01 …
 stepwise: model SalePrice = BldgCategory1 … selection = stepwise slentry=0.01 slstay=0.01 …
2,930 observations were read, 2,420 were used and 510 had missing values.
Adjusted RSquare
The results for Adjusted RSquare:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  861952  95368  9.04  <.0001  0 
BldgCategory1  1  26367  3400.45040  7.75  <.0001  3.02363 
BldgCategory2  1  5582.03220  2617.46948  2.13  0.0331  1.77428 
BldgCategory3  1  3773.79916  1402.94914  2.69  0.0072  2.27981 
RoofStyleCategory1  1  6051.65245  5661.14478  1.07  0.2852  9.36908 
RoofStyleCategory2  1  4053.24568  2971.01764  1.36  0.1726  9.62305 
LotFrontage  1  119.37021  44.25236  2.70  0.0070  1.88815 
LotArea  1  0.48446  0.14046  3.45  0.0006  1.43016 
YearRemodel  1  445.58812  47.89292  9.30  <.0001  1.85349 
BsmtFinSF1  1  13.18428  2.23974  5.89  <.0001  1.90176 
TotalBsmtSF  1  22.86654  2.45547  9.31  <.0001  2.16019 
GrLivArea  1  68.42510  3.22248  21.23  <.0001  4.68770 
MasVnrArea  1  46.83335  5.16944  9.06  <.0001  1.53406 
LowQualFinSF  1  40.55798  15.95519  2.54  0.0111  1.06490 
BedroomAbvGr  1  12010  1330.85568  9.02  <.0001  2.11672 
TotRmsAbvGrd  1  2881.62377  973.01912  2.96  0.0031  4.19309 
WoodDeckSF  1  28.06468  6.80774  4.12  <.0001  1.19834 
GarageArea  1  43.63291  4.65153  9.38  <.0001  1.88837 
TotalPorchSF  1  27.03325  7.57427  3.57  0.0004  1.20222 
TotalFullBath  1  7400.21598  1499.09748  4.94  <.0001  2.23151 
PoolArea  1  80.30235  21.58185  3.72  0.0002  1.07930 
MiscVal  1  16.47373  1.53016  10.77  <.0001  1.04987 
HouseAge  1  458.08384  36.28123  12.63  <.0001  2.31941 
Sale Price = $861,952 + 26367 (BldgCategory1) – 5582.03 (BldgCategory2) + 3773.80 (BldgCategory3) 6051.65 (RoofStyleCategory1) + 4053.25 (RoofStyleCategory2) – 119.37 (LotFrontage) + .48446 (LotArea) + 445.58 (YearRemodel) + 13.18 (BsmtFinSF1) + 22.86 (BsmtFinSF1) + 68.42 (GrLivArea) + 46.83 (MasVnrArea) – 40.56 (LowQualFinSF) – 12010 (BedroomAbvGr) + 2881.62 (TotRmsAbvGrd) + 28.06 (WoodDeckSF) + 43.69 (GarageArea) + 27.03 (TotalPorchSF) + 7400.21 (TotalFullBath) – 80.30 (PoolArea) – 16.47 (MiscVal) – 458.08 (HouseAge
The summary tables is as follows:


Mallow’s Cp
The results for Mallow’s Cp:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  872421  94866  9.20  <.0001  0 
BldgCategory1  1  26470  3399.20678  7.79  <.0001  3.02124 
BldgCategory2  1  5447.83069  2614.53487  2.08  0.0373  1.77020 
BldgCategory3  1  3815.00460  1402.46123  2.72  0.0066  2.27809 
RoofStyleCategory2  1  7020.32653  1059.69735  6.62  <.0001  1.22417 
LotFrontage  1  117.69740  44.22599  2.66  0.0078  1.88579 
LotArea  1  0.48918  0.14040  3.48  0.0005  1.42875 
YearRemodel  1  447.74919  47.85166  9.36  <.0001  1.85019 
BsmtFinSF1  1  13.18016  2.23981  5.88  <.0001  1.90175 
TotalBsmtSF  1  22.80852  2.45494  9.29  <.0001  2.15913 
GrLivArea  1  68.62406  3.21719  21.33  <.0001  4.67206 
MasVnrArea  1  46.84375  5.16959  9.06  <.0001  1.53406 
LowQualFinSF  1  40.96187  15.95119  2.57  0.0103  1.06430 
BedroomAbvGr  1  12037  1330.65771  9.05  <.0001  2.11596 
TotRmsAbvGrd  1  2862.99096  972.89193  2.94  0.0033  4.19174 
WoodDeckSF  1  28.06673  6.80794  4.12  <.0001  1.19834 
GarageArea  1  43.37188  4.64525  9.34  <.0001  1.88316 
TotalPorchSF  1  26.96148  7.57420  3.56  0.0004  1.20213 
TotalFullBath  1  7379.01541  1499.01089  4.92  <.0001  2.23112 
PoolArea  1  80.82871  21.57688  3.75  0.0002  1.07874 
MiscVal  1  16.49217  1.53010  10.78  <.0001  1.04974 
HouseAge  1  454.46869  36.12434  12.58  <.0001  2.29926 
Sale Price = $872421 + 26470 (BldgCategory1) – 5447.83 (BldgCategory2) + 3815.00 (BldgCategory3) 7020.33 (RoofStyleCategory2) – 117.70 (LotFrontage) + .48918 (LotArea) + 447.75 (YearRemodel) + 13.18 (BsmtFinSF1) + 22.80 (TotalBsmtSF) + 68.62 (GrLivArea) + 46.84 (MasVnrArea) – 40.96 (LowQualFinSF) – 12037 (BedroomAbvGr) + 2862.99 (TotRmsAbvGrd) + 28.01 (WoodDeckSF) + 43.37 (GarageArea) + 26.97 (TotalPorchSF) + 7379.01 (TotalFullBath) – 80.83 (PoolArea) – 16.49 (MiscVal) – 454.47 (HouseAge)
The summary tables is as follows:


RSquare
The results for RSquare:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  B  219912  1167254  0.19  0.8506  0 
BldgCategory1  1  26253  3408.45137  7.70  <.0001  3.03341 
BldgCategory2  1  6028.14184  2695.47601  2.24  0.0254  1.87884 
BldgCategory3  1  3696.74638  1411.21543  2.62  0.0089  2.30336 
BldgCategory4  0  0  .  .  .  . 
BldgCategory5  0  0  .  .  .  . 
RoofStyleCategory1  1  5911.87612  5677.28515  1.04  0.2978  9.40871 
RoofStyleCategory2  1  4046.55598  2978.40550  1.36  0.1744  9.65673 
LotFrontage  1  122.91538  45.12727  2.72  0.0065  1.96066 
LotArea  1  0.47268  0.14148  3.34  0.0008  1.44882 
YearBuilt  1  445.49962  38.75422  11.50  <.0001  2.63604 
YearRemodel  1  447.66425  48.24002  9.28  <.0001  1.87769 
BsmtFinSF1  B  34.38828  3.53030  9.74  <.0001  4.71783 
BsmtFinSF2  B  21.21424  5.65386  3.75  0.0002  1.48947 
BsmtUnfSF  B  21.72594  3.28476  6.61  <.0001  3.75777 
TotalBsmtSF  0  0  .  .  .  . 
FirstFlrSF  B  42.46746  16.37491  2.59  0.0096  73.54948 
SecondFlrSF  B  38.81104  16.05077  2.42  0.0157  80.87806 
GrLivArea  B  27.69946  16.00018  1.73  0.0835  115.39580 
MasVnrArea  1  46.56235  5.18135  8.99  <.0001  1.53887 
LowQualFinSF  0  0  .  .  .  . 
BedroomAbvGr  1  11894  1348.83867  8.82  <.0001  2.17111 
TotRmsAbvGrd  1  2896.26872  976.52027  2.97  0.0030  4.21710 
WoodDeckSF  1  27.67704  6.86484  4.03  <.0001  1.21673 
GarageArea  1  43.68793  4.66553  9.36  <.0001  1.89696 
TotalPorchSF  1  26.79651  7.60402  3.52  0.0004  1.20990 
TotalFullBath  B  6282.23495  1970.20252  3.19  0.0014  3.84876 
TotalBath  B  1715.20474  1857.44817  0.92  0.3559  5.20762 
TotalHalfBath  0  0  .  .  .  . 
PoolArea  1  81.57447  21.71996  3.76  0.0002  1.09155 
MiscVal  1  16.53651  1.53421  10.78  <.0001  1.05389 
MoSold  1  35.18158  282.09733  0.12  0.9008  1.04584 
YrSold  1  768.34423  580.40773  1.32  0.1857  1.04540 
Sale Price = $219,912 + 26253 (BldgCategory1) – 6028.14 (BldgCategory2) + 3696.75 (BldgCategory3) 5911.87 (RoofStyleCategory1) + 4046.56 (RoofStyleCategory2) – 122.91 (LotFrontage) + .47268 (LotArea) + 445.50 (Year Built) + 447.66 (YearRemodel) + 34.39 (BsmtFinSF1) + 21.21 (BsmtFinSF2) + 21.72 (BsmtUnfSF) + 42.47 (FirstFlrSF) + 38.81 (SecondFlrSF) + 27.70 (GrLivArea) + 46.56 (MasVnrArea) – 11894 (BedroomAbvGr) + 2896.27 (TotRmsAbvGrd) + 27.68 (WoodDeckSF) + 43.69 (GarageArea) + 26.80 (TotalPorchSF) + 6282.23 (TotalFullBath) + 1715.20 (TotalBath) – 81.57 (PoolArea) – 16.54 (MiscVal) – 35.18 (MoSold) – 768.34 (YrSold)
The summary tables is as follows:


Forward
The results for forward:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  1782301  89312  19.96  <.0001  0 
BldgCategory1  1  26463  3400.23662  7.78  <.0001  3.02124 
BldgCategory3  1  3667.85178  1743.56514  2.10  0.0355  1.77022 
BldgCategory4  1  2868.62553  1052.15634  2.73  0.0064  2.27804 
RoofStyleCategory2  1  7013.65762  1060.00829  6.62  <.0001  1.22414 
LotFrontage  1  118.22437  44.24125  2.67  0.0076  1.88594 
LotArea  1  0.49109  0.14045  3.50  0.0005  1.42891 
YearBuilt  1  454.58072  36.31198  12.52  <.0001  2.31614 
YearRemodel  1  446.29592  47.95821  9.31  <.0001  1.85731 
BsmtFinSF1  1  13.14202  2.24037  5.87  <.0001  1.90156 
TotalBsmtSF  1  22.82616  2.45576  9.29  <.0001  2.15925 
GrLivArea  1  27.46345  15.98052  1.72  0.0858  115.20517 
MasVnrArea  1  46.95673  5.17037  9.08  <.0001  1.53359 
BedroomAbvGr  1  12030  1331.04986  9.04  <.0001  2.11592 
TotRmsAbvGrd  1  2875.92816  973.25129  2.95  0.0032  4.19229 
WoodDeckSF  1  28.07452  6.81001  4.12  <.0001  1.19834 
GarageArea  1  43.41491  4.64724  9.34  <.0001  1.88363 
TotalPorchSF  1  27.03065  7.57905  3.57  0.0004  1.20293 
TotalFullBath  1  7347.02074  1500.33565  4.90  <.0001  2.23370 
PoolArea  1  80.00000  21.58215  3.71  0.0002  1.07861 
MiscVal  1  16.51693  1.53057  10.79  <.0001  1.04974 
TotalFlrSF  1  41.17028  15.95513  2.58  0.0099  113.41659 
SalePrice = $1782301 + 26463 (BldgCategory1) – 3667.85 (BldgCategory3) + 2868 (BldgCategory4) + 7013.66 (RoofStyleCategory2) – 118.22 (LotFrontage) + .49 (LotArea) + 454.58 (YearBuilt) + 446.29 (YearRemodel) + 13.14 (BsmtFinSF1) + 22.83 (TotalBsmtSF) + 27.46 (GrLivArea) + 46.96 (MasVnrArea) – 12030 (BedroomAbvGr) + 2875.93 (TotRmsAbvGrd) + 28.07 (WoodDeckSF) + 43.41 (GarageArea) + 27.03 (TotalPorchSF) + 7347.02 (TotalFullBath) – 80 (PoolArea) – 16.52 (MiscVal) + 41.17 (TotalFlrSF)
The summary tables is as follows:


Backward
The results for backward:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  898361  94705  9.49  <.0001  0 
BldgCategory1  1  30720  2800.77084  10.97  <.0001  2.04445 
BldgCategory3  1  4968.19798  1312.78379  3.78  0.0002  1.98960 
RoofStyleCategory2  1  7004.04276  1072.60999  6.53  <.0001  1.25012 
LotFrontage  1  136.20619  44.75800  3.04  0.0024  1.92517 
LotArea  1  0.48205  0.14148  3.41  0.0007  1.44606 
YearRemodel  1  459.55343  47.79496  9.62  <.0001  1.83983 
BsmtFinSF1  1  35.81479  3.46619  10.33  <.0001  4.53972 
BsmtFinSF2  1  23.04218  5.59400  4.12  <.0001  1.45543 
BsmtUnfSF  1  22.49539  3.25792  6.90  <.0001  3.68985 
FirstFlrSF  1  69.32416  4.49442  15.42  <.0001  5.53062 
SecondFlrSF  1  68.76492  3.26578  21.06  <.0001  3.34208 
MasVnrArea  1  47.14530  5.17415  9.11  <.0001  1.53178 
BedroomAbvGr  1  12245  1340.48877  9.14  <.0001  2.14039 
TotRmsAbvGrd  1  2839.94970  956.51194  2.97  0.0030  4.03865 
WoodDeckSF  1  28.84852  6.85068  4.21  <.0001  1.20950 
GarageArea  1  43.27370  4.65750  9.29  <.0001  1.88697 
TotalPorchSF  1  28.71710  7.57459  3.79  0.0002  1.19835 
TotalFullBath  1  7117.04288  1512.48962  4.71  <.0001  2.26406 
PoolArea  1  78.44319  21.62607  3.63  0.0003  1.08015 
MiscVal  1  16.57944  1.53399  10.81  <.0001  1.05167 
HouseAge  1  446.97428  36.03808  12.40  <.0001  2.28088 
SalePrice = $898,361 + 30720 BldgCategory1 + 4968.2 BldgCategory3 + 7004.04 RoofStyleCategory2 – 136.21 LotFrontage + .48205 LotArea + 459.55 YearRemodel + 35.81 BsmtFinSF1 + 23.04218 BsmtFinSF2 + 22.49 BsmtUnfSF + 69.32 FirstFlrSF + 68.76 SecondFlrsf + 47.14 MasVnrArea 12245 BedroomAbvGr + 2839.95 TotRmsAbvGrd + 28.85 WoodDeckSF + 43.273 GarageArea + 28.72 TotalPorchSF + 7117.04 TotalFullBath – 78.44 PoolArea 16.58 MiscVal – 446.97 HouseAge.
The summary tables is as follows:


Stepwise
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  1791258  88579  20.22  <.0001  0 
BldgCategory1  1  30720  2788.97315  11.01  <.0001  2.02766 
BldgCategory3  1  4993.99705  1312.10462  3.81  0.0001  1.98793 
RoofStyleCategory2  1  7025.30365  1060.92716  6.62  <.0001  1.22327 
LotFrontage  1  135.37629  43.61445  3.10  0.0019  1.82841 
LotArea  1  0.48669  0.14061  3.46  0.0005  1.42864 
YearBuilt  1  446.51498  36.11812  12.36  <.0001  2.28588 
YearRemodel  1  457.74856  47.74653  9.59  <.0001  1.83646 
BsmtFinSF1  1  13.24029  2.23817  5.92  <.0001  1.89319 
TotalBsmtSF  1  22.84988  2.45267  9.32  <.0001  2.14857 
MasVnrArea  1  47.24907  5.17090  9.14  <.0001  1.53016 
BedroomAbvGr  1  12264  1328.27884  9.23  <.0001  2.10198 
TotRmsAbvGrd  1  2841.91110  953.94485  2.98  0.0029  4.01779 
WoodDeckSF  1  28.94598  6.80815  4.25  <.0001  1.19476 
GarageArea  1  43.34671  4.65270  9.32  <.0001  1.88345 
TotalPorchSF  1  28.74800  7.56187  3.80  0.0001  1.19457 
TotalFullBath  1  7112.83010  1493.88128  4.76  <.0001  2.20913 
PoolArea  1  77.40687  21.58338  3.59  0.0003  1.07610 
MiscVal  1  16.59398  1.53219  10.83  <.0001  1.04940 
TotalFlrSF  1  68.87176  3.18078  21.65  <.0001  4.49660 
The results for Stepwise:
SalePrice = $1,791,258 + 30,720 (BldgCategory1) + 4993.997 (BldgCategory3)+7,025.3 (RoofStyleCategory2) – 135.38 (Lot Frontage) + .48669 (LotArea) + 446.51498 (YearBuilt) + 457.74856 (YearRemodel) + 13.24029 (BsmtFinSF1) + 22.84988 (TotalBsmtSF) + 47.24907 (MasVnrArea) 12264 (BedroomAbvGr) + 2841.91110 (TotRmsAbvGrd) + 28.94598 (WoodDeckSF) + 43.34671 (GarageArea) + 28.748 (TotalPorchSF) + 7112.83 (TotalFullBath) 77.40687 (PoolArea) – 16.59398 (MiscVal) + 68.87176 (TotalFlrSF)
The summary tables is as follows:


An overall summary of each model using the automated variable selection is as follows:
Model  Root MSE  Cp  RSquare  Adj RSquare  F Value  AIC  BIC  Parameters 
Adjusted RSquare  36,972  20.57885  0.8050  0.8032  449.71  50934.936  50937.528  23 
Mallow’s Cp  36,974  20.57946  0.8049  0.8032  471.04  50928.816  50931.282  22 
RSquared  37,000  20.59402  0.8051  0.8029  365.95  50938.175  50940.830  28 
Forward  36,985  20.58572  0.8048  0.8031  470.69  50930.288  50932.727  22 
Backward  37,034  20.61288  0.8042  0.8025  469.15  50936.669  50938.992  22 
Stepwise  37,030  20.61087  0.8041  0.8026  518.55  50934.214  50936.482  20 
Based on the above summary, each technique created a different model based on the variables selected. As reflected above, the Adjusted RSquare and Mallow’s Cp are the most similar with an identical Adjusted RSquare and very similar Root MSE and Cp. The Forward model is seemed to be in the middle of the Adjusted RSquare model and Mallow’s Cp with BIC AIC values within the range of the other two models.
The number of parameters used in each of the models also varied significantly with the minimum of 20 used in the Stepwise model and the maximum of 28 in the RSquared model. Choosing the best regression model often considers the number of regressions to be as accurate as possible but also needs to consider the costs of attaining the data and maintaining the model thus, balancing it with fewer regressors as possible due to the variance of the prediction increases as regressors increase (p. 328). Additionally, we can see in the above summary table that the F Value of the RSquared model is lower than the Adjusted RSquare and Mallow’s Cp model. The other consideration is that we know that RSquared increases when additional predictors are added to a model thus, a model with more terms may appear to have a better fit because of the increased terms.
Based on the above while considering the number of parameters, I think the Adjusted RSquare is the best model.
Sale Price = $861,952 + 26367 (BldgCategory1) – 5582.03 (BldgCategory2) + 3773.80 (BldgCategory3) 6051.65 (RoofStyleCategory1) + 4053.25 (RoofStyleCategory2) – 119.37 (LotFrontage) + .48446 (LotArea) + 445.58 (YearRemodel) + 13.18 (BsmtFinSF1) + 22.86 (BsmtFinSF1) + 68.42 (GrLivArea) + 46.83 (MasVnrArea) – 40.56 (LowQualFinSF) – 12010 (BedroomAbvGr) + 2881.62 (TotRmsAbvGrd) + 28.06 (WoodDeckSF) + 43.69 (GarageArea) + 27.03 (TotalPorchSF) + 7400.21 (TotalFullBath) – 80.30 (PoolArea) – 16.47 (MiscVal) – 458.08 (HouseAge)
Automated Variable Selection Conclusion – Full Data Set
Based on the above, each modelling technique created a different regression model all resulting in different resulting statistics. While some of the Goodness of Fit statistics had minimal differences between the model’s others were more significant. I believe the RSquare is the best model due to the highest RSquare adjusted value, RSquare, F Value and AIC despite having more variables over the other models. This is based on the logic that the variables and able to be acquired easily with no a very low cost and no obstacles exist in attaining the data.
Step 6: Multicollinarity on Automated Variable Selection
Dummy coded variables were both selected in the automated selection process – BldgCategory# and RoofStyleCategory#. While RoofStyleCategory only has two items and both have been included in the chosen model but in the BldgCategory only three of the variables have been automatically selected. However, the other two BldgCategory4 and BldgCategory5. This is due to perfect multicollinarity where the variance inflation factor is infinite and thus, cannot invert the matrix and solve for beta. Thus, SAS informs us that the model is not full rank and that the estimate is biased which is reflected with a 0 or B in the Degrees of Freedom column.
Thus, despite adding this category to the model, the additional variables (BldgCategory4 and BldgCategory5) do not appear.
Evaluating the continuous variables to determine if they are statistically significant, we have several variables where the pvalue is larger than 0.0001 and thus, cannot conclude that a significant difference exists. These variables are as follows:
 LotFrontage: p=.007
 LowQualFinSF: p = .0111
Below 0.0001, significant. Over 0.0001, not significant.
Analysis of Variance  
Source  DF  Sum of Squares 
Mean Square 
F Value  Pr > F 
Model  24  1.468993E13  6.120806E11  464.09  <.0001 
Error  2879  3.797091E12  1318892283  
Corrected Total  2903  1.848703E13 
Root MSE  36317  RSquare  0.7946 
Dependent Mean  180389  Adj RSq  0.7929 
Coeff Var  20.13231 
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t 
Intercept  B  283643  1048796  0.27  0.7868 
BldgCategory1  1  23476  3055.76203  7.68  <.0001 
BldgCategory2  1  6663.69163  2446.63474  2.72  0.0065 
BldgCategory3  1  4172.51115  1303.12090  3.20  0.0014 
BldgCategory4  0  0  .  .  . 
BldgCategory5  0  0  .  .  . 
LotArea  1  0.26521  0.09511  2.79  0.0053 
YearBuilt  B  438.12761  35.59630  12.31  <.0001 
YearRemodel  1  477.79154  44.05346  10.85  <.0001 
BsmtFinSF1  B  38.52489  3.13014  12.31  <.0001 
BsmtFinSF2  B  26.20953  4.89464  5.35  <.0001 
BsmtUnfSF  B  24.25396  2.91452  8.32  <.0001 
FirstFlrSF  B  45.16961  15.16688  2.98  0.0029 
SecondFlrSF  B  42.68354  14.94773  2.86  0.0043 
GrLivArea  1  22.52528  14.92543  1.51  0.1314 
MasVnrArea  1  47.32331  4.45495  10.62  <.0001 
BedroomAbvGr  1  12206  1211.08737  10.08  <.0001 
TotRmsAbvGrd  1  3591.76057  869.75024  4.13  <.0001 
WoodDeckSF  1  22.29958  5.85747  3.81  0.0001 
GarageArea  1  40.60470  4.21044  9.64  <.0001 
TotalPorchSF  1  23.28952  6.88993  3.38  0.0007 
TotalFullBath  B  6528.68535  1769.20532  3.69  0.0002 
TotalBath  B  560.77294  1664.61745  0.34  0.7362 
PoolArea  1  56.20449  19.41995  2.89  0.0038 
MiscVal  1  10.60894  1.20582  8.80  <.0001 
MoSold  1  95.39370  253.30278  0.38  0.7065 
YrSold  B  1044.79891  521.10521  2.00  0.0451 
TotalFlrSF  0  0  .  .  . 
Sale Price = $283,643 + 26467 (BldgCategory1) – 6663.69 (BldgCategory2) + 4172.51 (BldgCategory3) + .26521 (LotArea) + 438.12 (YearBuilt) + 477.79 (YearRemodel) + 38.52 (BsmtFinSF1) + 26.21 (BsmtFinSF2) + 24.25 (BsmtUnfSF) + 45.68 (SecondFlrSF) + 22.52 (GrLivArea) + 47.32 (MasVnrArea) – 12,206 (BedrromAbvGrd) + 3591.76 (TotRmsAbvGrd) + 22.29 (WoodDeckSF) + 40.60 (GarageArea) + 23.29 (TotalPorchSF) + 6528.68 (TotalFullBath) – 560.77 (TotalBath) – 56.20 (PoolArea) – 10.61 (MiscVal) + 95.39 (MoSold) – 1044.80 (YrSold)
The above is the final adjusted model with the three nonstatistically significant variables removed but including all BldgCategories. As shown, two of the BldgCategories are not reflected in the equation as they are perfectly multicollinarity.
Part C: Validation Framework
Cross validation of data is accomplished using a training/test split of the data. Taking a step back, we will split up the data as follows:
 Training/InSample: 70% of the data is used to train the model which you will see below at just over 2,000 observations
 Test/OutofSample: 30% of the data will be held back with 100% of the data being used for testing which you will see below at around 900 observations
We will then assess the differences in the goodnessoffit statistics to see how they differ from prior models using 100% of the data and evaluate the differences.
Step 8: Training Data
Using the automated variable selection, a summary of each of the techniques are provided below. In each of the models, only the training data is used which consists of 1,672 observations instead of the full dataset which contained 2,420 observations but includes missing values.
Adjusted RSquare
Using the training observations on the data, the Adjusted RSquare model summary is as follows:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  843974  120147  7.02  <.0001  0 
BldgCategory1  1  24100  4398.68558  5.48  <.0001  3.22678 
BldgCategory2  1  8479.20085  3341.40474  2.54  0.0113  2.05397 
BldgCategory3  1  2896.38863  1795.83228  1.61  0.1070  2.42093 
RoofStyleCategory2  1  7129.59882  1371.01316  5.20  <.0001  1.27554 
LotFrontage  1  123.84500  56.34235  2.20  0.0281  1.97656 
LotArea  1  0.40106  0.15809  2.54  0.0113  1.40113 
YearRemodel  1  430.77533  60.66339  7.10  <.0001  1.82482 
BsmtFinSF1  1  8.54006  2.93508  2.91  0.0037  2.09251 
TotalBsmtSF  1  20.89315  4.00546  5.22  <.0001  3.86713 
FirstFlrSF  1  70.36493  5.55393  12.67  <.0001  5.61303 
SecondFlrSF  1  64.55788  4.51295  14.31  <.0001  4.03614 
MasVnrArea  1  47.48084  6.52991  7.27  <.0001  1.60962 
LowQualFinSF  1  37.21592  20.54811  1.81  0.0703  1.06074 
BedroomAbvGr  1  12253  1711.38313  7.16  <.0001  2.16414 
TotRmsAbvGrd  1  3437.03017  1215.36878  2.83  0.0047  4.09604 
WoodDeckSF  1  27.52888  8.75038  3.15  0.0017  1.21352 
GarageArea  1  47.20537  5.94685  7.94  <.0001  1.93937 
TotalPorchSF  1  27.81892  9.46298  2.94  0.0033  1.22996 
TotalFullBath  1  10092  2034.89844  4.96  <.0001  2.65568 
TotalHalfBath  1  3401.02263  2318.13769  1.47  0.1425  1.75272 
PoolArea  1  83.69329  27.47086  3.05  0.0024  1.12358 
MiscVal  1  23.66412  1.96647  12.03  <.0001  1.08594 
HouseAge  1  424.70078  48.81669  8.70  <.0001  2.54090 
Mallow’s Cp
Using the training observations data, the Mallow’s Cp model summary is as follows:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  840908  120154  7.00  <.0001  0 
BldgCategory1  1  24294  4388.81865  5.54  <.0001  3.21114 
BldgCategory2  1  7687.49655  3245.97774  2.37  0.0180  1.93762 
BldgCategory3  1  2984.57200  1785.88782  1.67  0.0949  2.39332 
RoofStyleCategory2  1  7211.79054  1352.13044  5.33  <.0001  1.24019 
LotFrontage  1  119.52279  55.36294  2.16  0.0310  1.90774 
LotArea  1  0.42285  0.15738  2.69  0.0073  1.38814 
YearRemodel  1  431.13897  60.67388  7.11  <.0001  1.82478 
BsmtFinSF1  1  9.55869  2.85710  3.35  0.0008  1.98208 
TotalBsmtSF  1  22.38886  3.04707  7.35  <.0001  2.23712 
GrLivArea  1  68.00277  3.96977  17.13  <.0001  4.65655 
MasVnrArea  1  48.06803  6.51989  7.37  <.0001  1.60410 
LowQualFinSF  1  30.07548  20.49185  1.47  0.1424  1.05455 
BedroomAbvGr  1  12345  1691.18699  7.30  <.0001  2.11259 
TotRmsAbvGrd  1  3366.84955  1214.59310  2.77  0.0056  4.08933 
WoodDeckSF  1  28.40475  8.73491  3.25  0.0012  1.20879 
GarageArea  1  47.13780  5.94162  7.93  <.0001  1.93526 
TotalPorchSF  1  28.21510  9.44812  2.99  0.0029  1.22565 
TotalFullBath  1  8971.49719  1892.49088  4.74  <.0001  2.29615 
PoolArea  1  84.39866  27.44621  3.08  0.0021  1.12115 
MiscVal  1  23.49573  1.96161  11.98  <.0001  1.08019 
HouseAge  1  445.57359  46.41973  9.60  <.0001  2.29666 
RSquare
Using the training observations data, the RSquare model summary is as follows:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  B  271537  1484146  0.18  0.8549  0 
BldgCategory1  1  23989  4406.21865  5.44  <.0001  3.23276 
BldgCategory2  1  8478.80283  3351.17430  2.53  0.0115  2.06276 
BldgCategory3  1  2859.88123  1798.57360  1.59  0.1120  2.42453 
BldgCategory4  0  0  .  .  .  . 
BldgCategory5  0  0  .  .  .  . 
RoofStyleCategory1  1  4257.98288  6825.03906  0.62  0.5328  8.53205 
RoofStyleCategory2  1  5053.32160  3602.34296  1.40  0.1609  8.79224 
LotFrontage  1  124.71032  56.41024  2.21  0.0272  1.97822 
LotArea  1  0.39253  0.15905  2.47  0.0137  1.41602 
YearBuilt  1  425.87291  49.72948  8.56  <.0001  2.63086 
YearRemodel  1  433.10734  61.39363  7.05  <.0001  1.86609 
BsmtFinSF1  B  29.44503  4.34360  6.78  <.0001  4.57558 
BsmtFinSF2  B  22.30864  7.29786  3.06  0.0023  1.42093 
BsmtUnfSF  B  20.74245  4.03077  5.15  <.0001  3.58348 
TotalBsmtSF  0  0  .  .  .  . 
FirstFlrSF  B  32.93950  20.96756  1.57  0.1164  79.87514 
SecondFlrSF  B  27.01162  20.57882  1.31  0.1895  83.79231 
GrLivArea  B  37.51693  20.58794  1.82  0.0686  125.09373 
MasVnrArea  1  47.34455  6.53894  7.24  <.0001  1.61155 
LowQualFinSF  0  0  .  .  .  . 
BedroomAbvGr  1  12232  1713.67326  7.14  <.0001  2.16654 
TotRmsAbvGrd  1  3446.03263  1217.54125  2.83  0.0047  4.10426 
WoodDeckSF  1  27.28586  8.80642  3.10  0.0020  1.22719 
GarageArea  1  47.71718  5.97396  7.99  <.0001  1.95403 
TotalPorchSF  1  27.57210  9.49428  2.90  0.0037  1.23617 
TotalFullBath  B  6748.88770  2463.70706  2.74  0.0062  3.88676 
TotalBath  B  3291.27701  2356.11755  1.40  0.1626  5.26565 
TotalHalfBath  0  0  .  .  .  . 
PoolArea  1  85.63831  27.60710  3.10  0.0020  1.13297 
MiscVal  1  23.62538  1.96873  12.00  <.0001  1.08673 
MoSold  1  275.81048  359.89128  0.77  0.4436  1.04079 
YrSold  1  980.82714  739.07943  1.33  0.1847  1.04896 
Forward
Using the training observations data, the Forward model summary is as follows:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  1731524  114861  15.07  <.0001  0 
BldgCategory1  1  24334  4390.34749  5.54  <.0001  3.21131 
BldgCategory2  1  7690.63757  3247.02389  2.37  0.0180  1.93762 
BldgCategory3  1  3002.36345  1786.38089  1.68  0.0930  2.39310 
RoofStyleCategory2  1  7195.33895  1352.51100  5.32  <.0001  1.24009 
LotFrontage  1  120.09066  55.38180  2.17  0.0303  1.90781 
LotArea  1  0.42441  0.15744  2.70  0.0071  1.38827 
YearBuilt  1  445.28878  46.67546  9.54  <.0001  2.31894 
YearRemodel  1  429.39054  60.83216  7.06  <.0001  1.83314 
BsmtFinSF1  1  9.50117  2.85775  3.32  0.0009  1.98170 
TotalBsmtSF  1  22.37203  3.04882  7.34  <.0001  2.23825 
GrLivArea  1  37.78666  20.55161  1.84  0.0662  124.72222 
MasVnrArea  1  48.23994  6.52019  7.40  <.0001  1.60321 
BedroomAbvGr  1  12354  1691.80579  7.30  <.0001  2.11277 
TotRmsAbvGrd  1  3374.48124  1215.02275  2.78  0.0055  4.08958 
WoodDeckSF  1  28.54544  8.73726  3.27  0.0011  1.20866 
GarageArea  1  47.16336  5.94525  7.93  <.0001  1.93637 
TotalPorchSF  1  28.42878  9.45734  3.01  0.0027  1.22725 
TotalFullBath  1  8947.56166  1894.29789  4.72  <.0001  2.29905 
PoolArea  1  83.72517  27.45360  3.05  0.0023  1.12104 
MiscVal  1  23.48982  1.96224  11.97  <.0001  1.08019 
TotalFlrSF  1  30.24018  20.49786  1.48  0.1403  123.30754 
Backward
Using the training observations data, the Backward model summary is as follows:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  869800  120710  7.21  <.0001  0 
BldgCategory1  1  18744  2991.54371  6.27  <.0001  1.47787 
BldgCategory2  1  8897.32941  2889.24938  3.08  0.0021  1.52065 
RoofStyleCategory2  1  7524.63928  1350.08448  5.57  <.0001  1.22477 
YearRemodel  1  450.24030  60.83855  7.40  <.0001  1.81739 
BsmtFinSF1  1  31.78586  3.34469  9.50  <.0001  2.69069 
BsmtFinSF2  1  26.84265  6.69302  4.01  <.0001  1.18531 
BsmtUnfSF  1  22.60594  3.04779  7.42  <.0001  2.03191 
GrLivArea  1  74.71645  3.15531  23.68  <.0001  2.91406 
MasVnrArea  1  47.22611  6.52973  7.23  <.0001  1.59376 
BedroomAbvGr  1  10932  1541.43217  7.09  <.0001  1.73845 
WoodDeckSF  1  29.99457  8.81104  3.40  0.0007  1.21835 
GarageArea  1  46.28525  5.90316  7.84  <.0001  1.89226 
TotalPorchSF  1  26.82969  9.46670  2.83  0.0047  1.21886 
TotalFullBath  1  8968.61730  1926.90743  4.65  <.0001  2.35796 
PoolArea  1  98.53460  26.97263  3.65  0.0003  1.07258 
MiscVal  1  23.02329  1.96539  11.71  <.0001  1.07412 
HouseAge  1  450.02291  46.25954  9.73  <.0001  2.25931 
Stepwise
Using the training observations data, the Stepward model summary is as follows:
Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  1  1716883  113840  15.08  <.0001  0 
BldgCategory1  1  17936  2997.75814  5.98  <.0001  1.48649 
BldgCategory2  1  10877  2942.69344  3.70  0.0002  1.58005 
RoofStyleCategory2  1  7150.53026  1352.59512  5.29  <.0001  1.23138 
YearBuilt  1  438.07997  46.34024  9.45  <.0001  2.26941 
YearRemodel  1  430.51754  60.87042  7.07  <.0001  1.82232 
BsmtFinSF1  1  8.89032  2.85083  3.12  0.0018  1.95801 
TotalBsmtSF  1  23.22825  3.01450  7.71  <.0001  2.17251 
MasVnrArea  1  46.77977  6.52778  7.17  <.0001  1.59546 
BedroomAbvGr  1  12773  1690.26994  7.56  <.0001  2.09387 
TotRmsAbvGrd  1  3298.05155  1203.11458  2.74  0.0062  3.98115 
WoodDeckSF  1  30.70852  8.74739  3.51  0.0005  1.20281 
GarageArea  1  46.23613  5.90182  7.83  <.0001  1.89455 
TotalPorchSF  1  30.09617  9.44231  3.19  0.0015  1.21461 
TotalFullBath  1  9240.18250  1897.63053  4.87  <.0001  2.29066 
PoolArea  1  86.41275  26.97373  3.20  0.0014  1.07445 
MiscVal  1  23.36203  1.96651  11.88  <.0001  1.07713 
TotalFlrSF  1  68.21494  3.92390  17.38  <.0001  4.48636 
When comparing the training data set to the full data set, a number of GoodnessofFit statistics change including:
 Root MSE training data increases from an average of 37,000 to 39,053 over the full data which is a 5.26% increase. While this is relatively small increase, we prefer a smaller Root MSE, using the training data decreases the fit of our model.
 Cp also increased from an average value of 20.598 to 21.61 which is also a relatively small increase of 4.7% over the full data. Again, we prefer to have a smaller Cp value.
 Small difference in the Adjusted RSquare moving from .802 to .79 in the training vs. full data set respectively which is a decrease of 1.07%. Unlike the Root MSE and Cp, we prefer a higher Adjusted RSquare value.
 The F Value moved from an average of 458 in the full data to an average of 327 in the training set which is a 39.9% change. A larger F value is preferred and the change in the number of observations significantly changes the indications in the fit of our model.
 AIC range in the full model had an average of 50,936 but declined to an average of 35,377 in the training sample which is a decline of almost 44%. This is expected due to the number of observations (n) being included in the formula and thus, as the number declines from the full data set to the training set, it is natural for the AIC volume to decline.
 AIC equation
 BIC range in the full model had an average of 50,936.31 which also declined to an average of 35,380 in the training sample which is also a decline of 44% this is expected as the BIC formula is based on the number of observations (n) thus, as the number of observations decline from the training data from the full data, the BIC statistic would also decline.
 BIC equation:
 There was also a decrease in the number of parameters used with the average in the full model of 23.67 and only 21.33 in the training sample which is approximately a 10.9% decrease.
Overall, the training data versus the full dataset have some minor differences in the majority of the goodnessoffit statistics. This is expected as the training set data is not going to be as accurate as utilizing all of the all the data but as long as the jumps are not significantly different then we know the model is performing well. As described above, the only significant jumps were in the AIC and BIC statistics which is a natural reflection of the number of observations. Thus, our model is performing well.
Step 9: Summary of Results and Comparisons
Model  Adj RSquare  F Value  AIC  BIC  Parameters  MSE  MAE 
Adjusted RSquare  0.7946  282.1  35,374.521  35,377.375  24  1,095,324,801.4  22,839.01 
Mallow’s Cp  0.7946  308.73  35,372.950  35,375.580  22  1,090,304,849.9  22,757.98 
RSquared  0.7943  239.98  35,377.519  35,380.472  28  1,097,933,922.6  22,829.66 
Forward  0.7929  377.41  35,378.720  35,380.870  18  1,136,636,015.7  23,222.32 
Backward  0.7926  376.62  35,381.500  35,383.590  18  1,149,916,522.5  23,222.32 
Stepwise  0.7929  377.41  35,378.720  35,380.870  18  1,136,636,015.7  23,079.73 
Assigning The average squared error (MSE) and the average absolute error (MAE) for the test sample and the validation sample are as follows:
Model  Obs  train  _TYPE_  _FREQ_  MSE_1  MAE_1 
Adjusted R Square: InSample  1  0  1  891  1,095,324,801.4  22,839.01 
Adjusted R Square: OutSample  2  1  1  2039  1,496,167,717.1  24,052.89 
Cp: InSample  1  0  1  891  1,090,304,849.9  22,757.98 
Cp: OutSample  2  1  1  2039  1,498,529,842.4  24,026.27 
RSquare: InSample  1  0  1  891  1,097,933,922.6  22,829.66 
RSquare: OutSample  2  1  1  2039  1,494,877,983.8  24,076.99 
Forward: InSample  1  0  1  891  1,136,636,015.7  23,079.73 
Forward: OutSample  2  1  1  2039  1,458,140,826.3  23,755.93 
Backward: InSample  1  0  1  891  1,149,916,522.5  23,222.32 
Backward: OutSample  2  1  1  2039  1,460,556,185.2  23,869.79 
Stepwise: InSample  1  0  1  891  1,136,636,015.7  23,079.73 
Stepwise: OutSample  2  1  1  2039  1,458,140,826.3  23,755.93 
In all of the cases above, the insample data performs worse than the outofsample data which is no surprise as we would expect the training set to be a bit worse than the test data. For the average squared error (MSE) the insample data was on average approx. 75.7% of the full data MSE. Additionally, the average absolute error (MAE) on the insample data was on average 96% of the outofsample data.
Additionally, evaluating the Average Mean Square error the lowest insample model was the Cp model whereas the best MSE out of sample was the Forward model. With the Average Absolute Error, the best inmodel was again the Cp model but the best outofsample was again the Forward model.
Step 10: Operational Validation
To change the format of “Prediction_Grade” in SAS we add in a format command as follows:
if yhat >= (SalePrice *0.9) and yhat <= (SalePrice *1.1) then
Prediction_Grade=”Grade 1″;
FORMAT Prediction_Grade $7.;
We can verify that it works in the output data variable overview:
From a business perspective, the models from both the training and test set predict house sale price within various levels of frequency and error as follows:
Train = 0 (training sample – 70%) training set  Train = 1 (All data – 100%) test set  


Hence, based on the training set, 430 or 48.26% house sale price we are able to predict within plus/minus 10% error, 162 or 18.18% of the houses sale price we are able to predict between, but not equal to, 10% – 15% error and 299 houses or 33.56% of housing sale prices we are able to predict with more than 15% error.
Similarly, based on the test set, 1,000 or 49.04% house sale price we are able to predict within plus/minus 10% error, 335 or 16.43% of the houses sale price we are able to predict between 10%15% error and 704 houses or 34.53% of housing sale prices we are able to predict with more than 15% error.
Thus, depending on the expected deliverables of the model, we will be able to determine if the model is complete or whether further reiterations are required.
Step 11: Final Model
Based on step 9, the Forward model is the best model. However, we did incorporate categorical variables and thus, need to ensure all categorical variables including the RoofStyleCategory# are added back into the model and then applied to the test data. The results are as follows:


Parameter Estimates  
Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t  Variance Inflation 
Intercept  B  341417  1039835  0.33  0.7427  0 
BldgCategory1  1  22972  3032.18797  7.58  <.0001  2.94959 
BldgCategory2  0  0  .  .  .  . 
BldgCategory3  1  3971.02347  1621.08445  2.45  0.0144  1.91404 
BldgCategory4  1  2936.69439  969.46734  3.03  0.0025  2.46653 
BldgCategory5  0  0  .  .  .  . 
RoofStyleCategory1  1  1623.80456  4884.52714  0.33  0.7396  8.81513 
RoofStyleCategory2  1  5975.39267  2560.67281  2.33  0.0197  8.98265 
LotArea  1  0.27286  0.09438  2.89  0.0039  1.23545 
YearBuilt  1  445.67105  35.41682  12.58  <.0001  2.56547 
YearRemodel  1  486.79230  43.70459  11.14  <.0001  1.86135 
BsmtFinSF1  1  37.29072  3.10915  11.99  <.0001  4.48675 
BsmtFinSF2  1  24.83558  4.86118  5.11  <.0001  1.52529 
BsmtUnfSF  1  24.03698  2.89111  8.31  <.0001  3.60834 
FirstFlrSF  B  40.82363  15.05372  2.71  0.0067  76.55732 
SecondFlrSF  B  42.24975  14.82388  2.85  0.0044  90.19210 
GrLivArea  1  25.11865  14.80217  1.70  0.0898  124.91251 
MasVnrArea  1  41.28416  4.49551  9.18  <.0001  1.45294 
BedroomAbvGr  1  12235  1201.70744  10.18  <.0001  2.21617 
TotRmsAbvGrd  1  3105.88521  865.17930  3.59  0.0003  4.14841 
WoodDeckSF  1  22.67392  5.80752  3.90  <.0001  1.21089 
GarageArea  1  40.36999  4.17657  9.67  <.0001  1.80983 
TotalPorchSF  1  23.56863  6.83097  3.45  0.0006  1.21094 
TotalFullBath  1  6582.18214  1754.82039  3.75  0.0002  3.91453 
TotalBath  1  567.87782  1650.52211  0.34  0.7308  5.38145 
PoolArea  1  54.70574  19.26009  2.84  0.0045  1.06203 
MiscVal  1  10.80135  1.19592  9.03  <.0001  1.03643 
MoSold  1  36.08454  251.33838  0.14  0.8859  1.04020 
YrSold  1  1087.09519  516.66708  2.10  0.0355  1.03946 
TotalFlrSF  0  0  .  .  .  . 
Sale Price = $341,417 + 22,972 (BldgCategory1) – 3,971.02 (BldgCategory3) + 2,936.69 (BldgCategory4) – 1,623.80 (RoofStyleCategory1) + 5,975.39 (RoofStyleCategory2) +.27286 (LotArea) + 445.67 (YearBuilt) + 486.79 (YearRemodel) + 37.29 (BsmtFinSF1) + 24.84 (BsmtFinSF2) + 24.04 (BsmtUnfSF) + 40.82 (FirstFlrSF) + 42.25 (SecondFlrSF) + 25.12 (GrLivArea) + 41.28 (MasVnrArea) – 12,235 (BedroomAbvGr) + 3,105.88 (TotRmsAbvGrd) + 22.67 (WoodDeckSF) + 40.37 (GarageArea) + 23.57 (TotalPorchSF) + 6,582.18 (TotalFullBath) – 567.88 (TotalBath) – 54.71 (PoolArea) – 10.80 (MiscVal) + 36.08 (MoSold) – 1,087.09 (YrSold)
Step 12: Reflection/Conclusion
In conclusion, the housing data from Ames, has been utilized to develop numerous regression models to determine the mean sales price of a house based on numerous variables. The variables range from highly correlated, continuous variables to categorical variables with low correlations. In this installment, we continued building the model by assigning dummy codes to categorical variables and the use of automated variable selection procedures. A validation framework was conducted and the automated variable selected variables were compared using the training sample versus the test data. Operational validation based on the values were compared and analyzed. The final and best model was also provided along with the results.
Understanding the goals of the model are key in determining if they should be employed and to what degree. Additionally, the use of these techniques is more of an art than a science. If the model already meets the needs of its purpose than perhaps these methods are not required. However, if the model is not appropriate than utilizing these techniques to transform the model to become more linear is appropriate.
Employing a process to help determine which observations are outliers and requiring an evaluation can to see if it is valid or if it is an error. If an observation is an error, then it should be removed. However, removing legitimate observations will distort the model and should not be removed. If done correctly, the transformation and deletion of outliers will benefit the model by creating a better, more appropriate model that is more linear. However, it does require additional processes and a certain finesse to determine the appropriate thresholds values. As easily observed from the above discussion, the choice of thresholds can drastically impact the number of observations it impacts and needs to be carefully selected and employed.
As reflected in this iteration in the development of the regression model based on the Ames housing data, we came up with the best model however, additional steps can still be taken. Outliers still exist and need to be further investigated to determine whether the observations are legitimate or not and should be discarded. This process while potentially time consuming, could provide a lot of insight that may or may not be very valuable. This additional insight may assist in uncovering other variables that may reduce the 20% of variability that is still not accounted for. Violation of the assumptions of constant variance and normality as seen in the plots still exists. Additionally, transformation on the data can be utilized but for simplicity and ease of understanding of potential home buyers has been left in its original dollar values. However, transforming variables will assist in an increase in the goodness of fit and potentially creating a better fitting model. Thus, if the intended audience is more sophisticated and well versed in interpreting log values, than I would recommend transforming the variables. If not, and the current model meets the desired audience and requirements than an outlier investigation as detailed above should be undertaken.