Automated Variable Selection
Automated Variable Selection
The Amex, Iowa housing data set build has been utilized to develop various iterative regression models to determine the mean sales price of a house based on numerous variables. The variables range correlated, continuous variables to categorical variables. In this installment, we continue building the model using raw categories and later, the dummy coding. Automated variable selection procedures using various selection procedures will be conducted where we will view some of the categorical variables being incorporated into the model including the entire category full variable and not just a few of the automated selected categories. Finally, a validation framework will be conducted whereby the automated variable selection will again be applied to compare the results of the training sample versus the test data. Operational validation will also be carried out with the resulting models being compared to the full test data. Finally, the best model will be provided along with the results and some potential next steps for consideration.
Part A: Dummy Coding of Categorical Variables
Part 1: Categorical Automated Variable Selection
The categorical value, BldgType has been chosen for further analysis. BldgType consists of five different types:
- 1Fam: Single-family Detached
- 2FmCon: Two-family Conversion; originally built as one-family dwelling
- Duplx: Duplex
- TwnhsE: Townhouse End Unit
- TwnhsI: Townhouse Inside Unit
The frequency of each BldgType is as follows:
BldgType | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
1Fam | 2425 | 82.76 | 2425 | 82.76 |
2fmCon | 62 | 2.12 | 2487 | 84.88 |
Duplex | 109 | 3.72 | 2596 | 88.60 |
Twnhs | 101 | 3.45 | 2697 | 92.05 |
TwnhsE | 233 | 7.95 | 2930 | 100.00 |
As we can see from above, there are a number of observations in each category with the 2 Family Complex having the least with just 62 observations. Due to this, all categories have been utilized. If a category had few observations they could be combined into one but as described is not the case. Based on the above, the means of each category are as follows:
Parameter Estimates | |||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| |
Intercept | 1 | 180677 | 2571.22135 | 70.27 | <.0001 |
BldgCategory | 1 | 91.31542 | 1619.89273 | 0.06 | 0.9550 |
0 | 1 1Fam | 2 2fmCon | 3 Duplex | |
Equation | $ 180,677.00 | $ 180,677.00 | $ 180,677.00 | $ 180,677.00 |
2825.27 x Cat# | – | 91.32 | 182.63 | 273.95 |
Predicted Value | $ 180,677.00 | $ 180,768.32 | $ 180,859.63 | $ 180,950.95 |
Mean Value | $ 135,934.06 | $ 184,812.04 | $ 125,581.71 | $ 139,808.94 |
Variance | $ 44,742.94 | $ (4,043.72) | $ 55,277.92 | $ 41,142.01 |
As shown above, for category 1 – 1Fam, the predicted value is $180,677 + (1 x $91.32) = $180,768.32 however, as shown earlier the mean value is $184,812.04 for a difference of -$4,043.72. Therefore, we know the model is wrong.
Similarly, for a 2famCon the equation from is $180,677 + (2 x $182.63) = $180,859.63 but the mean value is $125,581.71. The model is supposed to predict the mean values which clearly, it is not so again, the model is wrong.
Throughout this iteration of the regression models, transformation of values using log have not been utilized so the results can be easily understood by all potential home buyers. Using log can assist in dealing with skewed data but can also be confusing to understand.
A simple linear regression model would be:
Categorical or qualitative variables between the response variable and the regressor variable donot go through the mean of Y in each category as shown above. As we can see above with the Predicted Value plot, the observations do go through the intercept of $180,677 but do not go through the rest of the values as outlined in the variance chart above.
Categorical Variable Conclusion
We have chosen the BldgType categorical variable as the variable to predict SalePrice. BldgType has five different types. Based on the mean values, a simple linear regression model was provided to determine SalePrice based on BldType. Based on a comparison between the mean value and the expected value and the resulting variances, we know the model is incorrect. However, due to the type of variable – BlgType categorical, we need to assign dummy codes to the variables in order to be able to interpret the variables.
Step 2: Dummy Code
As qualitative variables have no natural scale of measurement, we need to assign a set of levels to them to account for the effect that the variable may have on the response through the use of indicator variables or often referred to as dummy variables (text, p. 260).
Using SAS, several dummy codes are assigned to the categorical variables in BldgType as follows:
If BldgType = ‘1Fam’ then BldgCategory1 = 1; else BldgCategory1 = 0;
If BldgType = ‘2FmCon’ then BldgCategory2 = 2; else BldgCategory2 = 0;
If BldgType = ‘Duplex’ then BldgCategory3 = 3; else BldgCategory3 = 0;
If BldgType = ‘TwnhsE’ then BldgCategory4 = 4; else BldgCategory4 = 0;
If BldgType = ‘TwnhsI’ then BldgCategory = 5; else BldgCategory5 = 0;
While there are five BldType’s one variable is used as the basis for interpretation. Thus, while we five variables, we base our model on four variables and have used BldgCategory5 as the basis of our interpretation.
The summary table is as follows:
|
|
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | 131996 | 6152.08881 | 21.46 | <.0001 | 0 |
BldgCategory1 | 1 | 52816 | 6355.48738 | 8.31 | <.0001 | 2.73653 |
BldgCategory2 | 0 | 0 | ||||
BldgCategory3 | 1 | 2604.19332 | 3239.45942 | 0.80 | 0.4215 | 1.60663 |
BldgCategory4 | 1 | 15079 | 2005.08246 | 7.52 | <.0001 | 2.23625 |
SalePrice = . Therefore, the average sale price by BldgCategroy is as follows:
- BldgCategory1: SalePrice = $131,996 + ($52,816 x 1) for Catgory1 = $184,812
- BldgCategory2: SalePrice = $131,996 + $0 for Category2 = $131,996**
- BldgCategory3: SalePrice = $131,996 + ($2,604.19 x 3) for Category3 = $139,808.57
- BldgCategory4: SalePrice = $131,996 + ($15,079 x 4) for Category 4 = $192,312
**However, in BldgCategory2 SAS alerts us to an error that the combination of variables results in perfect multicollinearity where the variance is infinite and therefore, cannot solve for As such, the variable is removed from the model which should have been done in the first place. Based on this, BldgCategory2 will be eliminated in the analysis.
It should be noted, that a BldgCategory can only have one value and thus, we can determine the SalePrice and cannot have two BldgCategory for example, a house cannot be categorized as a 1Fam and a Duplex.
Evaluating the above model for each BldgCategory:
Bldg Category | Equation | Predicted | Mean |
1 1Fam | 131,996 + 52,816 | $ 184,812 | $ 184,812 |
2 2fmCon | Eliminated – multicollinearity | N/A | $ 125,581 |
3 Duplex | 131,996 + 0 + (3*2604.19) | $ 139,808 | $ 139,808 |
4 TwnhsE | 131,996 + 0 + 0 + (4*15,079) | $ 192,312 | $ 192,312 |
The above reflection of the model if say, the BldgCategory is 1 (1Fam) the equation and the predicted value which is then compared to the mean value as detailed in Step 1. As the summary table reflects, the predicted and mean values match this, each of them matches unlike in step 1, the line goes through the mean sales price.
As the model is predicting the mean values, we know that the model is correct.
Looking at the ODS generated results, we also gain insight into how the model is performing.
Reviewing the above plots, we can see that the residuals do not form a straight line but if a curve was drawn, would curve upwards to the third line of values. The Q-Q Plot does not fit the line with heavy and light tails reflecting that there are more extreme values than would be expected if they were a normal distribution. A few outliers are also seen. The predicted values versus the sale price line do not fit the 45-degree line but Cook’s D scale looks good as the values are low. However, we also have a low Adjusted R-Square value of 0.0333.
Task 3: Hypothesis Test for each beta
Testing the null hypothesis that all of the slopes are zero, the null hypothesis is:
Ho: ß1 = ß3 = ß4 = 0
Note that the above does not include as ß2 it has been eliminated from the model as described above.
H1: ß1 ≠ 0 or ß3≠ 0 0 or ß≠ 0 at least one mean differs from zero.
The p value for ß1, ß3, ß4 are all <.0001
Rejection of H0 implies that at least one of the regressors contributes significantly to the model.
As we can see in the summary table, the p value is <.0001 in each category except BuildingCategory1, BldgCategory3 which exceeds the .0001 threshold.
Task 4: Dummy Coded Variables v2
A second categorical variable, ‘RoofStyle’ was selected. Below are a list of the categories and frequencies of each:
RoofStyle | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
Flat | 20 | 0.68 | 20 | 0.68 |
Gable | 2321 | 79.22 | 2341 | 79.90 |
Gambre | 22 | 0.75 | 2363 | 80.65 |
Hip | 551 | 18.81 | 2914 | 99.45 |
Mansar | 11 | 0.38 | 2925 | 99.83 |
Shed | 5 | 0.17 | 2930 | 100.00 |
With the frequencies of Flat, Gambre, Mansar and Shed being very low, only two categories were assigned values in SAS as follows:
- If RoofStyle = ‘Gable’ then RoofStyleCategory1 = 1; else RoofStyleCategory1 = 0;
- If RoofStyle = ‘Hip’ then RoofStyleCategory2 = 2; else RoofStyleCategory2 = 0;
Based on these two categories, the related statistical information is as follows:
|
|
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | 220753 | 3129.61887 | 70.54 | <.0001 | 0 |
RoofStyleCategory1 | 1 | -50441 | 3516.31434 | -14.34 | <.0001 | 1.00000 |
Dummy Code Conclusion
In summary, an analysis on the categorical value, BldgType was conducted using the SAS Proc Sort and Means. However, because the predicted values did not go through the mean values, we know that dummy variables are required. As such, a transformation of the BldgType was dummy coded so that the statistical information could be interpreted. Additionally, a comparison of the dummy coded variables was compared to the predicted mean reflecting that the model was correct. Hypothesis tests were provided for each of the betas reflecting the threshold of when to accept or reject the null hypothesis. Finally, a second categorical variable, RoofStyle was selected and also dummy coded with two main categories being utilized ‘Gable’ and ‘Hip’ as they accounted for 98% of the observation on the types of roofs.
Part B: Automated Variable Selection Procedures
Part 5: Automated Variable selection
Utilizing a multitude of continuous predictor categorical variables that have been dummy coded, several automated variable selection criteria were utilized to various models as follows:
proc reg data = temp outest = stepwise_summary;
- adjrsq: model SalePrice = BldgCategory1 BldgCategory2 BldgCategory3 BldgCategory4 BldgCategory5 RoofStyleCategory1 RoofStyleCategory2 LotFrontage LotArea YearBuilt YearRemodel BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF FirstFlrSF SecondFlrSF GrLivArea MasVnrArea LowQualFinSF BedroomAbvGr TotRmsAbvGrd WoodDeckSF GarageArea TotalPorchSF TotalFullBath TotalBath TotalHalfBath PoolArea MiscVal MoSold YrSold TotalFlrSF HouseAge / selection = adjrsq slentry=0.01 slstay=0.01 AIC VIF BIC MSE; output out=stepwise_out pred=yhat residual = resid ucl=ucl lcl=lcl cookd=cook covratio=cov dffits=dfits press=prss;
- cp: model SalePrice = BldgCategory1 …selection = cp slentry=0.01 slstay=0.01 …
- rsquare: model SalePrice = BldgCategory1 … selection = rsquare slentry=0.01 …
- Forward: model SalePrice = BldgCategory1 …selection = forward slentry=0.01 slstay=0.01 …
- Backward: model SalePrice = BldgCategory1 … selection = backward slentry=0.01 slstay=0.01 …
- stepwise: model SalePrice = BldgCategory1 … selection = stepwise slentry=0.01 slstay=0.01 …
2,930 observations were read, 2,420 were used and 510 had missing values.
Adjusted R-Square
The results for Adjusted R-Square:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | -861952 | 95368 | -9.04 | <.0001 | 0 |
BldgCategory1 | 1 | 26367 | 3400.45040 | 7.75 | <.0001 | 3.02363 |
BldgCategory2 | 1 | -5582.03220 | 2617.46948 | -2.13 | 0.0331 | 1.77428 |
BldgCategory3 | 1 | 3773.79916 | 1402.94914 | 2.69 | 0.0072 | 2.27981 |
RoofStyleCategory1 | 1 | -6051.65245 | 5661.14478 | -1.07 | 0.2852 | 9.36908 |
RoofStyleCategory2 | 1 | 4053.24568 | 2971.01764 | 1.36 | 0.1726 | 9.62305 |
LotFrontage | 1 | -119.37021 | 44.25236 | -2.70 | 0.0070 | 1.88815 |
LotArea | 1 | 0.48446 | 0.14046 | 3.45 | 0.0006 | 1.43016 |
YearRemodel | 1 | 445.58812 | 47.89292 | 9.30 | <.0001 | 1.85349 |
BsmtFinSF1 | 1 | 13.18428 | 2.23974 | 5.89 | <.0001 | 1.90176 |
TotalBsmtSF | 1 | 22.86654 | 2.45547 | 9.31 | <.0001 | 2.16019 |
GrLivArea | 1 | 68.42510 | 3.22248 | 21.23 | <.0001 | 4.68770 |
MasVnrArea | 1 | 46.83335 | 5.16944 | 9.06 | <.0001 | 1.53406 |
LowQualFinSF | 1 | -40.55798 | 15.95519 | -2.54 | 0.0111 | 1.06490 |
BedroomAbvGr | 1 | -12010 | 1330.85568 | -9.02 | <.0001 | 2.11672 |
TotRmsAbvGrd | 1 | 2881.62377 | 973.01912 | 2.96 | 0.0031 | 4.19309 |
WoodDeckSF | 1 | 28.06468 | 6.80774 | 4.12 | <.0001 | 1.19834 |
GarageArea | 1 | 43.63291 | 4.65153 | 9.38 | <.0001 | 1.88837 |
TotalPorchSF | 1 | 27.03325 | 7.57427 | 3.57 | 0.0004 | 1.20222 |
TotalFullBath | 1 | 7400.21598 | 1499.09748 | 4.94 | <.0001 | 2.23151 |
PoolArea | 1 | -80.30235 | 21.58185 | -3.72 | 0.0002 | 1.07930 |
MiscVal | 1 | -16.47373 | 1.53016 | -10.77 | <.0001 | 1.04987 |
HouseAge | 1 | -458.08384 | 36.28123 | -12.63 | <.0001 | 2.31941 |
Sale Price = $-861,952 + 26367 (BldgCategory1) – 5582.03 (BldgCategory2) + 3773.80 (BldgCategory3) -6051.65 (RoofStyleCategory1) + 4053.25 (RoofStyleCategory2) – 119.37 (LotFrontage) + .48446 (LotArea) + 445.58 (YearRemodel) + 13.18 (BsmtFinSF1) + 22.86 (BsmtFinSF1) + 68.42 (GrLivArea) + 46.83 (MasVnrArea) – 40.56 (LowQualFinSF) – 12010 (BedroomAbvGr) + 2881.62 (TotRmsAbvGrd) + 28.06 (WoodDeckSF) + 43.69 (GarageArea) + 27.03 (TotalPorchSF) + 7400.21 (TotalFullBath) – 80.30 (PoolArea) – 16.47 (MiscVal) – 458.08 (HouseAge
The summary tables is as follows:
|
|
Mallow’s Cp
The results for Mallow’s Cp:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | -872421 | 94866 | -9.20 | <.0001 | 0 |
BldgCategory1 | 1 | 26470 | 3399.20678 | 7.79 | <.0001 | 3.02124 |
BldgCategory2 | 1 | -5447.83069 | 2614.53487 | -2.08 | 0.0373 | 1.77020 |
BldgCategory3 | 1 | 3815.00460 | 1402.46123 | 2.72 | 0.0066 | 2.27809 |
RoofStyleCategory2 | 1 | 7020.32653 | 1059.69735 | 6.62 | <.0001 | 1.22417 |
LotFrontage | 1 | -117.69740 | 44.22599 | -2.66 | 0.0078 | 1.88579 |
LotArea | 1 | 0.48918 | 0.14040 | 3.48 | 0.0005 | 1.42875 |
YearRemodel | 1 | 447.74919 | 47.85166 | 9.36 | <.0001 | 1.85019 |
BsmtFinSF1 | 1 | 13.18016 | 2.23981 | 5.88 | <.0001 | 1.90175 |
TotalBsmtSF | 1 | 22.80852 | 2.45494 | 9.29 | <.0001 | 2.15913 |
GrLivArea | 1 | 68.62406 | 3.21719 | 21.33 | <.0001 | 4.67206 |
MasVnrArea | 1 | 46.84375 | 5.16959 | 9.06 | <.0001 | 1.53406 |
LowQualFinSF | 1 | -40.96187 | 15.95119 | -2.57 | 0.0103 | 1.06430 |
BedroomAbvGr | 1 | -12037 | 1330.65771 | -9.05 | <.0001 | 2.11596 |
TotRmsAbvGrd | 1 | 2862.99096 | 972.89193 | 2.94 | 0.0033 | 4.19174 |
WoodDeckSF | 1 | 28.06673 | 6.80794 | 4.12 | <.0001 | 1.19834 |
GarageArea | 1 | 43.37188 | 4.64525 | 9.34 | <.0001 | 1.88316 |
TotalPorchSF | 1 | 26.96148 | 7.57420 | 3.56 | 0.0004 | 1.20213 |
TotalFullBath | 1 | 7379.01541 | 1499.01089 | 4.92 | <.0001 | 2.23112 |
PoolArea | 1 | -80.82871 | 21.57688 | -3.75 | 0.0002 | 1.07874 |
MiscVal | 1 | -16.49217 | 1.53010 | -10.78 | <.0001 | 1.04974 |
HouseAge | 1 | -454.46869 | 36.12434 | -12.58 | <.0001 | 2.29926 |
Sale Price = $-872421 + 26470 (BldgCategory1) – 5447.83 (BldgCategory2) + 3815.00 (BldgCategory3) -7020.33 (RoofStyleCategory2) – 117.70 (LotFrontage) + .48918 (LotArea) + 447.75 (YearRemodel) + 13.18 (BsmtFinSF1) + 22.80 (TotalBsmtSF) + 68.62 (GrLivArea) + 46.84 (MasVnrArea) – 40.96 (LowQualFinSF) – 12037 (BedroomAbvGr) + 2862.99 (TotRmsAbvGrd) + 28.01 (WoodDeckSF) + 43.37 (GarageArea) + 26.97 (TotalPorchSF) + 7379.01 (TotalFullBath) – 80.83 (PoolArea) – 16.49 (MiscVal) – 454.47 (HouseAge)
The summary tables is as follows:
|
|
R-Square
The results for R-Square:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | B | -219912 | 1167254 | -0.19 | 0.8506 | 0 |
BldgCategory1 | 1 | 26253 | 3408.45137 | 7.70 | <.0001 | 3.03341 |
BldgCategory2 | 1 | -6028.14184 | 2695.47601 | -2.24 | 0.0254 | 1.87884 |
BldgCategory3 | 1 | 3696.74638 | 1411.21543 | 2.62 | 0.0089 | 2.30336 |
BldgCategory4 | 0 | 0 | . | . | . | . |
BldgCategory5 | 0 | 0 | . | . | . | . |
RoofStyleCategory1 | 1 | -5911.87612 | 5677.28515 | -1.04 | 0.2978 | 9.40871 |
RoofStyleCategory2 | 1 | 4046.55598 | 2978.40550 | 1.36 | 0.1744 | 9.65673 |
LotFrontage | 1 | -122.91538 | 45.12727 | -2.72 | 0.0065 | 1.96066 |
LotArea | 1 | 0.47268 | 0.14148 | 3.34 | 0.0008 | 1.44882 |
YearBuilt | 1 | 445.49962 | 38.75422 | 11.50 | <.0001 | 2.63604 |
YearRemodel | 1 | 447.66425 | 48.24002 | 9.28 | <.0001 | 1.87769 |
BsmtFinSF1 | B | 34.38828 | 3.53030 | 9.74 | <.0001 | 4.71783 |
BsmtFinSF2 | B | 21.21424 | 5.65386 | 3.75 | 0.0002 | 1.48947 |
BsmtUnfSF | B | 21.72594 | 3.28476 | 6.61 | <.0001 | 3.75777 |
TotalBsmtSF | 0 | 0 | . | . | . | . |
FirstFlrSF | B | 42.46746 | 16.37491 | 2.59 | 0.0096 | 73.54948 |
SecondFlrSF | B | 38.81104 | 16.05077 | 2.42 | 0.0157 | 80.87806 |
GrLivArea | B | 27.69946 | 16.00018 | 1.73 | 0.0835 | 115.39580 |
MasVnrArea | 1 | 46.56235 | 5.18135 | 8.99 | <.0001 | 1.53887 |
LowQualFinSF | 0 | 0 | . | . | . | . |
BedroomAbvGr | 1 | -11894 | 1348.83867 | -8.82 | <.0001 | 2.17111 |
TotRmsAbvGrd | 1 | 2896.26872 | 976.52027 | 2.97 | 0.0030 | 4.21710 |
WoodDeckSF | 1 | 27.67704 | 6.86484 | 4.03 | <.0001 | 1.21673 |
GarageArea | 1 | 43.68793 | 4.66553 | 9.36 | <.0001 | 1.89696 |
TotalPorchSF | 1 | 26.79651 | 7.60402 | 3.52 | 0.0004 | 1.20990 |
TotalFullBath | B | 6282.23495 | 1970.20252 | 3.19 | 0.0014 | 3.84876 |
TotalBath | B | 1715.20474 | 1857.44817 | 0.92 | 0.3559 | 5.20762 |
TotalHalfBath | 0 | 0 | . | . | . | . |
PoolArea | 1 | -81.57447 | 21.71996 | -3.76 | 0.0002 | 1.09155 |
MiscVal | 1 | -16.53651 | 1.53421 | -10.78 | <.0001 | 1.05389 |
MoSold | 1 | -35.18158 | 282.09733 | -0.12 | 0.9008 | 1.04584 |
YrSold | 1 | -768.34423 | 580.40773 | -1.32 | 0.1857 | 1.04540 |
Sale Price = $-219,912 + 26253 (BldgCategory1) – 6028.14 (BldgCategory2) + 3696.75 (BldgCategory3) -5911.87 (RoofStyleCategory1) + 4046.56 (RoofStyleCategory2) – 122.91 (LotFrontage) + .47268 (LotArea) + 445.50 (Year Built) + 447.66 (YearRemodel) + 34.39 (BsmtFinSF1) + 21.21 (BsmtFinSF2) + 21.72 (BsmtUnfSF) + 42.47 (FirstFlrSF) + 38.81 (SecondFlrSF) + 27.70 (GrLivArea) + 46.56 (MasVnrArea) – 11894 (BedroomAbvGr) + 2896.27 (TotRmsAbvGrd) + 27.68 (WoodDeckSF) + 43.69 (GarageArea) + 26.80 (TotalPorchSF) + 6282.23 (TotalFullBath) + 1715.20 (TotalBath) – 81.57 (PoolArea) – 16.54 (MiscVal) – 35.18 (MoSold) – 768.34 (YrSold)
The summary tables is as follows:
|
|
Forward
The results for forward:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | -1782301 | 89312 | -19.96 | <.0001 | 0 |
BldgCategory1 | 1 | 26463 | 3400.23662 | 7.78 | <.0001 | 3.02124 |
BldgCategory3 | 1 | -3667.85178 | 1743.56514 | -2.10 | 0.0355 | 1.77022 |
BldgCategory4 | 1 | 2868.62553 | 1052.15634 | 2.73 | 0.0064 | 2.27804 |
RoofStyleCategory2 | 1 | 7013.65762 | 1060.00829 | 6.62 | <.0001 | 1.22414 |
LotFrontage | 1 | -118.22437 | 44.24125 | -2.67 | 0.0076 | 1.88594 |
LotArea | 1 | 0.49109 | 0.14045 | 3.50 | 0.0005 | 1.42891 |
YearBuilt | 1 | 454.58072 | 36.31198 | 12.52 | <.0001 | 2.31614 |
YearRemodel | 1 | 446.29592 | 47.95821 | 9.31 | <.0001 | 1.85731 |
BsmtFinSF1 | 1 | 13.14202 | 2.24037 | 5.87 | <.0001 | 1.90156 |
TotalBsmtSF | 1 | 22.82616 | 2.45576 | 9.29 | <.0001 | 2.15925 |
GrLivArea | 1 | 27.46345 | 15.98052 | 1.72 | 0.0858 | 115.20517 |
MasVnrArea | 1 | 46.95673 | 5.17037 | 9.08 | <.0001 | 1.53359 |
BedroomAbvGr | 1 | -12030 | 1331.04986 | -9.04 | <.0001 | 2.11592 |
TotRmsAbvGrd | 1 | 2875.92816 | 973.25129 | 2.95 | 0.0032 | 4.19229 |
WoodDeckSF | 1 | 28.07452 | 6.81001 | 4.12 | <.0001 | 1.19834 |
GarageArea | 1 | 43.41491 | 4.64724 | 9.34 | <.0001 | 1.88363 |
TotalPorchSF | 1 | 27.03065 | 7.57905 | 3.57 | 0.0004 | 1.20293 |
TotalFullBath | 1 | 7347.02074 | 1500.33565 | 4.90 | <.0001 | 2.23370 |
PoolArea | 1 | -80.00000 | 21.58215 | -3.71 | 0.0002 | 1.07861 |
MiscVal | 1 | -16.51693 | 1.53057 | -10.79 | <.0001 | 1.04974 |
TotalFlrSF | 1 | 41.17028 | 15.95513 | 2.58 | 0.0099 | 113.41659 |
SalePrice = $-1782301 + 26463 (BldgCategory1) – 3667.85 (BldgCategory3) + 2868 (BldgCategory4) + 7013.66 (RoofStyleCategory2) – 118.22 (LotFrontage) + .49 (LotArea) + 454.58 (YearBuilt) + 446.29 (YearRemodel) + 13.14 (BsmtFinSF1) + 22.83 (TotalBsmtSF) + 27.46 (GrLivArea) + 46.96 (MasVnrArea) – 12030 (BedroomAbvGr) + 2875.93 (TotRmsAbvGrd) + 28.07 (WoodDeckSF) + 43.41 (GarageArea) + 27.03 (TotalPorchSF) + 7347.02 (TotalFullBath) – 80 (PoolArea) – 16.52 (MiscVal) + 41.17 (TotalFlrSF)
The summary tables is as follows:
|
|
Backward
The results for backward:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | -898361 | 94705 | -9.49 | <.0001 | 0 |
BldgCategory1 | 1 | 30720 | 2800.77084 | 10.97 | <.0001 | 2.04445 |
BldgCategory3 | 1 | 4968.19798 | 1312.78379 | 3.78 | 0.0002 | 1.98960 |
RoofStyleCategory2 | 1 | 7004.04276 | 1072.60999 | 6.53 | <.0001 | 1.25012 |
LotFrontage | 1 | -136.20619 | 44.75800 | -3.04 | 0.0024 | 1.92517 |
LotArea | 1 | 0.48205 | 0.14148 | 3.41 | 0.0007 | 1.44606 |
YearRemodel | 1 | 459.55343 | 47.79496 | 9.62 | <.0001 | 1.83983 |
BsmtFinSF1 | 1 | 35.81479 | 3.46619 | 10.33 | <.0001 | 4.53972 |
BsmtFinSF2 | 1 | 23.04218 | 5.59400 | 4.12 | <.0001 | 1.45543 |
BsmtUnfSF | 1 | 22.49539 | 3.25792 | 6.90 | <.0001 | 3.68985 |
FirstFlrSF | 1 | 69.32416 | 4.49442 | 15.42 | <.0001 | 5.53062 |
SecondFlrSF | 1 | 68.76492 | 3.26578 | 21.06 | <.0001 | 3.34208 |
MasVnrArea | 1 | 47.14530 | 5.17415 | 9.11 | <.0001 | 1.53178 |
BedroomAbvGr | 1 | -12245 | 1340.48877 | -9.14 | <.0001 | 2.14039 |
TotRmsAbvGrd | 1 | 2839.94970 | 956.51194 | 2.97 | 0.0030 | 4.03865 |
WoodDeckSF | 1 | 28.84852 | 6.85068 | 4.21 | <.0001 | 1.20950 |
GarageArea | 1 | 43.27370 | 4.65750 | 9.29 | <.0001 | 1.88697 |
TotalPorchSF | 1 | 28.71710 | 7.57459 | 3.79 | 0.0002 | 1.19835 |
TotalFullBath | 1 | 7117.04288 | 1512.48962 | 4.71 | <.0001 | 2.26406 |
PoolArea | 1 | -78.44319 | 21.62607 | -3.63 | 0.0003 | 1.08015 |
MiscVal | 1 | -16.57944 | 1.53399 | -10.81 | <.0001 | 1.05167 |
HouseAge | 1 | -446.97428 | 36.03808 | -12.40 | <.0001 | 2.28088 |
SalePrice = $-898,361 + 30720 BldgCategory1 + 4968.2 BldgCategory3 + 7004.04 RoofStyleCategory2 – 136.21 LotFrontage + .48205 LotArea + 459.55 YearRemodel + 35.81 BsmtFinSF1 + 23.04218 BsmtFinSF2 + 22.49 BsmtUnfSF + 69.32 FirstFlrSF + 68.76 SecondFlrsf + 47.14 MasVnrArea -12245 BedroomAbvGr + 2839.95 TotRmsAbvGrd + 28.85 WoodDeckSF + 43.273 GarageArea + 28.72 TotalPorchSF + 7117.04 TotalFullBath – 78.44 PoolArea -16.58 MiscVal – 446.97 HouseAge.
The summary tables is as follows:
|
|
Stepwise
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | -1791258 | 88579 | -20.22 | <.0001 | 0 |
BldgCategory1 | 1 | 30720 | 2788.97315 | 11.01 | <.0001 | 2.02766 |
BldgCategory3 | 1 | 4993.99705 | 1312.10462 | 3.81 | 0.0001 | 1.98793 |
RoofStyleCategory2 | 1 | 7025.30365 | 1060.92716 | 6.62 | <.0001 | 1.22327 |
LotFrontage | 1 | -135.37629 | 43.61445 | -3.10 | 0.0019 | 1.82841 |
LotArea | 1 | 0.48669 | 0.14061 | 3.46 | 0.0005 | 1.42864 |
YearBuilt | 1 | 446.51498 | 36.11812 | 12.36 | <.0001 | 2.28588 |
YearRemodel | 1 | 457.74856 | 47.74653 | 9.59 | <.0001 | 1.83646 |
BsmtFinSF1 | 1 | 13.24029 | 2.23817 | 5.92 | <.0001 | 1.89319 |
TotalBsmtSF | 1 | 22.84988 | 2.45267 | 9.32 | <.0001 | 2.14857 |
MasVnrArea | 1 | 47.24907 | 5.17090 | 9.14 | <.0001 | 1.53016 |
BedroomAbvGr | 1 | -12264 | 1328.27884 | -9.23 | <.0001 | 2.10198 |
TotRmsAbvGrd | 1 | 2841.91110 | 953.94485 | 2.98 | 0.0029 | 4.01779 |
WoodDeckSF | 1 | 28.94598 | 6.80815 | 4.25 | <.0001 | 1.19476 |
GarageArea | 1 | 43.34671 | 4.65270 | 9.32 | <.0001 | 1.88345 |
TotalPorchSF | 1 | 28.74800 | 7.56187 | 3.80 | 0.0001 | 1.19457 |
TotalFullBath | 1 | 7112.83010 | 1493.88128 | 4.76 | <.0001 | 2.20913 |
PoolArea | 1 | -77.40687 | 21.58338 | -3.59 | 0.0003 | 1.07610 |
MiscVal | 1 | -16.59398 | 1.53219 | -10.83 | <.0001 | 1.04940 |
TotalFlrSF | 1 | 68.87176 | 3.18078 | 21.65 | <.0001 | 4.49660 |
The results for Stepwise:
SalePrice = $-1,791,258 + 30,720 (BldgCategory1) + 4993.997 (BldgCategory3)+7,025.3 (RoofStyleCategory2) – 135.38 (Lot Frontage) + .48669 (LotArea) + 446.51498 (YearBuilt) + 457.74856 (YearRemodel) + 13.24029 (BsmtFinSF1) + 22.84988 (TotalBsmtSF) + 47.24907 (MasVnrArea) -12264 (BedroomAbvGr) + 2841.91110 (TotRmsAbvGrd) + 28.94598 (WoodDeckSF) + 43.34671 (GarageArea) + 28.748 (TotalPorchSF) + 7112.83 (TotalFullBath) -77.40687 (PoolArea) – 16.59398 (MiscVal) + 68.87176 (TotalFlrSF)
The summary tables is as follows:
|
|
An overall summary of each model using the automated variable selection is as follows:
Model | Root MSE | Cp | R-Square | Adj R-Square | F Value | AIC | BIC | Parameters |
Adjusted R-Square | 36,972 | 20.57885 | 0.8050 | 0.8032 | 449.71 | 50934.936 | 50937.528 | 23 |
Mallow’s Cp | 36,974 | 20.57946 | 0.8049 | 0.8032 | 471.04 | 50928.816 | 50931.282 | 22 |
R-Squared | 37,000 | 20.59402 | 0.8051 | 0.8029 | 365.95 | 50938.175 | 50940.830 | 28 |
Forward | 36,985 | 20.58572 | 0.8048 | 0.8031 | 470.69 | 50930.288 | 50932.727 | 22 |
Backward | 37,034 | 20.61288 | 0.8042 | 0.8025 | 469.15 | 50936.669 | 50938.992 | 22 |
Stepwise | 37,030 | 20.61087 | 0.8041 | 0.8026 | 518.55 | 50934.214 | 50936.482 | 20 |
Based on the above summary, each technique created a different model based on the variables selected. As reflected above, the Adjusted R-Square and Mallow’s Cp are the most similar with an identical Adjusted R-Square and very similar Root MSE and Cp. The Forward model is seemed to be in the middle of the Adjusted R-Square model and Mallow’s Cp with BIC AIC values within the range of the other two models.
The number of parameters used in each of the models also varied significantly with the minimum of 20 used in the Stepwise model and the maximum of 28 in the R-Squared model. Choosing the best regression model often considers the number of regressions to be as accurate as possible but also needs to consider the costs of attaining the data and maintaining the model thus, balancing it with fewer regressors as possible due to the variance of the prediction increases as regressors increase (p. 328). Additionally, we can see in the above summary table that the F Value of the R-Squared model is lower than the Adjusted R-Square and Mallow’s Cp model. The other consideration is that we know that R-Squared increases when additional predictors are added to a model thus, a model with more terms may appear to have a better fit because of the increased terms.
Based on the above while considering the number of parameters, I think the Adjusted R-Square is the best model.
Sale Price = $-861,952 + 26367 (BldgCategory1) – 5582.03 (BldgCategory2) + 3773.80 (BldgCategory3) -6051.65 (RoofStyleCategory1) + 4053.25 (RoofStyleCategory2) – 119.37 (LotFrontage) + .48446 (LotArea) + 445.58 (YearRemodel) + 13.18 (BsmtFinSF1) + 22.86 (BsmtFinSF1) + 68.42 (GrLivArea) + 46.83 (MasVnrArea) – 40.56 (LowQualFinSF) – 12010 (BedroomAbvGr) + 2881.62 (TotRmsAbvGrd) + 28.06 (WoodDeckSF) + 43.69 (GarageArea) + 27.03 (TotalPorchSF) + 7400.21 (TotalFullBath) – 80.30 (PoolArea) – 16.47 (MiscVal) – 458.08 (HouseAge)
Automated Variable Selection Conclusion – Full Data Set
Based on the above, each modelling technique created a different regression model all resulting in different resulting statistics. While some of the Goodness of Fit statistics had minimal differences between the model’s others were more significant. I believe the R-Square is the best model due to the highest R-Square adjusted value, R-Square, F Value and AIC despite having more variables over the other models. This is based on the logic that the variables and able to be acquired easily with no a very low cost and no obstacles exist in attaining the data.
Step 6: Multicollinarity on Automated Variable Selection
Dummy coded variables were both selected in the automated selection process – BldgCategory# and RoofStyleCategory#. While RoofStyleCategory only has two items and both have been included in the chosen model but in the BldgCategory only three of the variables have been automatically selected. However, the other two BldgCategory4 and BldgCategory5. This is due to perfect multicollinarity where the variance inflation factor is infinite and thus, cannot invert the matrix and solve for beta. Thus, SAS informs us that the model is not full rank and that the estimate is biased which is reflected with a 0 or B in the Degrees of Freedom column.
Thus, despite adding this category to the model, the additional variables (BldgCategory4 and BldgCategory5) do not appear.
Evaluating the continuous variables to determine if they are statistically significant, we have several variables where the p-value is larger than 0.0001 and thus, cannot conclude that a significant difference exists. These variables are as follows:
- LotFrontage: p=.007
- LowQualFinSF: p = .0111
Below 0.0001, significant. Over 0.0001, not significant.
Analysis of Variance | |||||
Source | DF | Sum of Squares |
Mean Square |
F Value | Pr > F |
Model | 24 | 1.468993E13 | 6.120806E11 | 464.09 | <.0001 |
Error | 2879 | 3.797091E12 | 1318892283 | ||
Corrected Total | 2903 | 1.848703E13 |
Root MSE | 36317 | R-Square | 0.7946 |
Dependent Mean | 180389 | Adj R-Sq | 0.7929 |
Coeff Var | 20.13231 |
Parameter Estimates | |||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| |
Intercept | B | 283643 | 1048796 | 0.27 | 0.7868 |
BldgCategory1 | 1 | 23476 | 3055.76203 | 7.68 | <.0001 |
BldgCategory2 | 1 | -6663.69163 | 2446.63474 | -2.72 | 0.0065 |
BldgCategory3 | 1 | 4172.51115 | 1303.12090 | 3.20 | 0.0014 |
BldgCategory4 | 0 | 0 | . | . | . |
BldgCategory5 | 0 | 0 | . | . | . |
LotArea | 1 | 0.26521 | 0.09511 | 2.79 | 0.0053 |
YearBuilt | B | 438.12761 | 35.59630 | 12.31 | <.0001 |
YearRemodel | 1 | 477.79154 | 44.05346 | 10.85 | <.0001 |
BsmtFinSF1 | B | 38.52489 | 3.13014 | 12.31 | <.0001 |
BsmtFinSF2 | B | 26.20953 | 4.89464 | 5.35 | <.0001 |
BsmtUnfSF | B | 24.25396 | 2.91452 | 8.32 | <.0001 |
FirstFlrSF | B | 45.16961 | 15.16688 | 2.98 | 0.0029 |
SecondFlrSF | B | 42.68354 | 14.94773 | 2.86 | 0.0043 |
GrLivArea | 1 | 22.52528 | 14.92543 | 1.51 | 0.1314 |
MasVnrArea | 1 | 47.32331 | 4.45495 | 10.62 | <.0001 |
BedroomAbvGr | 1 | -12206 | 1211.08737 | -10.08 | <.0001 |
TotRmsAbvGrd | 1 | 3591.76057 | 869.75024 | 4.13 | <.0001 |
WoodDeckSF | 1 | 22.29958 | 5.85747 | 3.81 | 0.0001 |
GarageArea | 1 | 40.60470 | 4.21044 | 9.64 | <.0001 |
TotalPorchSF | 1 | 23.28952 | 6.88993 | 3.38 | 0.0007 |
TotalFullBath | B | 6528.68535 | 1769.20532 | 3.69 | 0.0002 |
TotalBath | B | -560.77294 | 1664.61745 | -0.34 | 0.7362 |
PoolArea | 1 | -56.20449 | 19.41995 | -2.89 | 0.0038 |
MiscVal | 1 | -10.60894 | 1.20582 | -8.80 | <.0001 |
MoSold | 1 | 95.39370 | 253.30278 | 0.38 | 0.7065 |
YrSold | B | -1044.79891 | 521.10521 | -2.00 | 0.0451 |
TotalFlrSF | 0 | 0 | . | . | . |
Sale Price = $283,643 + 26467 (BldgCategory1) – 6663.69 (BldgCategory2) + 4172.51 (BldgCategory3) + .26521 (LotArea) + 438.12 (YearBuilt) + 477.79 (YearRemodel) + 38.52 (BsmtFinSF1) + 26.21 (BsmtFinSF2) + 24.25 (BsmtUnfSF) + 45.68 (SecondFlrSF) + 22.52 (GrLivArea) + 47.32 (MasVnrArea) – 12,206 (BedrromAbvGrd) + 3591.76 (TotRmsAbvGrd) + 22.29 (WoodDeckSF) + 40.60 (GarageArea) + 23.29 (TotalPorchSF) + 6528.68 (TotalFullBath) – 560.77 (TotalBath) – 56.20 (PoolArea) – 10.61 (MiscVal) + 95.39 (MoSold) – 1044.80 (YrSold)
The above is the final adjusted model with the three non-statistically significant variables removed but including all BldgCategories. As shown, two of the BldgCategories are not reflected in the equation as they are perfectly multicollinarity.
Part C: Validation Framework
Cross validation of data is accomplished using a training/test split of the data. Taking a step back, we will split up the data as follows:
- Training/In-Sample: 70% of the data is used to train the model which you will see below at just over 2,000 observations
- Test/Out-of-Sample: 30% of the data will be held back with 100% of the data being used for testing which you will see below at around 900 observations
We will then assess the differences in the goodness-of-fit statistics to see how they differ from prior models using 100% of the data and evaluate the differences.
Step 8: Training Data
Using the automated variable selection, a summary of each of the techniques are provided below. In each of the models, only the training data is used which consists of 1,672 observations instead of the full dataset which contained 2,420 observations but includes missing values.
Adjusted R-Square
Using the training observations on the data, the Adjusted R-Square model summary is as follows:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | -843974 | 120147 | -7.02 | <.0001 | 0 |
BldgCategory1 | 1 | 24100 | 4398.68558 | 5.48 | <.0001 | 3.22678 |
BldgCategory2 | 1 | -8479.20085 | 3341.40474 | -2.54 | 0.0113 | 2.05397 |
BldgCategory3 | 1 | 2896.38863 | 1795.83228 | 1.61 | 0.1070 | 2.42093 |
RoofStyleCategory2 | 1 | 7129.59882 | 1371.01316 | 5.20 | <.0001 | 1.27554 |
LotFrontage | 1 | -123.84500 | 56.34235 | -2.20 | 0.0281 | 1.97656 |
LotArea | 1 | 0.40106 | 0.15809 | 2.54 | 0.0113 | 1.40113 |
YearRemodel | 1 | 430.77533 | 60.66339 | 7.10 | <.0001 | 1.82482 |
BsmtFinSF1 | 1 | 8.54006 | 2.93508 | 2.91 | 0.0037 | 2.09251 |
TotalBsmtSF | 1 | 20.89315 | 4.00546 | 5.22 | <.0001 | 3.86713 |
FirstFlrSF | 1 | 70.36493 | 5.55393 | 12.67 | <.0001 | 5.61303 |
SecondFlrSF | 1 | 64.55788 | 4.51295 | 14.31 | <.0001 | 4.03614 |
MasVnrArea | 1 | 47.48084 | 6.52991 | 7.27 | <.0001 | 1.60962 |
LowQualFinSF | 1 | 37.21592 | 20.54811 | 1.81 | 0.0703 | 1.06074 |
BedroomAbvGr | 1 | -12253 | 1711.38313 | -7.16 | <.0001 | 2.16414 |
TotRmsAbvGrd | 1 | 3437.03017 | 1215.36878 | 2.83 | 0.0047 | 4.09604 |
WoodDeckSF | 1 | 27.52888 | 8.75038 | 3.15 | 0.0017 | 1.21352 |
GarageArea | 1 | 47.20537 | 5.94685 | 7.94 | <.0001 | 1.93937 |
TotalPorchSF | 1 | 27.81892 | 9.46298 | 2.94 | 0.0033 | 1.22996 |
TotalFullBath | 1 | 10092 | 2034.89844 | 4.96 | <.0001 | 2.65568 |
TotalHalfBath | 1 | 3401.02263 | 2318.13769 | 1.47 | 0.1425 | 1.75272 |
PoolArea | 1 | -83.69329 | 27.47086 | -3.05 | 0.0024 | 1.12358 |
MiscVal | 1 | -23.66412 | 1.96647 | -12.03 | <.0001 | 1.08594 |
HouseAge | 1 | -424.70078 | 48.81669 | -8.70 | <.0001 | 2.54090 |
Mallow’s Cp
Using the training observations data, the Mallow’s Cp model summary is as follows:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | -840908 | 120154 | -7.00 | <.0001 | 0 |
BldgCategory1 | 1 | 24294 | 4388.81865 | 5.54 | <.0001 | 3.21114 |
BldgCategory2 | 1 | -7687.49655 | 3245.97774 | -2.37 | 0.0180 | 1.93762 |
BldgCategory3 | 1 | 2984.57200 | 1785.88782 | 1.67 | 0.0949 | 2.39332 |
RoofStyleCategory2 | 1 | 7211.79054 | 1352.13044 | 5.33 | <.0001 | 1.24019 |
LotFrontage | 1 | -119.52279 | 55.36294 | -2.16 | 0.0310 | 1.90774 |
LotArea | 1 | 0.42285 | 0.15738 | 2.69 | 0.0073 | 1.38814 |
YearRemodel | 1 | 431.13897 | 60.67388 | 7.11 | <.0001 | 1.82478 |
BsmtFinSF1 | 1 | 9.55869 | 2.85710 | 3.35 | 0.0008 | 1.98208 |
TotalBsmtSF | 1 | 22.38886 | 3.04707 | 7.35 | <.0001 | 2.23712 |
GrLivArea | 1 | 68.00277 | 3.96977 | 17.13 | <.0001 | 4.65655 |
MasVnrArea | 1 | 48.06803 | 6.51989 | 7.37 | <.0001 | 1.60410 |
LowQualFinSF | 1 | -30.07548 | 20.49185 | -1.47 | 0.1424 | 1.05455 |
BedroomAbvGr | 1 | -12345 | 1691.18699 | -7.30 | <.0001 | 2.11259 |
TotRmsAbvGrd | 1 | 3366.84955 | 1214.59310 | 2.77 | 0.0056 | 4.08933 |
WoodDeckSF | 1 | 28.40475 | 8.73491 | 3.25 | 0.0012 | 1.20879 |
GarageArea | 1 | 47.13780 | 5.94162 | 7.93 | <.0001 | 1.93526 |
TotalPorchSF | 1 | 28.21510 | 9.44812 | 2.99 | 0.0029 | 1.22565 |
TotalFullBath | 1 | 8971.49719 | 1892.49088 | 4.74 | <.0001 | 2.29615 |
PoolArea | 1 | -84.39866 | 27.44621 | -3.08 | 0.0021 | 1.12115 |
MiscVal | 1 | -23.49573 | 1.96161 | -11.98 | <.0001 | 1.08019 |
HouseAge | 1 | -445.57359 | 46.41973 | -9.60 | <.0001 | 2.29666 |
R-Square
Using the training observations data, the R-Square model summary is as follows:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | B | 271537 | 1484146 | 0.18 | 0.8549 | 0 |
BldgCategory1 | 1 | 23989 | 4406.21865 | 5.44 | <.0001 | 3.23276 |
BldgCategory2 | 1 | -8478.80283 | 3351.17430 | -2.53 | 0.0115 | 2.06276 |
BldgCategory3 | 1 | 2859.88123 | 1798.57360 | 1.59 | 0.1120 | 2.42453 |
BldgCategory4 | 0 | 0 | . | . | . | . |
BldgCategory5 | 0 | 0 | . | . | . | . |
RoofStyleCategory1 | 1 | -4257.98288 | 6825.03906 | -0.62 | 0.5328 | 8.53205 |
RoofStyleCategory2 | 1 | 5053.32160 | 3602.34296 | 1.40 | 0.1609 | 8.79224 |
LotFrontage | 1 | -124.71032 | 56.41024 | -2.21 | 0.0272 | 1.97822 |
LotArea | 1 | 0.39253 | 0.15905 | 2.47 | 0.0137 | 1.41602 |
YearBuilt | 1 | 425.87291 | 49.72948 | 8.56 | <.0001 | 2.63086 |
YearRemodel | 1 | 433.10734 | 61.39363 | 7.05 | <.0001 | 1.86609 |
BsmtFinSF1 | B | 29.44503 | 4.34360 | 6.78 | <.0001 | 4.57558 |
BsmtFinSF2 | B | 22.30864 | 7.29786 | 3.06 | 0.0023 | 1.42093 |
BsmtUnfSF | B | 20.74245 | 4.03077 | 5.15 | <.0001 | 3.58348 |
TotalBsmtSF | 0 | 0 | . | . | . | . |
FirstFlrSF | B | 32.93950 | 20.96756 | 1.57 | 0.1164 | 79.87514 |
SecondFlrSF | B | 27.01162 | 20.57882 | 1.31 | 0.1895 | 83.79231 |
GrLivArea | B | 37.51693 | 20.58794 | 1.82 | 0.0686 | 125.09373 |
MasVnrArea | 1 | 47.34455 | 6.53894 | 7.24 | <.0001 | 1.61155 |
LowQualFinSF | 0 | 0 | . | . | . | . |
BedroomAbvGr | 1 | -12232 | 1713.67326 | -7.14 | <.0001 | 2.16654 |
TotRmsAbvGrd | 1 | 3446.03263 | 1217.54125 | 2.83 | 0.0047 | 4.10426 |
WoodDeckSF | 1 | 27.28586 | 8.80642 | 3.10 | 0.0020 | 1.22719 |
GarageArea | 1 | 47.71718 | 5.97396 | 7.99 | <.0001 | 1.95403 |
TotalPorchSF | 1 | 27.57210 | 9.49428 | 2.90 | 0.0037 | 1.23617 |
TotalFullBath | B | 6748.88770 | 2463.70706 | 2.74 | 0.0062 | 3.88676 |
TotalBath | B | 3291.27701 | 2356.11755 | 1.40 | 0.1626 | 5.26565 |
TotalHalfBath | 0 | 0 | . | . | . | . |
PoolArea | 1 | -85.63831 | 27.60710 | -3.10 | 0.0020 | 1.13297 |
MiscVal | 1 | -23.62538 | 1.96873 | -12.00 | <.0001 | 1.08673 |
MoSold | 1 | -275.81048 | 359.89128 | -0.77 | 0.4436 | 1.04079 |
YrSold | 1 | -980.82714 | 739.07943 | -1.33 | 0.1847 | 1.04896 |
Forward
Using the training observations data, the Forward model summary is as follows:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | -1731524 | 114861 | -15.07 | <.0001 | 0 |
BldgCategory1 | 1 | 24334 | 4390.34749 | 5.54 | <.0001 | 3.21131 |
BldgCategory2 | 1 | -7690.63757 | 3247.02389 | -2.37 | 0.0180 | 1.93762 |
BldgCategory3 | 1 | 3002.36345 | 1786.38089 | 1.68 | 0.0930 | 2.39310 |
RoofStyleCategory2 | 1 | 7195.33895 | 1352.51100 | 5.32 | <.0001 | 1.24009 |
LotFrontage | 1 | -120.09066 | 55.38180 | -2.17 | 0.0303 | 1.90781 |
LotArea | 1 | 0.42441 | 0.15744 | 2.70 | 0.0071 | 1.38827 |
YearBuilt | 1 | 445.28878 | 46.67546 | 9.54 | <.0001 | 2.31894 |
YearRemodel | 1 | 429.39054 | 60.83216 | 7.06 | <.0001 | 1.83314 |
BsmtFinSF1 | 1 | 9.50117 | 2.85775 | 3.32 | 0.0009 | 1.98170 |
TotalBsmtSF | 1 | 22.37203 | 3.04882 | 7.34 | <.0001 | 2.23825 |
GrLivArea | 1 | 37.78666 | 20.55161 | 1.84 | 0.0662 | 124.72222 |
MasVnrArea | 1 | 48.23994 | 6.52019 | 7.40 | <.0001 | 1.60321 |
BedroomAbvGr | 1 | -12354 | 1691.80579 | -7.30 | <.0001 | 2.11277 |
TotRmsAbvGrd | 1 | 3374.48124 | 1215.02275 | 2.78 | 0.0055 | 4.08958 |
WoodDeckSF | 1 | 28.54544 | 8.73726 | 3.27 | 0.0011 | 1.20866 |
GarageArea | 1 | 47.16336 | 5.94525 | 7.93 | <.0001 | 1.93637 |
TotalPorchSF | 1 | 28.42878 | 9.45734 | 3.01 | 0.0027 | 1.22725 |
TotalFullBath | 1 | 8947.56166 | 1894.29789 | 4.72 | <.0001 | 2.29905 |
PoolArea | 1 | -83.72517 | 27.45360 | -3.05 | 0.0023 | 1.12104 |
MiscVal | 1 | -23.48982 | 1.96224 | -11.97 | <.0001 | 1.08019 |
TotalFlrSF | 1 | 30.24018 | 20.49786 | 1.48 | 0.1403 | 123.30754 |
Backward
Using the training observations data, the Backward model summary is as follows:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | -869800 | 120710 | -7.21 | <.0001 | 0 |
BldgCategory1 | 1 | 18744 | 2991.54371 | 6.27 | <.0001 | 1.47787 |
BldgCategory2 | 1 | -8897.32941 | 2889.24938 | -3.08 | 0.0021 | 1.52065 |
RoofStyleCategory2 | 1 | 7524.63928 | 1350.08448 | 5.57 | <.0001 | 1.22477 |
YearRemodel | 1 | 450.24030 | 60.83855 | 7.40 | <.0001 | 1.81739 |
BsmtFinSF1 | 1 | 31.78586 | 3.34469 | 9.50 | <.0001 | 2.69069 |
BsmtFinSF2 | 1 | 26.84265 | 6.69302 | 4.01 | <.0001 | 1.18531 |
BsmtUnfSF | 1 | 22.60594 | 3.04779 | 7.42 | <.0001 | 2.03191 |
GrLivArea | 1 | 74.71645 | 3.15531 | 23.68 | <.0001 | 2.91406 |
MasVnrArea | 1 | 47.22611 | 6.52973 | 7.23 | <.0001 | 1.59376 |
BedroomAbvGr | 1 | -10932 | 1541.43217 | -7.09 | <.0001 | 1.73845 |
WoodDeckSF | 1 | 29.99457 | 8.81104 | 3.40 | 0.0007 | 1.21835 |
GarageArea | 1 | 46.28525 | 5.90316 | 7.84 | <.0001 | 1.89226 |
TotalPorchSF | 1 | 26.82969 | 9.46670 | 2.83 | 0.0047 | 1.21886 |
TotalFullBath | 1 | 8968.61730 | 1926.90743 | 4.65 | <.0001 | 2.35796 |
PoolArea | 1 | -98.53460 | 26.97263 | -3.65 | 0.0003 | 1.07258 |
MiscVal | 1 | -23.02329 | 1.96539 | -11.71 | <.0001 | 1.07412 |
HouseAge | 1 | -450.02291 | 46.25954 | -9.73 | <.0001 | 2.25931 |
Stepwise
Using the training observations data, the Stepward model summary is as follows:
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | -1716883 | 113840 | -15.08 | <.0001 | 0 |
BldgCategory1 | 1 | 17936 | 2997.75814 | 5.98 | <.0001 | 1.48649 |
BldgCategory2 | 1 | -10877 | 2942.69344 | -3.70 | 0.0002 | 1.58005 |
RoofStyleCategory2 | 1 | 7150.53026 | 1352.59512 | 5.29 | <.0001 | 1.23138 |
YearBuilt | 1 | 438.07997 | 46.34024 | 9.45 | <.0001 | 2.26941 |
YearRemodel | 1 | 430.51754 | 60.87042 | 7.07 | <.0001 | 1.82232 |
BsmtFinSF1 | 1 | 8.89032 | 2.85083 | 3.12 | 0.0018 | 1.95801 |
TotalBsmtSF | 1 | 23.22825 | 3.01450 | 7.71 | <.0001 | 2.17251 |
MasVnrArea | 1 | 46.77977 | 6.52778 | 7.17 | <.0001 | 1.59546 |
BedroomAbvGr | 1 | -12773 | 1690.26994 | -7.56 | <.0001 | 2.09387 |
TotRmsAbvGrd | 1 | 3298.05155 | 1203.11458 | 2.74 | 0.0062 | 3.98115 |
WoodDeckSF | 1 | 30.70852 | 8.74739 | 3.51 | 0.0005 | 1.20281 |
GarageArea | 1 | 46.23613 | 5.90182 | 7.83 | <.0001 | 1.89455 |
TotalPorchSF | 1 | 30.09617 | 9.44231 | 3.19 | 0.0015 | 1.21461 |
TotalFullBath | 1 | 9240.18250 | 1897.63053 | 4.87 | <.0001 | 2.29066 |
PoolArea | 1 | -86.41275 | 26.97373 | -3.20 | 0.0014 | 1.07445 |
MiscVal | 1 | -23.36203 | 1.96651 | -11.88 | <.0001 | 1.07713 |
TotalFlrSF | 1 | 68.21494 | 3.92390 | 17.38 | <.0001 | 4.48636 |
When comparing the training data set to the full data set, a number of Goodness-of-Fit statistics change including:
- Root MSE training data increases from an average of 37,000 to 39,053 over the full data which is a 5.26% increase. While this is relatively small increase, we prefer a smaller Root MSE, using the training data decreases the fit of our model.
- Cp also increased from an average value of 20.598 to 21.61 which is also a relatively small increase of 4.7% over the full data. Again, we prefer to have a smaller Cp value.
- Small difference in the Adjusted R-Square moving from .802 to .79 in the training vs. full data set respectively which is a decrease of 1.07%. Unlike the Root MSE and Cp, we prefer a higher Adjusted R-Square value.
- The F Value moved from an average of 458 in the full data to an average of 327 in the training set which is a 39.9% change. A larger F value is preferred and the change in the number of observations significantly changes the indications in the fit of our model.
- AIC range in the full model had an average of 50,936 but declined to an average of 35,377 in the training sample which is a decline of almost 44%. This is expected due to the number of observations (n) being included in the formula and thus, as the number declines from the full data set to the training set, it is natural for the AIC volume to decline.
- AIC equation
- BIC range in the full model had an average of 50,936.31 which also declined to an average of 35,380 in the training sample which is also a decline of 44% this is expected as the BIC formula is based on the number of observations (n) thus, as the number of observations decline from the training data from the full data, the BIC statistic would also decline.
- BIC equation:
- There was also a decrease in the number of parameters used with the average in the full model of 23.67 and only 21.33 in the training sample which is approximately a 10.9% decrease.
Overall, the training data versus the full dataset have some minor differences in the majority of the goodness-of-fit statistics. This is expected as the training set data is not going to be as accurate as utilizing all of the all the data but as long as the jumps are not significantly different then we know the model is performing well. As described above, the only significant jumps were in the AIC and BIC statistics which is a natural reflection of the number of observations. Thus, our model is performing well.
Step 9: Summary of Results and Comparisons
Model | Adj R-Square | F Value | AIC | BIC | Parameters | MSE | MAE |
Adjusted R-Square | 0.7946 | 282.1 | 35,374.521 | 35,377.375 | 24 | 1,095,324,801.4 | 22,839.01 |
Mallow’s Cp | 0.7946 | 308.73 | 35,372.950 | 35,375.580 | 22 | 1,090,304,849.9 | 22,757.98 |
R-Squared | 0.7943 | 239.98 | 35,377.519 | 35,380.472 | 28 | 1,097,933,922.6 | 22,829.66 |
Forward | 0.7929 | 377.41 | 35,378.720 | 35,380.870 | 18 | 1,136,636,015.7 | 23,222.32 |
Backward | 0.7926 | 376.62 | 35,381.500 | 35,383.590 | 18 | 1,149,916,522.5 | 23,222.32 |
Stepwise | 0.7929 | 377.41 | 35,378.720 | 35,380.870 | 18 | 1,136,636,015.7 | 23,079.73 |
Assigning The average squared error (MSE) and the average absolute error (MAE) for the test sample and the validation sample are as follows:
Model | Obs | train | _TYPE_ | _FREQ_ | MSE_1 | MAE_1 |
Adjusted R Square: In-Sample | 1 | 0 | 1 | 891 | 1,095,324,801.4 | 22,839.01 |
Adjusted R Square: Out-Sample | 2 | 1 | 1 | 2039 | 1,496,167,717.1 | 24,052.89 |
Cp: In-Sample | 1 | 0 | 1 | 891 | 1,090,304,849.9 | 22,757.98 |
Cp: Out-Sample | 2 | 1 | 1 | 2039 | 1,498,529,842.4 | 24,026.27 |
R-Square: In-Sample | 1 | 0 | 1 | 891 | 1,097,933,922.6 | 22,829.66 |
R-Square: Out-Sample | 2 | 1 | 1 | 2039 | 1,494,877,983.8 | 24,076.99 |
Forward: In-Sample | 1 | 0 | 1 | 891 | 1,136,636,015.7 | 23,079.73 |
Forward: Out-Sample | 2 | 1 | 1 | 2039 | 1,458,140,826.3 | 23,755.93 |
Backward: In-Sample | 1 | 0 | 1 | 891 | 1,149,916,522.5 | 23,222.32 |
Backward: Out-Sample | 2 | 1 | 1 | 2039 | 1,460,556,185.2 | 23,869.79 |
Stepwise: In-Sample | 1 | 0 | 1 | 891 | 1,136,636,015.7 | 23,079.73 |
Stepwise: Out-Sample | 2 | 1 | 1 | 2039 | 1,458,140,826.3 | 23,755.93 |
In all of the cases above, the in-sample data performs worse than the out-of-sample data which is no surprise as we would expect the training set to be a bit worse than the test data. For the average squared error (MSE) the in-sample data was on average approx. 75.7% of the full data MSE. Additionally, the average absolute error (MAE) on the in-sample data was on average 96% of the out-of-sample data.
Additionally, evaluating the Average Mean Square error the lowest in-sample model was the Cp model whereas the best MSE out of sample was the Forward model. With the Average Absolute Error, the best in-model was again the Cp model but the best out-of-sample was again the Forward model.
Step 10: Operational Validation
To change the format of “Prediction_Grade” in SAS we add in a format command as follows:
if yhat >= (SalePrice *0.9) and yhat <= (SalePrice *1.1) then
Prediction_Grade=”Grade 1″;
FORMAT Prediction_Grade $7.;
We can verify that it works in the output data variable overview:
From a business perspective, the models from both the training and test set predict house sale price within various levels of frequency and error as follows:
Train = 0 (training sample – 70%) training set | Train = 1 (All data – 100%) test set | ||||||||||||||||||||||||||||||||||||||||
|
|
Hence, based on the training set, 430 or 48.26% house sale price we are able to predict within plus/minus 10% error, 162 or 18.18% of the houses sale price we are able to predict between, but not equal to, 10% – 15% error and 299 houses or 33.56% of housing sale prices we are able to predict with more than 15% error.
Similarly, based on the test set, 1,000 or 49.04% house sale price we are able to predict within plus/minus 10% error, 335 or 16.43% of the houses sale price we are able to predict between 10%-15% error and 704 houses or 34.53% of housing sale prices we are able to predict with more than 15% error.
Thus, depending on the expected deliverables of the model, we will be able to determine if the model is complete or whether further reiterations are required.
Step 11: Final Model
Based on step 9, the Forward model is the best model. However, we did incorporate categorical variables and thus, need to ensure all categorical variables including the RoofStyleCategory# are added back into the model and then applied to the test data. The results are as follows:
|
|
Parameter Estimates | ||||||
Variable | DF | Parameter Estimate |
Standard Error |
t Value | Pr > |t| | Variance Inflation |
Intercept | B | 341417 | 1039835 | 0.33 | 0.7427 | 0 |
BldgCategory1 | 1 | 22972 | 3032.18797 | 7.58 | <.0001 | 2.94959 |
BldgCategory2 | 0 | 0 | . | . | . | . |
BldgCategory3 | 1 | -3971.02347 | 1621.08445 | -2.45 | 0.0144 | 1.91404 |
BldgCategory4 | 1 | 2936.69439 | 969.46734 | 3.03 | 0.0025 | 2.46653 |
BldgCategory5 | 0 | 0 | . | . | . | . |
RoofStyleCategory1 | 1 | -1623.80456 | 4884.52714 | -0.33 | 0.7396 | 8.81513 |
RoofStyleCategory2 | 1 | 5975.39267 | 2560.67281 | 2.33 | 0.0197 | 8.98265 |
LotArea | 1 | 0.27286 | 0.09438 | 2.89 | 0.0039 | 1.23545 |
YearBuilt | 1 | 445.67105 | 35.41682 | 12.58 | <.0001 | 2.56547 |
YearRemodel | 1 | 486.79230 | 43.70459 | 11.14 | <.0001 | 1.86135 |
BsmtFinSF1 | 1 | 37.29072 | 3.10915 | 11.99 | <.0001 | 4.48675 |
BsmtFinSF2 | 1 | 24.83558 | 4.86118 | 5.11 | <.0001 | 1.52529 |
BsmtUnfSF | 1 | 24.03698 | 2.89111 | 8.31 | <.0001 | 3.60834 |
FirstFlrSF | B | 40.82363 | 15.05372 | 2.71 | 0.0067 | 76.55732 |
SecondFlrSF | B | 42.24975 | 14.82388 | 2.85 | 0.0044 | 90.19210 |
GrLivArea | 1 | 25.11865 | 14.80217 | 1.70 | 0.0898 | 124.91251 |
MasVnrArea | 1 | 41.28416 | 4.49551 | 9.18 | <.0001 | 1.45294 |
BedroomAbvGr | 1 | -12235 | 1201.70744 | -10.18 | <.0001 | 2.21617 |
TotRmsAbvGrd | 1 | 3105.88521 | 865.17930 | 3.59 | 0.0003 | 4.14841 |
WoodDeckSF | 1 | 22.67392 | 5.80752 | 3.90 | <.0001 | 1.21089 |
GarageArea | 1 | 40.36999 | 4.17657 | 9.67 | <.0001 | 1.80983 |
TotalPorchSF | 1 | 23.56863 | 6.83097 | 3.45 | 0.0006 | 1.21094 |
TotalFullBath | 1 | 6582.18214 | 1754.82039 | 3.75 | 0.0002 | 3.91453 |
TotalBath | 1 | -567.87782 | 1650.52211 | -0.34 | 0.7308 | 5.38145 |
PoolArea | 1 | -54.70574 | 19.26009 | -2.84 | 0.0045 | 1.06203 |
MiscVal | 1 | -10.80135 | 1.19592 | -9.03 | <.0001 | 1.03643 |
MoSold | 1 | 36.08454 | 251.33838 | 0.14 | 0.8859 | 1.04020 |
YrSold | 1 | -1087.09519 | 516.66708 | -2.10 | 0.0355 | 1.03946 |
TotalFlrSF | 0 | 0 | . | . | . | . |
Sale Price = $341,417 + 22,972 (BldgCategory1) – 3,971.02 (BldgCategory3) + 2,936.69 (BldgCategory4) – 1,623.80 (RoofStyleCategory1) + 5,975.39 (RoofStyleCategory2) +.27286 (LotArea) + 445.67 (YearBuilt) + 486.79 (YearRemodel) + 37.29 (BsmtFinSF1) + 24.84 (BsmtFinSF2) + 24.04 (BsmtUnfSF) + 40.82 (FirstFlrSF) + 42.25 (SecondFlrSF) + 25.12 (GrLivArea) + 41.28 (MasVnrArea) – 12,235 (BedroomAbvGr) + 3,105.88 (TotRmsAbvGrd) + 22.67 (WoodDeckSF) + 40.37 (GarageArea) + 23.57 (TotalPorchSF) + 6,582.18 (TotalFullBath) – 567.88 (TotalBath) – 54.71 (PoolArea) – 10.80 (MiscVal) + 36.08 (MoSold) – 1,087.09 (YrSold)
Step 12: Reflection/Conclusion
In conclusion, the housing data from Ames, has been utilized to develop numerous regression models to determine the mean sales price of a house based on numerous variables. The variables range from highly correlated, continuous variables to categorical variables with low correlations. In this installment, we continued building the model by assigning dummy codes to categorical variables and the use of automated variable selection procedures. A validation framework was conducted and the automated variable selected variables were compared using the training sample versus the test data. Operational validation based on the values were compared and analyzed. The final and best model was also provided along with the results.
Understanding the goals of the model are key in determining if they should be employed and to what degree. Additionally, the use of these techniques is more of an art than a science. If the model already meets the needs of its purpose than perhaps these methods are not required. However, if the model is not appropriate than utilizing these techniques to transform the model to become more linear is appropriate.
Employing a process to help determine which observations are outliers and requiring an evaluation can to see if it is valid or if it is an error. If an observation is an error, then it should be removed. However, removing legitimate observations will distort the model and should not be removed. If done correctly, the transformation and deletion of outliers will benefit the model by creating a better, more appropriate model that is more linear. However, it does require additional processes and a certain finesse to determine the appropriate thresholds values. As easily observed from the above discussion, the choice of thresholds can drastically impact the number of observations it impacts and needs to be carefully selected and employed.
As reflected in this iteration in the development of the regression model based on the Ames housing data, we came up with the best model however, additional steps can still be taken. Outliers still exist and need to be further investigated to determine whether the observations are legitimate or not and should be discarded. This process while potentially time consuming, could provide a lot of insight that may or may not be very valuable. This additional insight may assist in uncovering other variables that may reduce the 20% of variability that is still not accounted for. Violation of the assumptions of constant variance and normality as seen in the plots still exists. Additionally, transformation on the data can be utilized but for simplicity and ease of understanding of potential home buyers has been left in its original dollar values. However, transforming variables will assist in an increase in the goodness of fit and potentially creating a better fitting model. Thus, if the intended audience is more sophisticated and well versed in interpreting log values, than I would recommend transforming the variables. If not, and the current model meets the desired audience and requirements than an outlier investigation as detailed above should be undertaken.