# Regression Models Using Numerous Variables

## Assessing Regression Models Using Numerous Variables

This report builds regression models for house sale price on the Ames, Iowa housing data set using numerous variables.  Some of the variables are highly correlated, continuous variables; at the other end of the continuum are categorical variables with low correlation to sale price.  Each model is assessed, and the statistical and ODS output is reviewed and interpreted.  Additionally, linear regression models are fit with two and three variables, along with an evaluation of the impact of the added variables and how much value they add in predicting the sale price of a house.  Finally, an assessment of whether the model is appropriately specified and the next steps are provided.

## Part A – Step 1

The variable I chose is one that was created in the first installment – Total Floor Square Footage (TotalFlrSF).

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 11406 | 3242.59761 | 3.52 | 0.0004 |
| TotalFlrSF | 1 | 113.30303 | 2.05569 | 55.12 | <.0001 |

### Equation

The general equation for a simple linear regression is y = β0 + β1x + ε

Based on the above parameter estimates, the equation is:

• Sale Price = \$11,406 + \$113.30303 x TotalFlrSF

Thus, for each one-unit (square foot) increase in TotalFlrSF, the sale price increases by \$113.30.  The intercept implies that even a house with zero square footage would have a SalePrice of \$11,406.  As we are evaluating houses, this is not logical, although it could apply if a house was not livable and is considered a ‘tear down’ where someone would spend the time and money required to build a new house.  However, this would be outside the norm and would require a different equation to determine these types of sale prices.
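
As a minimal sketch of how the fitted equation above is used, the snippet below plugs a few square footages into it. The coefficients are copied from the Parameter Estimates table; the function name and the sample square footages are illustrative assumptions, not part of the original analysis.

```python
# Coefficients copied from the Parameter Estimates table for TotalFlrSF.
INTERCEPT = 11406.0
SLOPE_PER_SQFT = 113.30303


def predict_sale_price(total_flr_sf: float) -> float:
    """Fitted SalePrice (in dollars) for a given TotalFlrSF."""
    return INTERCEPT + SLOPE_PER_SQFT * total_flr_sf


# Hypothetical square footages, chosen only for illustration.
for sqft in (0, 1000, 1500, 2500):
    print(f"{sqft:>5} sq ft -> ${predict_sale_price(sqft):,.2f}")
```

The zero-square-foot row reproduces the intercept, which is the case the paragraph above questions.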

The automatically generated ODS output:

The ODS output produced by SAS shows a cluster of points in the residual versus predicted value plot, which indicates an issue because the residuals should be scattered randomly without any type of pattern.  The Q-Q plot titled “Quantile” is abnormal: it deviates from the reference line and has heavy tails, which means there is a larger probability of getting very large values than under a normal distribution.  The Predicted Value plot again shows that there are issues, as the points are not on the line.

### Assess the Goodness-of-Fit of this model

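Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 1 | 9.518383E12 | 9.518383E12 | 3037.86 | <.0001 |
| Error | 2928 | 9.174155E12 | 3133249517 | | |
| Corrected Total | 2929 | 1.869254E13 | | | |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Root MSE | 55975 | R-Square | 0.5092 |
| Dependent Mean | 180796 | Adj R-Sq | 0.5090 |
| Coeff Var | 30.9605 | | |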

The p-value is very small (< .0001), which means we can reject the null hypothesis that the slope coefficient is equal to zero.  Thus, TotalFlrSF is a significant predictor.  Additionally, the statistical summaries show an R-Square of .5092 and an Adjusted R-Square of .5090.  R-Square represents the proportion of variability in the dependent variable that can be explained by the regression model, so roughly half of the variability in SalePrice is explained, which is good for a single predictor.
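
The summary statistics can be recomputed from the Analysis of Variance table as a consistency check. The sketch below uses only the sums of squares and degrees of freedom shown above; the variable names are illustrative, and small differences from the SAS output are expected because the table values are rounded.

```python
import math

# Values copied from the Analysis of Variance table for TotalFlrSF.
ss_model = 9.518383e12
ss_error = 9.174155e12
ss_total = 1.869254e13
df_model, df_error, df_total = 1, 2928, 2929

ms_model = ss_model / df_model
ms_error = ss_error / df_error

r_square = ss_model / ss_total                       # share of variability explained
adj_r_square = 1 - ms_error / (ss_total / df_total)  # adjusted for degrees of freedom
f_value = ms_model / ms_error
root_mse = math.sqrt(ms_error)

print(f"R-Square     {r_square:.4f}")
print(f"Adj R-Sq     {adj_r_square:.4f}")
print(f"F Value      {f_value:.2f}")
print(f"Root MSE     {root_mse:.0f}")
```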

• Fitted regression model over the scatter-plot: The Fit Plot shows all of the data along with the 95% confidence and prediction limits.  We can clearly see how much of the data falls outside the confidence limits.
• Assessment of the normality of the residuals using a Q-Q plot and/or histogram of the standardized residuals:  The residual histogram shows a narrow spread in the residuals but with a high peak.  The Q-Q plot helps detect departures from normality: if the residuals are normal, the points will cluster tightly around the reference line.  As we can see, the points deviate from the line and the distribution has heavy tails, which means there is a larger probability of getting very large values.  Thus, the normality assumption does not hold in this model.
• Assessment of homoscedasticity by plotting the predictor variable against the standardized residuals:  The residual plots help diagnose violations of the normality and homoscedasticity assumptions.  The points on the Predicted Value versus Sale Price plot do not follow the 45-degree line and show a definite pattern, which indicates that the model is not appropriate.  Homoscedasticity, also referred to as the assumption of constant error variance, is violated when the residuals flare out as x increases instead of forming a band of constant width.  Thus, based on the above, this model does not appear to be homoscedastic.
• Check for potential outliers using Cook’s Distance:  Cook’s Distance measures the influence of each data point on the fitted values; the lowest possible value is zero, and larger values indicate more influential points.  As we can see, only two points exceed .2, and both are less than one, the acceptable limit.  These two points could be outliers that are influencing the results, so an investigation into them should be conducted.  While the plot does not list the actual Cook’s D scores, the values show how much the vector of fitted values moves when the observation is deleted.
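
The Cook’s Distance check in the last bullet can be sketched for a simple linear regression as follows. This is not how SAS was invoked for the report; it is a standard-library illustration of the usual formula, D = e² / (p · MSE) · h / (1 − h)², where e is the residual, h the leverage and p the number of estimated parameters. The function name and the six data points are made up for the example.

```python
def cooks_distance(x, y):
    """Cook's D for each observation in a simple linear regression of y on x."""
    n = len(x)
    p = 2  # parameters estimated: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    leverage = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]


# Made-up data: the last point is far from the others in both x and y.
x = [1, 2, 3, 4, 5, 10]
y = [1, 2, 3, 4, 5, 30]
distances = cooks_distance(x, y)
for xi, d in zip(x, distances):
    print(f"x={xi:>2}  Cook's D={d:.3f}")
```

In this toy data the last observation has by far the largest Cook’s D, which is the kind of point the bullet suggests investigating.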

The TotalFlrSF model has normality issues based on the above analysis.  The ODS plots show clusters of points and a heavy-tailed Q-Q plot, and the points in the Predicted Value versus Sale Price plot are not on the line.  Additionally, the residual plot, which we would expect to be random, does not appear to be, and the Fit Plot has more of a sideways funnel pattern.  However, the p-value is < .0001, which means the variable is significant and that we can reject the null hypothesis.  Additionally, the Adjusted R-Square is .509, meaning about half of the variability is explained.  The matrix of graphs and plots provides a high-level overview of the relationship between total square footage in a house and sale price.

## Part A – Step 2

### The Best X to Predict Y in the Regression Model

The ‘best’ simple linear regression model to predict Sales Price using the R-square option reflects the following output:

| Variables in Model | R-Square | Adjusted R-Square | C(p) |
|---|---|---|---|
| TotalFlrSF | 0.5118 | 0.5116 | 3161.986 |
| GrLivArea | 0.5006 | 0.5004 | 3289.886 |
| GarageArea | 0.4235 | 0.4233 | 4169.764 |
| TotalBsmtSF | 0.421 | 0.4207 | 4199.299 |
| FirstFlrSF | 0.4115 | 0.4112 | 4307.718 |

Coincidentally, Total Floor Square Footage, which was discussed in Part A Step 1, has the highest R-Square and Adjusted R-Square.  So as not to repeat the same output, I’ll use GrLivArea for the rest of this task.

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 13290 | 3269.70277 | 4.06 | <.0001 |
| GrLivArea | 1 | 111.69400 | 2.06607 | 54.06 | <.0001 |

### Equation & Interpret each coefficient:

The equation for GrLivArea is SalePrice = \$13,290 + \$111.694 x GrLivArea, which reflects that for each unit increase in GrLivArea the Sale Price increases by \$111.69.  This seems reasonable and logical.

R² is often called the proportion of variation explained by the regressor x; equivalently, 1 − R² is the proportion of variability in y remaining after x has been considered (Montgomery, D.  Introduction to Linear Regression Analysis. p. 36).  So, the variables with an R² closest to one explain most of the variability in y.

| Variables in Model | R-Square | Adjusted R-Square | C(p) | Interpretation | Overlap? |
|---|---|---|---|---|---|
| TotalFlrSF | 0.5118 | 0.5116 | 3161.986 | Makes logical sense in terms of explaining the variability in sales price | Overlap |
| GrLivArea | 0.5006 | 0.5004 | 3289.886 | Logical but overlaps with TotalFlrSF | Overlap |
| GarageArea | 0.4235 | 0.4233 | 4169.764 | Not logical on its own as a house needs to go with a garage | |
| TotalBsmtSF | 0.421 | 0.4207 | 4199.299 | Logical but overlaps with TotalFlrSF and GrLivArea | Overlap |
| FirstFlrSF | 0.4115 | 0.4112 | 4307.718 | Would have thought this to be stronger than TotalBsmtSF; overlaps with TotalFlrSF, GrLivArea, TotalBsmtSF | Overlap |
| TotalBath | 0.3868 | 0.3866 | 4589.011 | Logical as it plays an important part in the sales price of a house | Overlap |
| TotalFullBath | 0.359 | 0.3588 | 4906.859 | Logical and expected that it would have a lower R than total number of bathrooms | Overlap |
| HouseAge | 0.3213 | 0.321 | 5338.312 | Logical in that newer houses would likely have a higher sales price | Overlap |
| YearBuilt | 0.3209 | 0.3206 | 5342.262 | Expected to be very similar to HouseAge, which it is, so no surprise | Overlap |
| YearRemodel | 0.2894 | 0.2891 | 5702.33 | Logical in terms of explaining the variability in sales price; somewhat associated with HouseAge and YearBuilt | |
| MasVnrArea | 0.2843 | 0.284 | 5760.722 | Not logical as I would have thought it to have lesser impact on the sales price than some of the variables with a lesser R | |
| TotRmsAbvGrd | 0.2523 | 0.252 | 6126.552 | Logical in terms of explaining the variability in sales price | Overlap |
| BsmtFinSF1 | 0.1966 | 0.1963 | 6762.719 | Logical as I think people are more concerned about the basement square footage than whether it is a finished area or not | Overlap |
| LotFrontage | 0.127 | 0.1266 | 7557.59 | Logical as the sales price and LotFrontage would be less likely to explain the variability | |
| WoodDeckSF | 0.117 | 0.1166 | 7671.945 | Logical as it ranks higher than the amount of porch space but surprised that it is ranked higher than LotArea | |
| LotArea | 0.1027 | 0.1023 | 7835.853 | Not logical as I would have thought the LotArea would be more directly related to sales price | |
| SecondFlrSF | 0.0636 | 0.0633 | 8281.516 | Logical as likely to be similar to the first floor and is related to TotalFlrSF | Overlap |
| TotalHalfBath | 0.0544 | 0.054 | 8387.015 | Logical as sales price is likely to be more affected by TotalBath and TotalFullBath | Overlap |
| BsmtUnfSF | 0.0386 | 0.0382 | 8567.744 | Logical as I think people are more concerned about the basement square footage than whether it is a finished area or not | Overlap |
| TotalPorchSF | 0.0293 | 0.0289 | 8673.387 | Logical as it is of little importance to the sales price | |
| BedroomAbvGr | 0.0188 | 0.0184 | 8793.586 | Logical and has less impact than TotRmsAbvGrd, which makes sense | Overlap |
| PoolArea | 0.0048 | 0.0044 | 8953.456 | Logical as would expect it to have little impact on sales value | |
| LowQualFinSF | 0.0013 | 0.0009 | 8994.168 | Logical based on what I would expect people to report on this variable and other variables that may be related, i.e. sale condition | |
| MoSold | 0.0007 | 0.0003 | 9000.444 | Logical – when a house is sold should have minimal impact on the sales price | Overlap |
| YrSold | 0.0006 | 0.0001 | 9002.185 | Logical – when a house is sold should have minimal impact on the sales price | Overlap |
| MiscVal | 0.0003 | -0.0001 | 9005.503 | Logical given the few items and their value to the overall sales price | |
| BsmtFinSF2 | 0 | -0.0004 | 9008.506 | Logical given the dataset and the potential duplication and confusion on this variable | Overlap |

### In what sense is the model the ‘best’ model:

GrLivArea is the ‘best’ model in the sense that it is logically tied to the sale price of a house: people obviously plan to live in the house, so they are more likely to spend additional money for a larger living space.  It has an Adjusted R-Square of .5004, and the p-value is small, so we can reject the null hypothesis that the slope is zero.

### Anything funny about it from an interpretation standpoint?

The equation is ‘funny’ if you consider a house with zero GrLivArea, which would result in a sale price of \$13,290.  While one could perhaps justify it based on the price of the land alone, it is illogical for a house.  The increase in value of ~\$111.69 for a one-unit change is logical.  Overall, we would expect the model to perform well given its Adjusted R-Square, yet we still have the same issues and concerns based on the plots.

### Goodness-of-Fit

Goodness-of-fit statistics on the GrLivArea are as follows:

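Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 1 | 9.33763E12 | 9.33763E12 | 2922.59 | <.0001 |
| Error | 2928 | 9.354907E12 | 3194981962 | | |
| Corrected Total | 2929 | 1.869254E13 | | | |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Root MSE | 56524 | R-Square | 0.4995 |
| Dependent Mean | 180796 | Adj R-Sq | 0.4994 |
| Coeff Var | 31.2641 | | |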

The F Value of 2,922.59 is significantly greater than one, and the p-value is small (< .0001), so we can reject the null hypothesis that the slope is zero.  Based on the above, there is a linear relationship between SalePrice and GrLivArea.  The Adjusted R-Square of .4994 reflects the proportion of variability in SalePrice that is accounted for by GrLivArea.  While there is only a small difference between R-Square and Adjusted R-Square here, Adjusted R-Square does penalize us for adding variables: it increases only if an added variable reduces the residual mean square.
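
As a worked check, the Adjusted R-Square in the table can be reproduced from the Analysis of Variance quantities, where each sum of squares is divided by its degrees of freedom. Here n = 2930 is inferred from the Corrected Total DF of 2929 and p = 1 is the number of predictors; the symbol names are notation chosen for this illustration.

$$
R^2_{adj} = 1 - \frac{SS_{Error}/(n-p-1)}{SS_{Total}/(n-1)} = 1 - \frac{3194981962}{1.869254\times10^{13}/2929} \approx 1 - 0.5006 = 0.4994
$$

The number of predictors p enters through the denominator n − p − 1 of the residual mean square, so an extra variable raises the Adjusted R-Square only if it lowers that mean square.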

The ODS output is as follows:

The Predicted Value versus Residual plot has a definite pattern and seems to fan out, showing that the variance is increasing.  The ODS Quantile (Q-Q) plot indicates that the residuals are not normal, as the points deviate from the line (see red area) and the distribution is heavy tailed.  Similar to the Cook’s D plot for TotalFlrSF, we have three influential points, two of which exceed 0.2.  It would be prudent to look into these three points to ensure they are valid.  Additionally, the residuals for Sale Price are not random as we would expect and again show a bit of a pattern.  The Fit Plot also has a distinct pattern to it.  Thus, we have normality issues.

The residuals plotted against GrLivArea are again similar to those for TotalFlrSF: they do not form a band of constant width but flare out a bit as GrLivArea increases.  Thus, this model does not appear to be homoscedastic.

GrLivArea has the second-highest Adjusted R-Square of all of the continuous variables (after TotalFlrSF), accounting for .4994 of the variability and leaving another .5006 unexplained.  Perhaps one of the categorical variables will assist in reducing the unexplained portion.  Again, we have conflicting information between the Adjusted R-Square and the ODS plots: GrLivArea has a small p-value, but we have normality issues based on the ODS plots.

### The Best X to Predict Y Conclusions

GrLivArea has a small p-value, which means that the variable is significant.  Additionally, it has a good Adjusted R-Square value; however, the ODS output reflects that we have normality issues.  We have patterns in the Predicted Value versus Residual plot where there shouldn’t be any, the Q-Q plot deviates from the line, and the points in the Predicted Value versus Sale Price plot are also not on the line, which reflects that the model is not ideal.  The residual plot also has a pattern, and the Fit Plot has a fanning-out pattern.  Thus, the residuals of the GrLivArea model are not normal.

### Categorical value in a Regression Model:

Based on week one, Exterior1 is a categorical variable with the following histogram:

### Equation

Evaluating a few of the variables we get the following result:

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| | Variance Inflation |
|---|---|---|---|---|---|---|
| Intercept | 1 | 150475 | 4457.27517 | 33.76 | <.0001 | 0 |
| Exterior1Category | 1 | 2575.12360 | 357.88376 | 7.20 | <.0001 | 1.00000 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Root MSE | 79203 | R-Square | 0.0174 |
| Dependent Mean | 180796 | Adj R-Sq | 0.0170 |
| Coeff Var | 43.8146 | | |

• SalePrice = \$150,475 + \$2,575.12 x Exterior1

### Interpret Coefficient

Siding is only one of sixteen categories within the Exterior1 variable.  Because the categories are coded as numbers, the coefficient says that a one-unit change in the category code is associated with an increase in SalePrice of \$2,575.  Taken literally this is illogical, as a house cannot have more than one unit of siding; it simply means that moving from one exterior category to the next is associated, on average, with a \$2,575 higher sale price.  The p-value is less than .0001, which means we can reject the null hypothesis and conclude that there is a linear relationship between the Exterior1 category and SalePrice.

Looking at the R-Square, we see that Exterior1 explains only 1.74% of the variability in SalePrice (1.70% by the Adjusted R-Square), which is low.

### Anything funny about the coefficient interpretation

I can’t see anything funny about the coefficient other than its low value in explaining variability.  The variance inflation is also one, as expected with a single predictor.

### Generate ODS output and assess the Goodness-of-fit

Reviewing the ODS plots, we don’t seem to have a pattern in the Predicted Value versus Residual plot, unlike in the GrLivArea ODS plot.  However, we do have another heavy-tailed Q-Q plot that deviates from the line, a cluster in the predicted values, and a lot more spikes in Cook’s D, reflecting that more observations are influencing the results, although all are still less than one on the axis scale.  Additionally, we see skewed residuals.

Reviewing the plots and various statistics, the residual plot can’t fan out due to the type of variable, but there does not appear to be any linear relationship across the different Exterior1 categories.  The Fit Plot shows a slight positive correlation with a bit of a ‘u’ shape in the data.  It appears to be bimodal, which is typical of a categorical variable.  Overall, Exterior1 is not normally distributed and is not a good predictor of SalePrice, as reflected in the plots, statistics and reasons provided.

### Does the predicted model go through the mean value of Y for each category group?

The Fit Plot line does appear to go through the mean value of each category, and the plot shows the distribution of values for each category.  The confidence limits, as mentioned above, reflect a slight correlation, which shows up in the slight slope of the line.  The largest amount of variability in SalePrice occurs in Exterior1 category 16, which is wood siding.  Having the line go through the mean doesn’t really provide any value, as the categories are simply listed in order of their assigned code.

### Is this good or bad / why or why not?

I don’t think having the line adds any value and in fact, causes confusion and possible incorrect conclusions that there is some type of linear relationship when in fact there isn’t one.  Additionally, it doesn’t provide any insight on which categories perform better than others or the volume of data in each.
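
To illustrate why a straight line over numeric category codes can mislead, the sketch below fits a line to made-up data with three category codes and compares the fitted values with the per-category means. The prices are hypothetical and are not taken from the Ames data; the helper names are illustrative.

```python
from collections import defaultdict

# Hypothetical (category code, sale price) pairs; not taken from the Ames data.
observations = [
    (1, 100_000), (1, 120_000),
    (2, 200_000), (2, 220_000),
    (3, 130_000), (3, 150_000),
]


def fit_line(points):
    """Least-squares intercept and slope for y on the numeric category code."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def category_means(points):
    """Mean sale price within each category code."""
    groups = defaultdict(list)
    for code, price in points:
        groups[code].append(price)
    return {code: sum(prices) / len(prices) for code, prices in groups.items()}


intercept, slope = fit_line(observations)
means = category_means(observations)
for code in sorted(means):
    fitted = intercept + slope * code
    print(f"category {code}: mean={means[code]:,.0f}  line={fitted:,.0f}")
```

In this toy example the line suggests a steady increase across codes even though the middle category has the highest mean, which is the kind of incorrect conclusion described above. Comparing category means directly avoids imposing an order on the codes.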

### Categorical Value Conclusion

While the Exterior1 category analysis is interesting to conduct, its ability to predict SalePrice is low, with an Adjusted R-Square of 0.0170.  Additionally, the plots and charts provided insight into the Cook’s D observations, with a few influential peaks, and showed a heavy-tailed Q-Q plot.  No surprise, there does not appear to be a linear relationship between Exterior1 and SalePrice.  Finally, the Fit Plot would not be my preferred method of assessing categorical data, as it adds very little value and can be misleading.

## Part A – Task 4 Analysis

Of the three models described above, the best model is TotalFlrSF, as it has the highest Adjusted R-Square at .509.  Its residual plots, however, show the same flaring pattern noted in Step 1, so the model cannot be considered homoscedastic.  Cook’s D has only a few data points with a strong influence.  These points, along with the outliers, need some further research, but I do not expect them to change the outcomes of the above analysis.  Additionally, the F value is significantly greater than one.  Again, we have conflicting information between the plots and the R-Square values, which needs to be further investigated.

## Part B – Task 5 Regression Model Using Two Variables

### Variables:

The two variables are TotalFlrSF and GrLivArea.

### Equation:

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| | Variance Inflation |
|---|---|---|---|---|---|---|
| Intercept | 1 | 11688 | 3238.62266 | 3.61 | 0.0003 | 0 |
| TotalFlrSF | 1 | 185.03078 | 22.40381 | 8.26 | <.0001 | 119.15477 |
| GrLivArea | 1 | -71.69170 | 22.29838 | -3.22 | 0.0013 | 119.15477 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Root MSE | 55886 | R-Square | 0.5109 |
| Dependent Mean | 180796 | Adj R-Sq | 0.5106 |
| Coeff Var | 30.9113 | | |

• Sale Price = \$11,688 + \$185.03078 x TotalFlrSF – \$71.69170 x GrLivArea

For each unit increase in TotalFlrSF, SalePrice increases by \$185.03, while each unit increase in GrLivArea decreases SalePrice by \$71.69.  The equation differs from the simple linear regression models above in the negative coefficient on GrLivArea.  This is not intuitive, but because of the unusual choice of variables and their obvious overlap, it makes logical sense.  Additionally, the Adjusted R-Square is .5106, again reflecting the amount of variability that is accounted for by these two variables.  However, this Adjusted R-Square has improved only slightly over the simple linear regression model using TotalFlrSF, which had an Adjusted R-Square of .5090.  Again, this is not unexpected due to the overlap in the variables.

By adding one unit of TotalFlrSF with all other factors held constant, we can see how SalePrice changes.  However, one unit of TotalFlrSF and one unit of GrLivArea are not the same thing, and we need to ensure that the two are not confused.
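
The difference between changing one variable and changing both can be made concrete with the fitted two-variable equation. In the sketch below the coefficients come from the Parameter Estimates table, while the 1,500 square foot starting point and the function name are illustrative assumptions.

```python
# Coefficients copied from the two-variable Parameter Estimates table.
B0 = 11688.0
B_TOTAL_FLR_SF = 185.03078
B_GR_LIV_AREA = -71.69170


def predict(total_flr_sf: float, gr_liv_area: float) -> float:
    """Fitted SalePrice for the TotalFlrSF + GrLivArea model."""
    return B0 + B_TOTAL_FLR_SF * total_flr_sf + B_GR_LIV_AREA * gr_liv_area


# Hypothetical house in which both measures equal 1,500 square feet.
base = predict(1500, 1500)

# One more unit of TotalFlrSF with GrLivArea held constant.
only_total = predict(1501, 1500) - base

# One more unit of both, as happens when the two measures overlap.
both = predict(1501, 1501) - base

print(f"TotalFlrSF only: {only_total:.2f}")
print(f"Both together:   {both:.2f}")
```

When both measures rise together the net change is about \$113.34, close to the \$113.30 slope of the TotalFlrSF-only model, which is consistent with the two variables carrying largely the same information.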

### Goodness-of-Fit Regression Models – ODS Multiple Variables – TotalFlrSF and GrLivArea

Based on the above, we have small p-values, reflecting that each variable is significant, and thus we can reject the null hypothesis.  However, unlike the prior models, we have a Variance Inflation of 119.15477, which is significantly outside of the acceptable range of 1 to 5.  Thus, we have multicollinearity issues.  We can see a lot of similarities with TotalFlrSF: the fanning-out pattern in the Predicted Value versus Residual plot, a Q-Q plot that again deviates from the line, and a Cook’s D plot with only two influential points exceeding 0.2, both less than one.  The Adjusted R-Square of 0.5106 is the largest of the models analyzed so far.  Homoscedasticity requires the residuals to have constant variance rather than flaring out as x increases; again, there is a clear clustered pattern here.  Comparing the residual plots for TotalFlrSF and GrLivArea, they appear, for the most part, to be identical.
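
To put the Variance Inflation value in perspective: in a model with exactly two predictors, VIF = 1 / (1 − r²), where r is the correlation between the two predictors, so the reported value implies how strongly they are correlated. The sketch below inverts that relationship; the function name is illustrative, and the shortcut applies only to the two-predictor case.

```python
import math


def implied_correlation(vif: float) -> float:
    """|r| between two predictors implied by their shared VIF.

    Valid only for a model with exactly two predictors, where
    VIF = 1 / (1 - r^2).
    """
    r_squared = 1.0 - 1.0 / vif
    return math.sqrt(r_squared)


vif = 119.15477  # Variance Inflation reported for TotalFlrSF and GrLivArea
r = implied_correlation(vif)
print(f"Implied |correlation| between the predictors: {r:.4f}")
```

A Variance Inflation of about 119 corresponds to a correlation of roughly 0.996 between TotalFlrSF and GrLivArea, which is why the individual coefficients are unstable.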

### Better Fit

As mentioned earlier, we have accounted for only .5106 of the variability, leaving another .4894 unexplained.  While the explained portion has increased from the prior models, it is only a small increase.  Thus, a significant amount of additional work is required to reduce the unexplained portion.

### Regression Model Using Two Variables Conclusion

Combining two continuous explanatory variables, TotalFlrSF and GrLivArea, provided similar results to the simple linear regression model using TotalFlrSF.  This is not surprising, as GrLivArea is encapsulated in TotalFlrSF; as such, the equation has a negative component to it with only a small increase in the Adjusted R-Square value.  The p-values reflect that each variable is significant, and thus we can reject the null hypothesis.  However, the variance inflation exceeds our normal range, reflecting that we have multicollinearity issues, which isn’t a surprise.  The plots for the two-variable model are again very similar to those for TotalFlrSF, with a fanning-out pattern in the Predicted Value versus Residual plot.  Thus, we again have normality issues.

## Part B – Step 6 Using Three Variables

### Variables:

The three variables are TotalFlrSF, GrLivArea and MiscVal.

### Equation:

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| | Variance Inflation |
|---|---|---|---|---|---|---|
| Intercept | 1 | 11105 | 3227.36083 | 3.44 | 0.0006 | 0 |
| TotalFlrSF | 1 | 186.43890 | 22.31325 | 8.36 | <.0001 | 119.17355 |
| GrLivArea | 1 | -72.39791 | 22.20695 | -3.26 | 0.0011 | 119.15954 |
| MiscVal | 1 | -9.14957 | 1.82008 | -5.03 | <.0001 | 1.00470 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Root MSE | 55886 | R-Square | 0.5151 |
| Dependent Mean | 180796 | Adj R-Sq | 0.5146 |
| Coeff Var | 30.7839 | | |

• Sales Price = \$11,105 + \$186.4389 x TotalFlrSF – \$72.39791 x GrLivArea – \$9.14957 x MiscVal

For each unit increase in TotalFlrSF, SalePrice increases by \$186.44, a small increase of \$1.41 over the equation in part 5.  Similarly, each unit of GrLivArea decreases SalePrice by \$72.40 (a further decline from \$71.69 in part 5), and there is now another decrease of \$9.15 per unit of MiscVal.  The negative effect of MiscVal on SalePrice is not as intuitive as those of TotalFlrSF and GrLivArea; however, due to the low volume and value of the MiscVal data points, it is understandable that the variable would have little impact on SalePrice.

[Figure: Regression Models – ODS Multiple Variables TotalFlrSF, GrLivArea and MiscVal]

### Goodness-of-Fit

We can see a lot of similarities with part 5: the p-values are small, reflecting that each variable is significant, and thus we reject the null hypothesis.  The Variance Inflation is 119.17355 for TotalFlrSF and 119.15954 for GrLivArea, both of which exceed the upper boundary of our 1-5 range, while MiscVal is close to the bottom of our acceptable range with a value of 1.00470.  We again see the Predicted Value versus Residual plot fanning out, which reflects that the variance is increasing.  The axis scale of the Cook’s D plot has changed, but the plot still has only a few influential points.  The Q-Q plot again deviates from the line and is heavy tailed.  The Predicted Value versus Sale Price plot shows a cluster of values.  Thus, we still have the same normality issues.

Adjusted R-Square increased from .5106 to .5146, again reflecting the amount of variability that is accounted for by these three variables.  However, this is a very small increase, 0.0040 over the Adjusted R-Square in part 5, and represents the value of adding the MiscVal variable.  The Adjusted R-Square has improved only slightly over both the simple linear regression model using TotalFlrSF and the two-variable model with TotalFlrSF and GrLivArea.

[Figure: Regression Models – Residuals Multiple Variables TotalFlrSF, GrLivArea and MiscVal]

The residual plots for each variable provide more of a view of how the residuals change with each variable than the results in part 5, but they still do not appear to be random, with distinct patterns for each variable.  However, we can see that the MiscVal residuals do not show any easy-to-see linear relationship and that the variable has only a few values.  Thus, we still have normality issues.

### Changes & Better Fit

Overall, the additional variable, MiscVal, has not made any real impact in predicting SalePrice, which is no surprise based on its low correlation with SalePrice.  While our Adjusted R-Square did increase, the increase of .0040 is small, showing us that adding variables does not necessarily improve the results or assist us in creating a better-fitting equation and explaining the variability.  Adjusted R-Square carries increased importance as a criterion for comparing the models, primarily because we do not want to make incorrect decisions by simply adding variables, but the plots and charts are also helpful in understanding the relationships and how they impact the model.
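
The comparison across the three models can be reproduced from the reported R-Square values with the usual Adjusted R-Square formula. The sketch below assumes n = 2930 observations, inferred from the Corrected Total DF of 2929 in the Analysis of Variance tables; the function and variable names are illustrative.

```python
N_OBS = 2930  # assumed from Corrected Total DF (2929) + 1


def adjusted_r_square(r_square: float, n_predictors: int, n_obs: int = N_OBS) -> float:
    """Adjusted R-Square: R-Square penalized for the number of predictors."""
    return 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_predictors - 1)


# (model, reported R-Square, number of predictors)
models = [
    ("TotalFlrSF", 0.5092, 1),
    ("TotalFlrSF + GrLivArea", 0.5109, 2),
    ("TotalFlrSF + GrLivArea + MiscVal", 0.5151, 3),
]

adjusted = [adjusted_r_square(r2, p) for _, r2, p in models]
for (name, r2, p), adj in zip(models, adjusted):
    print(f"{name:<34} R-Square={r2:.4f}  Adj R-Sq={adj:.4f}")
```

With nearly 3,000 observations the penalty for each extra predictor is tiny, so the small gains here come almost entirely from the small gains in R-Square itself.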

### Using Three Variables Conclusion

Adding variables does not necessarily improve results.  By adding a variable with low correlation, MiscVal, the other coefficients within the equation received minor adjustments.  Because each of the variables has a small p-value, we can reject the null hypothesis.  Similar to the prior analysis, there are not a lot of changes, with the exception being the variance inflation.  The review of the plots reflects similar patterns, with the exception of the residual plot, where MiscVal has a different pattern but still not the pattern of randomness we hope for.  However, the analysis remained relatively the same as with only two variables in part 5.  Additionally, adding the MiscVal variable had only a minor effect in explaining the variability and did not have any significant impact on SalePrice.

## Conclusion

By observing the various models and the effects of adding and reviewing various variables, we can easily grasp the changes through the various plots, equations and summary statistics.  Continuous variables have a much different look and produce different plots than categorical variables.  Additionally, evaluating the Adjusted R-Square and F Value provides insight into how the model is performing even when the impact is difficult to ascertain through the plots.

The model may not be appropriately specified, as all variables are treated as equal, which may not be the case.  Additionally, we could be adding new variables without considering their association with other variables, as shown with TotalFlrSF and GrLivArea: removing GrLivArea would have a significant effect on the TotalFlrSF coefficient.

Next steps in the modeling process would be to investigate the influential points flagged by Cook’s D in several parts and to determine whether the Adjusted R-Square can be improved by evaluating different variables, in order to account for more of the variability, as none of the models has exceeded 51.46%.  Finally, transforming the data so that the conflicts between the R-Square values and the ODS plots are reduced would be really helpful in creating a useful model we are confident in.