Regression Models Using Numerous Variables
Assessing Regression Models Using Numerous Variables
This report builds regression models for house sale price using the Ames, Iowa housing data set with numerous variables, ranging from highly correlated continuous variables at one end of the continuum to weakly correlated categorical variables at the other. Each model will be assessed, and the statistical and ODS output will be reviewed and interpreted. Additionally, linear regression models based on two and three variables will be fit, along with an evaluation of the impact of these new variables and how they add value to predicting the sale price of a house. Finally, an assessment of whether the model is appropriately specified and the next steps will be provided.
Part A – Step 1
The variable I chose is one that was created in the first installment – Total Floor Square Footage (TotalFlrSF).
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | 1 | 11406 | 3242.59761 | 3.52 | 0.0004 |
TotalFlrSF | 1 | 113.30303 | 2.05569 | 55.12 | <.0001 |
Equation
The general simple linear regression equation is y = β0 + β1x + ε
Based on the above parameter estimates, the equation is:
- Sale Price = $11,406 + $113.30303 x TotalFlrSF
Thus, for each one-unit increase in TotalFlrSF, the sale price increases by $113.30. This assumes all values are greater than zero, but even at zero square feet the model predicts a SalePrice of $11,406. As we are evaluating houses, a zero value could only make sense if a house was not livable and is considered a ‘tear down’, where someone would spend the time and money required to build a new house. However, this would be outside the norm and would require a different equation to determine those sale prices.
Model Adequacy
The automatically generated ODS output:
The ODS output produced by SAS shows a cluster of points in the residual-versus-predicted-value plot, indicating an issue, as the residuals should be completely random without any pattern. The Q-Q plot (titled “Quantile”) is abnormal: it deviates from the reference line and has heavy tails, which means there is a larger probability of extreme values. The predicted-value plot again shows issues, as the points do not fall on the line.
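A minimal SAS sketch of how this model and its ODS diagnostics might be produced; the data set name `housing` is an assumption, not the name actually used in the project:

```sas
/* Sketch: simple linear regression of SalePrice on TotalFlrSF with the
   standard ODS diagnostic panel, fit plot, and residual plots.
   The data set name "housing" is an assumption. */
ods graphics on;
proc reg data=housing plots(only)=(diagnostics residuals fitplot);
    model SalePrice = TotalFlrSF;
run;
quit;
ods graphics off;
```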
Assess the Goodness-of-Fit of this model
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 1 | 9.518383E12 | 9.518383E12 | 3037.86 | <.0001 |
Error | 2928 | 9.174155E12 | 3133249517 | ||
Corrected Total | 2929 | 1.869254E13 |
Root MSE | 55975 | R-Square | 0.5092 |
Dependent Mean | 180796 | Adj R-Sq | 0.5090 |
Coeff Var | 30.96054 |
The p-value is very small (< .0001), so we can reject the null hypothesis that the slope coefficient is zero; thus the predictor is significant. Additionally, the statistical summaries show an R-Square of .5092 and an adjusted R-Square of .5090, which is reasonably good for a single predictor, as it represents the proportion of variability in the dependent variable that can be explained by the regression model.
- Fitted regression model over the scatter-plot: The Fit Plot shows all of the data along with the 95% confidence and prediction limits. We can clearly see the number of observations that fall outside the confidence limits.
- Assessment of the normality of the residuals using a Q-Q plot and/or histogram of the standardized residuals: The normal quantile plot of the residuals and the residual histogram show a narrow spread in the residuals with a high peak. The Q-Q plot helps detect violations of normality; if the residuals are normal, the points will cluster tightly around the reference line. As we can see, the points deviate from the line and have heavy tails, which means there is a larger probability of extreme values. Thus, the normality assumption does not hold in this model.
- Assessment of homoscedasticity by plotting the predictor variable against the standardized residuals: The residual plots help diagnose violations of the normality and homoscedasticity assumptions. The points on the predicted-value-versus-SalePrice plot, which should lie along the 45-degree line, show a definite pattern, which indicates the model is not appropriate. Homoscedasticity, also referred to as the assumption of constant error variance, is violated when the residual plot shows a trend or flares out as x increases. Based on the above, this model does not appear to be homoscedastic.
- Check for potential outliers using Cook’s Distance: Cook’s Distance measures the effect each data point has on the fitted values, with the lowest value being zero and larger values indicating greater influence. As we can see, only two points exceed 0.2; while they remain below the common cutoff of one, an investigation into these two observations should be conducted, as they could be outliers that influence the results. The values show how the vector of fitted values moves when an observation is deleted (a sketch for extracting these points follows below).
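A minimal sketch of how the influential observations might be pulled out for review, using the 0.2 threshold discussed above (4/n is another common rule of thumb); the data set name `housing` and output name `reg_diag` are assumptions:

```sas
/* Sketch: save Cook's D for every observation and list the points above
   the 0.2 threshold discussed above. "housing" and "reg_diag" are
   assumed data set names. */
proc reg data=housing noprint;
    model SalePrice = TotalFlrSF;
    output out=reg_diag p=predicted r=residual cookd=cooks_d;
run;
quit;

proc print data=reg_diag;
    where cooks_d > 0.2;
    var SalePrice TotalFlrSF predicted residual cooks_d;
run;
```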
Model Adequacy Conclusions
The variable TotalFlrSF has normality issues based on the above analysis. The ODS plots reflect clustered patterns, a heavy-tailed Q-Q plot, and predicted values versus SalePrice that do not fall on the line. Additionally, the residual plot does not appear to be completely random, and the Fit Plot, which we would expect to show random scatter, has more of a sideways funnel pattern. However, the p-value is < .0001, which means the predictor is significant and we can reject the null hypothesis. Additionally, the adjusted R-Square of .509 is moderate. The matrix of graphs and plots provides a high-level overview of the relationship between total square footage in a house and sale price.
Part A – Step 2
The Best X to Predict Y in the Regression Model
The ‘best’ simple linear regression model to predict Sales Price using the R-square option reflects the following output:
Variables in Model | R-Square | Adjusted R-Square | C(p) |
TotalFlrSF | 0.5118 | 0.5116 | 3161.986 |
GrLivArea | 0.5006 | 0.5004 | 3289.886 |
GarageArea | 0.4235 | 0.4233 | 4169.764 |
TotalBsmtSF | 0.421 | 0.4207 | 4199.299 |
FirstFlrSF | 0.4115 | 0.4112 | 4307.718 |
Ironically, Total Floor Square Footage has the highest R-Square and adjusted R-Square values, and it was already discussed in Part A Step 1. So as not to repeat the same output, I’ll use GrLivArea for the rest of this task.
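A minimal sketch of how such a ranking might be produced with the R-square selection option; only a few of the continuous variables are listed, and the data set name `housing` is an assumption:

```sas
/* Sketch: rank candidate single predictors of SalePrice by R-square,
   adjusted R-square, and Mallows' C(p). STOP=1 restricts the listing to
   one-variable models. "housing" is an assumed data set name and the
   variable list is abbreviated for illustration. */
proc reg data=housing;
    model SalePrice = TotalFlrSF GrLivArea GarageArea TotalBsmtSF FirstFlrSF
          / selection=rsquare adjrsq cp stop=1;
run;
quit;
```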
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | 1 | 13290 | 3269.70277 | 4.06 | <.0001 |
GrLivArea | 1 | 111.69400 | 2.06607 | 54.06 | <.0001 |
Equation & Interpret each coefficient:
The equation for GrLivArea is SalePrice = $13,290 + $111.694 x GrLivArea, which reflects that for each unit increase in GrLivArea the SalePrice increases by $111.69. This seems reasonable and logical.
R2 is often called the proportion of variation explained by the regressor x, with the residual variation being what remains after x has been considered (Montgomery, D. Introduction to Linear Regression Analysis. p. 36). So, the variables with R2 closest to one explain most of the variability in y.
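For reference, with n observations and p regressors in the model, the two statistics reported throughout are computed as follows (standard definitions, not taken from the project output):

```latex
R^2 = 1 - \frac{SS_{Res}}{SS_{T}},
\qquad
R^2_{Adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
```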
Variables in Model | R-Square | Adjusted R-Square | C(p) | Interpretation | Overlap? |
TotalFlrSF | 0.5118 | 0.5116 | 3161.986 | Makes logical sense in terms of explaining the variability in sales price | Overlap |
GrLivArea | 0.5006 | 0.5004 | 3289.886 | Logical but overlaps with TotalFlrSF | Overlap |
GarageArea | 0.4235 | 0.4233 | 4169.764 | Not logical on its own, as a garage goes with a house | |
TotalBsmtSF | 0.421 | 0.4207 | 4199.299 | Logical but overlaps with TotalFlrSF and GrLivArea | Overlap |
FirstFlrSF | 0.4115 | 0.4112 | 4307.718 | Would have thought this to be stronger than TotalBsmtSF; Overlaps with TotalFlrSF, GrLivArea, TotalBsmtSF | Overlap |
TotalBath | 0.3868 | 0.3866 | 4589.011 | Logical as it plays an important part in the sales price of a house | Overlap |
TotalFullBath | 0.359 | 0.3588 | 4906.859 | Logical and expected that it would have a lower R than total number of bathrooms | Overlap |
HouseAge | 0.3213 | 0.321 | 5338.312 | Logical in that newer houses would likely have a higher sales price | Overlap |
YearBuilt | 0.3209 | 0.3206 | 5342.262 | Expected to be very similar to HouseAge which it is so no surprise | Overlap |
YearRemodel | 0.2894 | 0.2891 | 5702.33 | Logical in terms of explaining the variability in sales price; somewhat associated with houseage and yrbuilt | |
MasVnrArea | 0.2843 | 0.284 | 5760.722 | Not logical as I would have thought it to have lesser impact on the sales price than some of the variables with a lesser R | |
TotRmsAbvGrd | 0.2523 | 0.252 | 6126.552 | Logical in terms of explaining the variability in sales price | Overlap |
BsmtFinSF1 | 0.1966 | 0.1963 | 6762.719 | Logical as I think people are more concerned about the bsmt sq ft over being a finished area or not | Overlap |
LotFrontage | 0.127 | 0.1266 | 7557.59 | Logical as the sales price and lotfrontage would be less likely to explain the variability | |
WoodDeckSF | 0.117 | 0.1166 | 7671.945 | Logical as it ranks higher than the amount of porch space but surprised that it is ranked higher than LotArea | |
LotArea | 0.1027 | 0.1023 | 7835.853 | Not logical as I would have thought the LotArea would be more directly related to sales price | |
SecondFlrSF | 0.0636 | 0.0633 | 8281.516 | Logical as likely to be similar to the first floor and is related to TotalFlrSF | Overlap |
TotalHalfBath | 0.0544 | 0.054 | 8387.015 | Logical as sales price is likely to be more affected by TotalBath and full baths | Overlap |
BsmtUnfSF | 0.0386 | 0.0382 | 8567.744 | Logical as I think people are more concerned about the bsmt sq ft over being a finished area or not | Overlap |
TotalPorchSF | 0.0293 | 0.0289 | 8673.387 | Logical as it is of little importance to the sales price | |
BedroomAbvGr | 0.0188 | 0.0184 | 8793.586 | Logical and has less impact than TotalRmsAbvGrd which makes sense | Overlap |
PoolArea | 0.0048 | 0.0044 | 8953.456 | Logical as would expect it to have little impact on sales value | |
LowQualFinSF | 0.0013 | 0.0009 | 8994.168 | Logical based on what I would expect people to report on this variable and other variable that may be related i.e. sale condition | |
MoSold | 0.0007 | 0.0003 | 9000.444 | Logical – when a house is sold should have minimal impact on the sales price | Overlap |
YrSold | 0.0006 | 0.0001 | 9002.185 | Logical – when a house is sold should have minimal impact on the sales price | Overlap |
MiscVal | 0.0003 | -0.0001 | 9005.503 | Logical given the few items and their value to the overall sales price | |
BsmtFinSF2 | 0 | -0.0004 | 9008.506 | Logical given the dataset and the potential duplication and confusion on this variable | Overlap |
In what sense is the model the ‘best’ model:
GrLivArea is the best model in the sense that it is logically related to the sale price of a house: people obviously plan to live in the house, so they are more likely to spend additional money for a larger living space. It has an adjusted R-Square of .5004. The p-value is small, so we can reject the null hypothesis.
Anything funny about it from an interpretation standpoint?
The equation is ‘funny’ if you consider a house with zero GrLivArea, which would still result in a sale price of $13,290. While one could perhaps justify it based on the price of the land only, it is illogical. The increase in value of ~$111.70 per one-unit change, however, is logical. Overall, despite the reasonably high adjusted R-Square, we still have the same issues and concerns based on the plots.
Goodness-of-Fit
Goodness-of-fit statistics on the GrLivArea are as follows:
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 1 | 9.33763E12 | 9.33763E12 | 2922.59 | <.0001 |
Error | 2928 | 9.354907E12 | 3194981962 | ||
Corrected Total | 2929 | 1.869254E13 |
Root MSE | 56524 | R-Square | 0.4995 |
Dependent Mean | 180796 | Adj R-Sq | 0.4994 |
Coeff Var | 31.26405 |
The F Value of 2,922.59 is significantly greater than one. The p-value is small, reflecting that the predictor is significant. Based on the above, there is a linear relationship between SalePrice and GrLivArea. The adjusted R-Square of .4994 reflects the proportion of variability in SalePrice that is accounted for by GrLivArea. While there is a small difference between R-Square and adjusted R-Square, adjusted R-Square guards against simply adding variables: it only increases if the added variable reduces the residual mean square. The p-value is small, so we can reject the null hypothesis.
The ODS output is as follows:
The residual-versus-predicted-value plot has a definite pattern and seems to fan out, showing that the variance is increasing. The ODS quantile (Q-Q) plot indicates that normality does not hold, as the points deviate from the line (see red area) and the distribution is heavy tailed. Again, similar to the TotalFlrSF Cook’s D plot, we have three influential points, two of which exceed 0.2. It would be prudent to look into these three points to ensure they are valid. Additionally, the residuals for SalePrice are not random as we would expect and again show a bit of a pattern. The Fit Plot also has a distinct pattern. Thus, we have normality issues.
The residuals plotted against GrLivArea are, similar to TotalFlrSF, not constant in spread: the plot flares out a bit as GrLivArea increases. Thus, this model does not appear to be homoscedastic.
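A minimal sketch of one way to produce this residual-versus-predictor check; the data set name `housing` and output name `glivarea_resid` are assumptions:

```sas
/* Sketch: save residuals from the GrLivArea model and plot them against
   the predictor to inspect the constant-variance (homoscedasticity)
   assumption. "housing" and "glivarea_resid" are assumed names. */
proc reg data=housing noprint;
    model SalePrice = GrLivArea;
    output out=glivarea_resid r=residual;
run;
quit;

proc sgplot data=glivarea_resid;
    scatter x=GrLivArea y=residual;
    refline 0 / axis=y;   /* reference line at zero residual */
run;
```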
Comments on Adequacy
GrLivArea has one of the highest adjusted R-Square values of the continuous variables, accounting for .4994 of the variability and leaving another .5006 unexplained. Perhaps one of the categorical variables will assist in explaining part of the unexplained portion. Again, we have conflicting information between the adjusted R-Square and the ODS plots. Additionally, GrLivArea has a small p-value, but we have normality issues based on the ODS plots.
The Best X to Predict Y Conclusions
The GrLivArea variable has a small p-value, which means the variable is significant. Additionally, it has a good adjusted R-Square value; however, the ODS output reflects that we have normality issues. We have patterns in the residual-versus-predicted-value plot where there shouldn’t be any, the Q-Q plot deviates from the line, and the predicted values versus SalePrice do not fall on the line, which reflects that the model is not ideal. The residual plot also has a pattern, and the fit plot shows a fanning-out pattern. Thus, the residuals of the GrLivArea model are not normal.
Part A – Step 3
Categorical Variable in a Regression Model:
Based on week one, Exterior1 is a categorical variable with the following histogram:
Equation
Evaluating a few of the variables we get the following result:
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | 150475 | 4457.27517 | 33.76 | <.0001 | 0 |
Exterior1Category | 1 | 2575.12360 | 357.88376 | 7.20 | <.0001 | 1.00000 |
Root MSE | 79203 | R-Square | 0.0174 |
Dependent Mean | 180796 | Adj R-Sq | 0.0170 |
Coeff Var | 43.81456 |
- SalePrice = $150,475 + $2,575.12 x Exterior1Category
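A minimal sketch of how this fit might be produced; the data set name `housing` is an assumption, and note that regressing on the numeric category code treats the categories as ordered and equally spaced, which relates to the limitation discussed further below:

```sas
/* Sketch: regress SalePrice on the numeric Exterior1 category code and
   request the variance inflation factor. Treating the code as a number
   assumes the categories are ordered and equally spaced; indicator (dummy)
   coding of each category would be the more standard treatment.
   "housing" is an assumed data set name. */
proc reg data=housing;
    model SalePrice = Exterior1Category / vif;
run;
quit;
```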
Interpret Coefficient
While siding is only one of sixteen categories within the Exterior1 variable, the model says a one-unit change in the category code results in an increase in value of $2,575. Since a house cannot have ‘one more unit’ of exterior type in any meaningful sense, a literal reading is illogical; it simply means that, on average, moving one step up the coded category list is associated with a $2,575 higher sale price. The p-value is small (< .0001), which means we can reject the null hypothesis and conclude that there is a linear relationship between the Exterior1 code and SalePrice.
Looking at the R-Square, we see that Exterior1 explains only 1.74% of the variability in SalePrice (adjusted R-Square 1.70%), which is low.
Anything funny about the coefficient interpretation
I can’t see anything funny about the coefficient other than its low value in explaining variability. The variance inflation factor is also one, as expected with a single predictor.
Generate ODS output and assess the Goodness-of-fit
Reviewing the ODS plots, we don’t seem to have a pattern in the residual-versus-predicted-value plot, unlike in the GrLivArea ODS plots, but we do have another heavy-tailed Q-Q plot that deviates from the line, a cluster in the predicted values, and many more spikes in Cook’s D, reflecting that more observations have an influence on the results (though all are still less than one on the axis scale). Additionally, we see skewed residuals.
Model Adequacy
Reviewing the plots and various statistics, the residual plot can’t fan out due to the type of variable, but there does not appear to be any linear relationship between the different Exterior1 categories. The fit plot shows a slight positive correlation with a bit of a ‘u’ shape in the data. It appears to be bimodal, which is reflective of a categorical model. Overall, Exterior1 is not normally distributed and is not a good predictor of SalePrice, which is reflected in the various plots, statistics, and reasons provided.
Does the predicted model go through the mean value of Y for each category group?
The Fit Plot line does appear to go through the mean value of each category, and the plot shows the distribution of values for each category based on its variability. The confidence limits, as mentioned above, reflect a slight correlation, which appears as a slight slope in the line. The largest amount of variability in SalePrice occurs in Exterior1 category 16, which is wood siding. Having the line go through the means doesn’t really provide any value, as the categories are simply listed in order of their assigned code.
Is this good or bad / why or why not?
I don’t think the line adds any value; in fact, it causes confusion and possibly incorrect conclusions that there is some type of linear relationship when in fact there isn’t one. Additionally, it doesn’t provide any insight into which categories perform better than others or the volume of data in each.
Categorical Variable Conclusion
While the Exterior1 category analysis is interesting to conduct, its ability to predict SalePrice is low, with an adjusted R-Square of 0.0170. Additionally, the plots and charts provided further insight: a Cook’s D plot with a few influential peaks and a heavy-tailed Q-Q plot. Unsurprisingly, there does not appear to be a linear relationship between Exterior1 and SalePrice. Finally, the Fit Plot would not be my preferred method for making a categorical assessment of the data, as it adds very little value and can be misleading.
Part A – Task 4 Analysis
Of the three models described above, the best model is TotalFlrSF, as it has the highest adjusted R-Square at .509 and an F value that is significantly greater than one. Cook’s D shows only a few influential data points; these, along with the outliers, need further research, but I do not expect them to change the outcomes of the above analysis. Again, we have conflicting information between the R-Square values and the plots, which raised normality and constant-variance concerns, and this needs to be investigated further.
Part B – Task 5 Regression Model Using Two Variables
Variables:
The two variables are TotalFlrSF and GrLivArea.
Equation:
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | 11688 | 3238.62266 | 3.61 | 0.0003 | 0 |
TotalFlrSF | 1 | 185.03078 | 22.40381 | 8.26 | <.0001 | 119.15477 |
GrLivArea | 1 | -71.69170 | 22.29838 | -3.22 | 0.0013 | 119.15477 |
Root MSE | 55886 | R-Square | 0.5109 |
Dependent Mean | 180796 | Adj R-Sq | 0.5106 |
Coeff Var | 30.91129 |
- Sale Price = $11,688 + $185.03078 x TotalFlrSF – $71.69170 x GrLivArea
For each unit increase in TotalFlrSF the SalePrice increases by $185, yet the SalePrice decreases by $71.69 for each unit increase in GrLivArea. This equation differs from the simple linear regression models above in the negative effect of each unit of GrLivArea on SalePrice. This is not intuitive, but given the unusual choice of variables and the obvious overlap between them, it makes sense. Additionally, the adjusted R-Square is .5106, again reflecting the amount of variability that is accounted for using these two variables. However, this adjusted R-Square has only improved slightly from the simple linear regression model using TotalFlrSF, which had an adjusted R-Square of .5090. Again, this is not unexpected due to the overlap in the variables.
By adding one unit of TotalFlrSF with all other factors remaining constant, we can see how the SalePrice changes. However, one unit of TotalFlrSF and one unit of GrLivArea are not the same thing, and we need to ensure the two are not confused.
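A minimal sketch of how the two-variable fit with variance inflation factors might be produced; the data set name `housing` is an assumption:

```sas
/* Sketch: multiple regression of SalePrice on TotalFlrSF and GrLivArea,
   with variance inflation factors requested to surface the collinearity
   between the two overlapping predictors. "housing" is an assumed name. */
proc reg data=housing;
    model SalePrice = TotalFlrSF GrLivArea / vif;
run;
quit;
```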
Goodness-of-Fit
Based on the above, we have small p-values, reflecting that each variable is significant, and thus we can reject the null hypotheses. However, unlike the prior models, we have a variance inflation factor of 119.15477, which is far outside the acceptable range of 1 to 5. Thus, we have multicollinearity issues. We can see a lot of similarities with the TotalFlrSF model: the fanning-out pattern in the residual-versus-predicted-value plot, the Q-Q plot again deviating from the line, and Cook’s D with only two influential points exceeding 0.2 (still less than one). Again, the adjusted R-Square is the largest of all of the models analyzed so far, with a value of 0.5106. Homoscedasticity requires constant error variance; a residual plot that shows a trend or flares out as x increases indicates a violation, and there is a clear clustered pattern here. Comparing the residual plots for TotalFlrSF and GrLivArea, they appear, for the most part, identical.
Better Fit
As mentioned earlier, we have only accounted for .5106 of the variability, leaving another .4894 unexplained. While this has increased from the prior models, it is only a small increase. Thus, a significant amount of additional work is required to reduce the unexplained portion.
Regression Model Using Two Variables Conclusion
Combining the two continuous explanatory variables TotalFlrSF and GrLivArea provided similar results to the simple linear regression model using TotalFlrSF. This is not surprising, as GrLivArea is encapsulated in TotalFlrSF; as a result, the equation has a negative component and only a small increase in the adjusted R-Square value. The p-values reflect that each variable is significant, and thus we can reject the null hypotheses. However, the variance inflation factor exceeds our normal range, reflecting multicollinearity issues, which isn’t a surprise. Evaluating the plots for the two-variable model is again very similar to the analysis of TotalFlrSF, with a fanning out in the residual-versus-predicted-value plot. Thus, we again have normality issues.
Part B – Step 6 Using Three Variables
Variables:
The three variables are TotalFlrSF, GrLivArea and MiscVal.
Equation:
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| | Variance Inflation |
Intercept | 1 | 11105 | 3227.36083 | 3.44 | 0.0006 | 0 |
TotalFlrSF | 1 | 186.43890 | 22.31325 | 8.36 | <.0001 | 119.17355 |
GrLivArea | 1 | -72.39791 | 22.20695 | -3.26 | 0.0011 | 119.15954 |
MiscVal | 1 | -9.14957 | 1.82008 | -5.03 | <.0001 | 1.00470 |
Root MSE | 55886 | R-Square | 0.5151 |
Dependent Mean | 180796 | Adj R-Sq | 0.5146 |
Coeff Var | 30.78393 |
- Sales Price = $11,105 + $186.4389 x TotalFlrSF – $72.39791 x GrLivArea – $9.14957 x MiscVal
For each unit increase in TotalFlrSF the SalePrice increases by $186.44, a small increase of $1.40 over the equation in part 5; the GrLivArea coefficient is now a decrease of $72.40 (slightly larger than the $71.69 decrease in part 5); and there is now an additional decrease of $9.15 per unit of MiscVal. The negative effect of MiscVal on SalePrice is not as intuitive as the effects of TotalFlrSF and GrLivArea; however, due to the low volume and value of the MiscVal data points, it is understandable that the variable would have little impact on SalePrice.
Goodness-of-Fit
We can see a lot of similarities with part 5: the p-values are small, reflecting that each variable is significant, and thus we reject the null hypotheses. The variance inflation factor for TotalFlrSF is 119.17355 and for GrLivArea 119.159, which exceed the upper boundary of our 1-5 range, while MiscVal is close to the bottom of the acceptable range with a value of 1.00470. We again see the residual-versus-predicted-value plot fanning out, which reflects that the variance is increasing. The Cook’s D plot has a different axis scale but still has only a few influential points. The Q-Q plot again deviates from the line and is heavy tailed. The predicted-value-versus-SalePrice plot shows a cluster of values. Thus, we still have the same normality issues.
The adjusted R-Square increased from .5106 to .5146, again reflecting the amount of variability that is accounted for using these variables. However, this is a very small increase of 0.0040 over the adjusted R-Square in part 5, and it represents the value of adding the MiscVal variable. The adjusted R-Square has only improved slightly over the simple linear regression model using TotalFlrSF and the two-variable model using TotalFlrSF and GrLivArea (a small check of this calculation is sketched below).
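A minimal sketch of a hand-check of the adjusted R-square gain, using the R-Square values from the two fit-statistics tables above and n = 2930 observations (implied by the error DF of 2928 with one predictor); the data set name is hypothetical:

```sas
/* Sketch: verify the adjusted R-square gain when MiscVal is added,
   from R-square, n, and the number of regressors p. */
data adj_rsq_check;
    n = 2930;
    r2_two   = 0.5109;  p_two   = 2;   /* TotalFlrSF + GrLivArea           */
    r2_three = 0.5151;  p_three = 3;   /* TotalFlrSF + GrLivArea + MiscVal */
    adj_two   = 1 - (1 - r2_two)  * (n - 1) / (n - p_two   - 1);  /* ~0.5106 */
    adj_three = 1 - (1 - r2_three)* (n - 1) / (n - p_three - 1);  /* ~0.5146 */
    gain = adj_three - adj_two;                                   /* ~0.0040 */
    put adj_two= adj_three= gain=;
run;
```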
The residual plots for each variable provide more of a view of how the residuals change with each variable than the results in part 5, but they still do not appear random, with distinct patterns for each variable. However, we can see that the MiscVal residuals do not show any obvious linear relationship and include only a few non-zero values. Thus, we still have normality issues.
Changes & Better Fit
Overall, the additional variable, MiscVal, has not made any real impact on predicting SalePrice. This is no surprise, based on its low correlation value. While our adjusted R-Square did increase, the increase of only .0040 is not meaningful, showing us that adding additional variables does not necessarily improve the results or help us create a better-fitting equation that explains more of the variability. While adjusted R-Square takes on increased importance as a criterion for comparing the models, primarily because we do not want to make incorrect decisions by simply adding variables, the plots and charts are helpful in understanding the relationships and how they impact the model.
Using Three Variables Conclusion
Adding additional variables does not necessarily improve results. By adding a low-correlation variable, MiscVal, the other variables within the equation received only minor adjustments. Each of the variables has a small p-value, so we can reject the null hypotheses. Similar to the prior analysis, there are not a lot of changes, with the exception of the variance inflation factors. The review of the plots reflects similar patterns, with the exception of the residual plot, where MiscVal has a different pattern, though still not the randomness we hope for. Overall, the analysis remained essentially the same as with only two variables in part 5. Additionally, adding the MiscVal variable had a minor effect in explaining the variability and did not have any significant impact on predicting SalePrice.
Conclusion
By observing the various models and the effects of adding and reviewing various variables, we can easily grasp the changes through the various plots, equations, and summary statistics. Continuous variables produce a much different look and resulting plots than categorical variables. Additionally, evaluating the adjusted R-Square and F value provides insight into how the model is performing even when the impact is difficult to ascertain through the plots.
The model may not be appropriately specified, as all variables are treated as equally important, which may not be the case. Additionally, we could be adding new variables without considering their association with other variables, as shown between TotalFlrSF and GrLivArea: removing GrLivArea would have significant effects on the TotalFlrSF coefficient.
Next steps in the modeling process would be to investigate a few of the influential points identified in several parts (for example, via Cook’s D), to determine whether the adjusted R-Square can be improved through the evaluation of different variables, and to try to account for more of the variability, as none of the models has exceeded 51.46%. Finally, transformations of the data that reduce the conflicts between the R-Square values and the ODS output would be very helpful in creating a useful model we are confident in.