Variable Transformations: Continuous & Categorical
The Ames, Iowa housing data set has been used to develop various regression models to estimate the sale price of a house from numerous variables, which range from highly correlated continuous variables to categorical variables with low correlations. In this assessment, variable transformations and comparisons of Y versus Log(Y) are conducted, along with a review of the statistical and ODS outputs used to analyze and interpret the results. Additionally, outliers are identified and evaluated for retention or removal, along with the rationale behind the process. Finally, the ways variable transformations and outlier deletion impact the modeling process and the results are discussed, along with potential next steps.
Part A.1: Appending New Variables
As D. Montgomery explains in Introduction to Linear Regression Analysis (2012), there are several major assumptions in regression analysis, including:
- The relationship between the response y and the regressors is linear, at least approximately
- The error term has zero mean
- The error term has constant variance
- The errors are uncorrelated
- The errors are normally distributed
A small subset of the proc print output showing the new transformed variables log_SalePrice and log_TotalFlrSF is shown below:
Obs | TotalFullBath | TotalHalfBath | TotalBath | log_SalePrice | log_TotalFlrSF |
1 | 3 | 2 | 5 | 13.5345 | 8.37008 |
2 | 4 | 1 | 5 | 13.5211 | 8.40649 |
3 | 4 | 1 | 5 | 13.3455 | 8.19616 |
4 | 3 | 1 | 4 | 13.3294 | 7.81197 |
5 | 3 | 1 | 4 | 13.3239 | 7.76811 |
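The SAS data step that appends these columns is not shown here; as an illustrative Python sketch (the values below are hypothetical, chosen to resemble the printed rows), the same natural-log transformation is:

```python
import numpy as np

# Hypothetical SalePrice and TotalFlrSF values resembling the printed rows
sale_price = np.array([755000.0, 745000.0, 625000.0])
total_flr_sf = np.array([4316.0, 4476.0, 3627.0])

# Append the natural-log transformations used throughout Part A
log_sale_price = np.log(sale_price)
log_total_flr_sf = np.log(total_flr_sf)

print(np.round(log_sale_price, 4))    # values near 13.5, as in the table
print(np.round(log_total_flr_sf, 5))  # values near 8.4, as in the table
```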
Part A: Task 2
Fit four models based on pair-wise combinations of the newly appended variables, including the variable transformations.
Model A: TotalFlrSF to Predict SalePrice
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model | 1 | 9.518383E12 | 9.518383E12 | 3037.86 | <.0001
Error | 2928 | 9.174155E12 | 3133249517
Corrected Total | 2929 | 1.869254E13

Root MSE | 55975 | R-Square | 0.5092
Dependent Mean | 180796 | Adj R-Sq | 0.5090
Coeff Var | 30.96054

Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | 1 | 11406 | 3242.59761 | 3.52 | 0.0004
TotalFlrSF | 1 | 113.30303 | 2.05569 | 55.12 | <.0001
Equation & Coefficient Interpretation
Utilizing the standard equation SalePrice = β0 + β1x + ε, our equation becomes:
- SalePrice = $11,406 + 113.30303 x TotalFlrSF
This reflects that each one-unit increase in TotalFlrSF is associated with an average increase of $113.30 in the mean sale price. Even at a TotalFlrSF of zero, the model still predicts a SalePrice of $11,406. Since we are evaluating houses, a near-zero-square-footage sale would only be logical for a 'tear down,' where a buyer would spend the time and money required to build a new house. Such sales are outside the norm and would require a different equation to determine their prices.
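As a sketch of how such a simple linear regression could be reproduced outside SAS (synthetic data generated from the coefficients in the table above, not the actual Ames observations), ordinary least squares recovers a slope near the fitted $113.30:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Ames data, generated from the fitted Model A
total_flr_sf = rng.uniform(800.0, 4000.0, 500)
sale_price = 11406 + 113.3 * total_flr_sf + rng.normal(0.0, 50000.0, 500)

# Least-squares fit with an intercept column
X = np.column_stack([np.ones_like(total_flr_sf), total_flr_sf])
(b0, b1), *_ = np.linalg.lstsq(X, sale_price, rcond=None)
print(round(b1, 1))  # slope should land near 113.3 dollars per square foot
```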
Model B: log_TotalFlrSF to Predict SalePrice
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model | 1 | 9.16712E12 | 9.16712E12 | 2817.86 | <.0001
Error | 2928 | 9.525417E12 | 3253216313
Corrected Total | 2929 | 1.869254E13

Root MSE | 57037 | R-Square | 0.4904
Dependent Mean | 180796 | Adj R-Sq | 0.4902
Coeff Var | 31.54769

Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | 1 | -1068768 | 23563 | -45.36 | <.0001
log_TotalFlrSF | 1 | 172187 | 3243.70186 | 53.08 | <.0001
Equation & Coefficient Interpretation
The standard equation is SalePrice = β0 + β1x + ε
- SalePrice = $-1,068,768 + 172,187 x log_TotalFlrSF
The above equation reflects that if we increase TotalFlrSF by one percent, we can expect the average SalePrice to increase by approximately $1,721.87 (172,187/100).
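The division by 100 is the standard approximation for a level-log model; a quick numeric check using the fitted coefficients from the table above (the 2,000 sq ft example is arbitrary):

```python
import numpy as np

# Fitted level-log model: SalePrice = -1,068,768 + 172,187 * ln(TotalFlrSF)
b0, b1 = -1_068_768.0, 172_187.0
predict = lambda sqft: b0 + b1 * np.log(sqft)

# A 1% increase in TotalFlrSF changes the prediction by b1 * ln(1.01),
# which is approximately b1 / 100 = $1,721.87
delta = predict(2000.0 * 1.01) - predict(2000.0)
print(round(delta, 2))
```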
Model C: TotalFlrSF to Predict log_SalePrice
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model | 1 | 240.66674 | 240.66674 | 2865.46 | <.0001
Error | 2928 | 245.91951 | 0.08399
Corrected Total | 2929 | 486.58626

Root MSE | 0.28981 | R-Square | 0.4946
Dependent Mean | 12.02097 | Adj R-Sq | 0.4944
Coeff Var | 2.41086

Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | 1 | 11.16922 | 0.01679 | 665.30 | <.0001
TotalFlrSF | 1 | 0.00056973 | 0.00001064 | 53.53 | <.0001
Equation & Coefficient Interpretation
The standard equation is log_SalePrice = β0 + β1x + ε
- log_SalePrice = 11.16922 + .00056973 x TotalFlrSF
The above equation reflects that for each one-unit increase in TotalFlrSF, we can expect the average SalePrice to increase by approximately .056973% (.00056973 x 100%).
Model D: log_TotalFlrSF to Predict log_SalePrice
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model | 1 | 258.85775 | 258.85775 | 3328.24 | <.0001
Error | 2928 | 227.72851 | 0.07778
Corrected Total | 2929 | 486.58626

Root MSE | 0.27888 | R-Square | 0.5320
Dependent Mean | 12.02097 | Adj R-Sq | 0.5318
Coeff Var | 2.31998

Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | 1 | 5.38090 | 0.11521 | 46.70 | <.0001
log_TotalFlrSF | 1 | 0.91499 | 0.01586 | 57.69 | <.0001
Equation & Coefficient Interpretation
The standard equation is:
- Log_SalePrice = 5.38090 + .91499 x log_TotalFlrSF
The above equation reflects that a one percent increase in TotalFlrSF is associated with an expected increase of approximately .91499 percent in SalePrice; in a log-log model, the coefficient is an elasticity.
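Because Model D is log-log, this elasticity reading can be checked numerically with the fitted coefficients (the 2,000 sq ft example is arbitrary):

```python
import numpy as np

# Fitted log-log model: log_SalePrice = 5.38090 + 0.91499 * log_TotalFlrSF
b0, b1 = 5.38090, 0.91499
predict_price = lambda sqft: np.exp(b0 + b1 * np.log(sqft))

# A 1% increase in floor area predicts roughly a 0.915% increase in price
pct_change = (predict_price(2000.0 * 1.01) / predict_price(2000.0) - 1) * 100
print(round(pct_change, 3))  # close to the 0.91499 coefficient
```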
Summary:
Based on the above, Model D is the best-fitting model as it has the highest Adjusted R-Square value and the largest F-Value. The Adjusted R-Square and F-Value were chosen as the key metrics because they provide a macro-level view of how well the model fits. The Adjusted R-Square measures how close the data points are to the fitted regression line, adjusted for the number of predictors, and reflects the proportion of the variability of the response around its mean that the model explains. Thus, in Model D, we have accounted for 53.2% of the variability but still have 46.8% unexplained.
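The relationship between the reported R-Square and Adjusted R-Square can be verified directly from the output values (n and p taken from the ANOVA table):

```python
# Adjusted R-Square penalizes R-Square for the number of predictors p:
# adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
def adj_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Model D: R-Square 0.5320 with n = 2930 observations and p = 1 predictor
print(round(adj_r2(0.5320, 2930, 1), 4))  # 0.5318, matching the SAS output
```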
Evaluating the models based on the various plots we can see that:
- The t-values in each of the models range from -45.36 to 665.3
- The p-values are all statistically significant
- Model A & Model B: The residual plots are similar and show a distinct grouping that does not appear random. Model B has more of a fan shape than Model A; the Q-Q plots differ little, but Model B has a heavy tail. The predicted values versus the sale price do not follow the 45-degree line.
- Model C & Model D: The residual versus predicted plots look similar but still do not appear random. The Q-Q plot is light-tailed. The predicted values versus the sale price appear to follow the 45-degree line somewhat better than in Models A & B.
- Models A & C each have a few Cook's D values that exceed .20, reflecting that these models contain influential data points.
- Models B & D have no Cook's D values that exceed .20, reflecting few influential data points by that criterion.
- The residuals for the log_SalePrice models also look the best of all the models, as they appear more random.
- Model D follows the 45-degree line better than the other models but still has linearity issues.
Concerns on the ‘best’ fitting model
My concern with the best-fitting model, Model D, is that while it fits best based on the Adjusted R-Square and F-Value, the impact of the few influential data points has not yet been analyzed. Additionally, ~47% of the variability remains unexplained, and the residuals do not appear random, showing a distinct grouping of data. The residual plot still seems to have a pattern, the Q-Q plot shows that the data does not yet fit the line, and outliers remain in the residuals. Finally, we need to step back to confirm that the transformation actually adds value, and we must not forget to assess normality before applying the transformation. So, while we have improved the model, we still have additional considerations to address.
Concerns about using Variable Transformations
One concern about using variable transformations is that comparing the original data with the transformed data is difficult due to the change in scales. On the other hand, transforming variables can make continuous non-normal data approximately normal, increasing the validity of the associated statistical analysis. Personally, I prefer Model A because it is less complicated to understand, including for any consumer considering the purchase of a house. Model D is not as intuitive because of its scales, making interpretation more difficult.
How is the interpretation of the log (SalePrice) model different from the price model?
In the price model, a one-unit change in the independent variable (TotalFlrSF) changes the expected value of the dependent variable by the regression coefficient, holding all other predictors constant. In the log(SalePrice) model, a one-unit change in TotalFlrSF instead changes the expected SalePrice multiplicatively, which is why the discussion above explained the equation on a percentage basis for ease of understanding.
Conclusion
Using log_SalePrice and log_TotalFlrSF provides a different perspective and method of calculation compared with the prior equations. While variable transformations are not as intuitive, they do provide a better model in terms of Adjusted R-Square and F-Value. Additionally, the plots of the data are more aligned with what we would like to see: the residuals look more random than in the other models, and the Q-Q plot is light-tailed. The predicted values versus the sale price fit the 45-degree line better, and Cook's D shows no influential values that exceed .20. Based on the above, Model D is the best-fitting model as it has the highest Adjusted R-Square value and the largest F-Value, though it accounts for only 53.2% of the variability. Comparing the models also reveals commonalities: Models A and B have a distinct grouping that does not appear random; in Models C and D, the Q-Q plot is lighter-tailed; Models A and C each have a few Cook's D values that exceed .20; and Models B and D have none.
Part A.3: Correlate Continuous Variables
The below table reflects the correlation for SalePrice, log_SalePrice and sqrt_SalePrice for continuous variables:
TotalFlrSF has the largest correlation with all three versions of the response. This is logical given how square footage ties into the sale price value perceived by consumers.
The above plots vary considerably. In SalePrice vs. TotalFlrSF the line is not linear, and the outliers contribute to this nonlinearity. In SalePrice vs. log_TotalFlrSF we have a smooth curved line that is not straight. Log_SalePrice vs. TotalFlrSF is not smooth but, for a portion, seems to track the 45-degree line. Finally, TotalFlrSF vs. sqrt_SalePrice does not have a smooth line, but it follows the 45-degree line for a larger portion of the plot than the previous plots and thus appears to be our best-fitting relationship. The outliers seem to have a large effect on how the curve changes direction, which is in line with the insights provided by the analysis and Cook's D.
Regarding constant variance, looking at the regression line for SalePrice versus TotalFlrSF, the variance increases with the fitted values and violates the constant-variance assumption. With log_SalePrice, the variance is more constant than with SalePrice, so the assumption is now met where in the previous models it was not.
Part A.4: Log improvements
The conditions where the log transformation of the response variable (Y) can improve the model fit are those where the effect of the outliers is reduced. When applied to the independent variable (in this case, TotalFlrSF), it can help create a linear relationship with the dependent variable. Variable transformation, while perhaps less intuitive, assists in providing a more effective and useful model.
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model | 1 | 11722622 | 11722622 | 3327.03 | <.0001
Error | 2928 | 10316657 | 3523.44838
Corrected Total | 2929 | 22039279

Root MSE | 59.35864 | R-Square | 0.5319
Dependent Mean | 416.26208 | Adj R-Sq | 0.5317
Coeff Var | 14.25992

Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | 1 | 30.16558 | 6.78295 | 4.45 | <.0001
Sqrt_TotalFlrSF | 1 | 10.11869 | 0.17543 | 57.68 | <.0001
The standard equation is:
- sqrt_SalePrice = 30.16558 + 10.11869 x Sqrt_TotalFlrSF
Thus, holding all other predictors constant, a one-unit increase in Sqrt_TotalFlrSF is associated with an average increase of 10.12 in sqrt_SalePrice.
Comparison of Sqrt_SalePrice to log_SalePrice
ODS output with the variable transformations based on SQRT
Comparing sqrt_SalePrice to log_SalePrice, the Residual vs. Predicted Value and RStudent plots appear to have more of a fan shape than the log_SalePrice ODS plots. Additionally, the Q-Q plot has a different fit to the line, with sqrt_SalePrice appearing to have a heavier tail than log_SalePrice. The predicted values for sqrt_SalePrice appear more dispersed than for log_SalePrice, with more outliers. Cook's D has a very different scale, indicating that the log_SalePrice outliers have more influence than the sqrt_SalePrice outliers. We also see a small increase in the Adjusted R-Square, from 0.4944 for log_SalePrice to .5317 for sqrt_SalePrice. Finally, we see a larger F-Value for sqrt_SalePrice, 3327.03 versus 2865.46 for log_SalePrice.
In comparing the residuals for sqrt_SalePrice versus log_SalePrice there is still a cluster of values occurring. The log_SalePrice appears to be more compressed but both are very similar in how the outliers appear.
The sqrt_SalePrice fit plot appears to be more compressed with fewer values outside of the 95% prediction limits. However, it has significantly more outliers. They are both still very similar but the sqrt_SalePrice has more of a fan shape to it.
Log Improvements Conclusion
Utilizing variable transformations to create log_SalePrice, we have seen how it impacted SalePrice and TotalFlrSF. The residual versus predicted values changed to more of a fan shape when the values were transformed. Additionally, the log_SalePrice Q-Q plot fit the line significantly better with the log values than with the non-transformed values. The residuals for log_SalePrice and log_TotalFlrSF appeared much more random than in any of the other three models, and while they still do not appear completely random, as we would prefer, the difference is easy to spot. The Fit Plot also improved, with more data points remaining within the 95% prediction limits, and the transformed models had the highest Adjusted R-Square and F-Value. Evaluating the continuous variables with log_SalePrice was interesting in how the regression line smoothed out when log_TotalFlrSF was used, versus other variables that appeared choppy with drastic changes in direction. Adding the square-root transformation of SalePrice was also conducted. While somewhat similar to log_SalePrice, there were some modest changes to the scatter plots, including changes in the residual versus predicted values, the tails of the Q-Q plot, the Cook's D scale, a different cluster shape in the residuals plot, and a fit plot that appears more compressed but has more outliers. The F-Value was larger for sqrt_SalePrice, but the Adjusted R-Square was slightly larger for log_SalePrice.
Part B Step 5: Outliers
To assist in finding outliers, the proc univariate normal plot was utilized.
proc univariate normal plot data=ames;
   var SalePrice;
   histogram SalePrice / normal;
run;
This allows us to see the outliers and ranges.
Based on the above values, SalePrice was evaluated using the 99% quantile value of $457,347 along with the 1% quantile value of $61,500. The outliers can easily be seen in the extreme high and low values of the distribution and probability plots, along with the quantiles table.
The two-standard-deviation rule states that roughly 95% of all observations fall within two standard deviations of the mean. Applying it here would mean locating all observations two or more standard deviations from the mean and removing them to minimize issues with non-normal distributions, regardless of whether the observations are legitimate. I do not believe it should be used here, as many values exceed two standard deviations from the mean; removing them would cause a large loss of data and affect the bias, accuracy, and power of the results. Outliers that are illegitimately included in the data, i.e., errors, should be removed.
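To illustrate why the two-standard-deviation rule would be heavy-handed on skewed prices, a small simulation (synthetic lognormal prices with assumed parameters, not the actual Ames data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Right-skewed synthetic prices, mimicking a sale-price distribution
prices = rng.lognormal(mean=12.0, sigma=0.4, size=2930)

# Two-standard-deviation rule: flag everything beyond mean +/- 2*SD
mu, sd = prices.mean(), prices.std()
flagged = (prices < mu - 2 * sd) | (prices > mu + 2 * sd)
print(flagged.sum())  # well over 1% of legitimate observations get flagged
```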
Based on the quantile values of 1% and 99% described above, a code was assigned to these observations, flagging some of the lowest and highest SalePrice values.
After applying the codes and removing the observations that met the above criteria, we can see the results. The outlier_def codes are as follows:
- Code 1 reflects the values when SalePrice <= 61500
- Code 2 reflects SalePrice < 457347 & SalePrice > 61500
- Code 3 reflects SalePrice >= 457347
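A sketch of the same 1%/99% coding in Python (synthetic prices stand in for the real data; the actual SAS cutoffs were $61,500 and $457,347):

```python
import numpy as np

rng = np.random.default_rng(2)
sale_price = rng.lognormal(12.0, 0.4, 2930)  # synthetic stand-in for SalePrice

# Flag the bottom 1% as code 1 and the top 1% as code 3; code 2 otherwise
lo, hi = np.quantile(sale_price, [0.01, 0.99])
outlier_def = np.where(sale_price <= lo, 1, np.where(sale_price >= hi, 3, 2))

# About 30 observations land in each extreme code, as in the frequency table
print((outlier_def == 1).sum(), (outlier_def == 3).sum())
```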
Using these extreme-value cutoffs, 60 observations were identified (30 at each extreme), and we were confident in removing them as they are a small percentage of the data.
outlier_def | Frequency | Percent | Cumulative Frequency | Cumulative Percent
1 | 30 | 1.02 | 30 | 1.02 |
2 | 2870 | 97.95 | 2900 | 98.98 |
3 | 30 | 1.02 | 2930 | 100.00 |
outlier_def=1 | outlier_def=3
Quantiles for Normal Distribution
Percent | Observed Quantile | Estimated Quantile
1.0 | 61500.0 | -5048.18
5.0 | 87500.0 | 49394.14
10.0 | 105250.0 | 78417.14
25.0 | 129500.0 | 126913.30
50.0 | 160000.0 | 180796.06
75.0 | 213500.0 | 234678.82
90.0 | 281356.5 | 283174.98
95.0 | 335000.0 | 312197.98
99.0 | 457347.0 | 366640.30
Outliers Conclusion
As reflected above, the determination of which observations to remove was based on the SalePrice values at the 99% and 1% levels. In reality, we would not simply delete these values but would flag them as observations to be investigated and evaluated as to whether they are errors or legitimate data points. Observations that are errors would be removed, and legitimate observations would remain intact.
For the purposes of understanding the impact outliers can have on our model, we simply removed all of the values beyond this threshold. After removing the selected outliers, we compared three models: GrLivArea alone; TotalFlrSF and GrLivArea; and TotalFlrSF, GrLivArea, and MiscVal, to determine whether the removal of the outliers helped the models become more linear. In each case, the Adjusted R-Square and F-Values decreased, but the plots improved, reflecting positive changes.
Step 6: Cleaned Data
With the outliers removed, we now have a new quantiles table and distribution plot for SalePrice.
Quantiles (Definition 5)
Level | Quantile
100% Max | 455000
99% | 405000
95% | 320000
90% | 275500
75% Q3 | 212900
50% Median | 160000
25% Q1 | 130000
10% | 107950
5% | 91000
1% | 75000
0% Min | 62383
Going back to the analysis on GrLivArea and applying the same data analysis to the data but this time with the cleaned data set we have the following results in the highest ranked variables:
Number in Model | R-Square | Adjusted R-Square | C(p) | Variables in Model
1 | 0.4503 | 0.4501 | 574.7237 | GrLivArea
1 | 0.2179 | 0.2176 | 1815.424 | MasVnrArea
1 | 0.1274 | 0.1270 | 2298.385 | YearBuilt
1 | 0.1270 | 0.1267 | 2300.429 | HouseAge
1 | 0.0681 | 0.0677 | 2615.237 | YearRemodel
1 | 0.0636 | 0.0632 | 2638.890 | TotalBath
1 | 0.0608 | 0.0604 | 2654.258 | GarageArea
Comparison to the Full Data Set
We evaluate the same variables as were used with the full data set, beginning with GrLivArea.
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | 1 | 30406 | 3352.95617 | 8.61 | <.0001
GrLivArea | 1 | 99.03272 | 2.25147 | 43.99 | <.0001

Root MSE | 59250 | R-Square | 0.4503
Dependent Mean | 178239 | Adj R-Sq | 0.4501
Coeff Var | 29.70718
Equation & Interpret each Coefficient
The new equation for GrLivArea is SalePrice = $30,406 + 99.03272 x GrLivArea, which reflects that for each one-unit increase in GrLivArea, the SalePrice increases by an average of $99.03. This seems reasonable and logical.
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model | 2 | 6.604169E12 | 3.302084E12 | 1258.76 | <.0001
Error | 2867 | 7.520959E12 | 2623285390
Corrected Total | 2869 | 1.412513E13

Root MSE | 51218 | R-Square | 0.4675
Dependent Mean | 178437 | Adj R-Sq | 0.4672
Coeff Var | 28.70369

Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| | Variance Inflation
Intercept | 1 | 13708 | 3958.59881 | 3.46 | 0.0005 | 0
TotalFlrSF | 1 | 14.03133 | 1.91693 | 7.32 | <.0001 | 1.01991
GrLivArea | 1 | 96.26726 | 2.00021 | 48.13 | <.0001 | 1.01991
Equation & Interpret each Coefficient
The new equation is:
- SalePrice = $13,708 + $14.03 x TotalFlrSF + $96.27 x GrLivArea
This reflects that for each unit increase in TotalFlrSF, the SalePrice increases by $14.03, and for each unit increase in GrLivArea it increases by $96.27, holding the other variable constant. This seems reasonable and logical, and differs from the full-data version of the equation, where the GrLivArea coefficient was negative.
Variables:
The three variables are TotalFlrSF, GrLivArea and MiscVal.
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model | 3 | 6.607801E12 | 2.2026E12 | 839.75 | <.0001
Error | 2866 | 7.517327E12 | 2622933420
Corrected Total | 2869 | 1.412513E13

Root MSE | 51215 | R-Square | 0.4678
Dependent Mean | 178437 | Adj R-Sq | 0.4672
Coeff Var | 28.70176

Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| | Variance Inflation
Intercept | 1 | 13634 | 3958.83629 | 3.44 | 0.0006 | 0
TotalFlrSF | 1 | 14.21735 | 1.92331 | 7.39 | <.0001 | 1.02684
GrLivArea | 1 | 96.19807 | 2.00094 | 48.08 | <.0001 | 1.02079
MiscVal | 1 | -2.16107 | 1.83648 | -1.18 | 0.2394 | 1.00713
Equation:
- Sales Price = $13,634 + $14.21735 x TotalFlrSF + $96.19807 x GrLivArea – $2.16107 x MiscVal
For each unit increase in TotalFlrSF, the SalePrice increases by $14.22, a large decrease from the $186.44 estimated with the full dataset. Additionally, the decrease in SalePrice of $72.40 per unit of GrLivArea with the full dataset has now changed to an increase of $96.20, and MiscVal now reduces SalePrice by $2.16 per unit, a smaller effect than the $9.15 decrease with the full dataset. MiscVal has little impact on SalePrice, so this value is logical.
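The Variance Inflation column in the table above can be reproduced by regressing each predictor on the others; a sketch with synthetic, mildly correlated predictors (correlation strength assumed, standing in for TotalFlrSF and GrLivArea):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two mildly correlated synthetic predictors
n = 1000
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.15 * x1 + rng.normal(0.0, 1.0, n)

def vif(target, others):
    # VIF_j = 1 / (1 - R^2) from regressing predictor j on the other predictors
    X = np.column_stack([np.ones(len(target))] + others)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    r2 = 1 - resid @ resid / ((target - target.mean()) ** 2).sum()
    return 1 / (1 - r2)

print(round(vif(x1, [x2]), 3))  # near 1, indicating little multicollinearity
```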
Summary Table:
Cleaned Data | Adj R-Square | F Value | P Value |
SalePrice = $30,406 + 99.03 x GrLivArea | 0.4501 | 1934.76 | <.0001 |
SalePrice = $13,708 + $14.03 x TotalFlrSF + $96.27 x GrLivArea | 0.4675 | 1258.76 | <.0001 |
Sales Price = $13,634 + $14.21735 x TotalFlrSF + $96.19807 x GrLivArea – $2.16107 x MiscVal | 0.4672 | 839.75 | <.0001 |
Full Data | |||
SalePrice = $13,290 + 111.694 x GrLivArea | 0.4994 | 2922.59 | <.0001 |
Sale Price = $11,688 + $185.03078 x TotalFlrSF – $71.69170 x GrLivArea | 0.5106 | 1528.94 | <.0001 |
Sales Price = $11,105 + $186.4389 x TotalFlrSF – $72.39791 x GrLivArea – $9.14957 x MiscVal | 0.5146 | 1036.17 | <.0001 |
Analysis
GrLivArea: As we can see above, the goodness of fit decreased in each model, both in Adjusted R-Square and F-Value. The GrLivArea model using the cleaned data still looks quite similar to the full data set. The residuals still have a fan shape, reflecting that the constant-variance assumption is violated. The Q-Q plot has smoother tails but still departs from the line, and outliers remain. Cook's D is similar, although its range has increased. The residuals are more compressed with fewer outliers, and the fit plot also has fewer outliers but is still fan-shaped; otherwise it looks very similar, with minimal changes.
TotalFlrSF and GrLivArea show more noticeable changes, with some minor slimming in the tails of the Q-Q plot and fewer top outliers in the predicted versus actual SalePrice plot, along with a bit more compression. Cook's D looks very similar, with only two points above .20. The residual histogram appears quite different, with more bars in the middle. Finally, the residuals by regressors show less of a funnel shape, appear more random, and have fewer outliers.
For TotalFlrSF, GrLivArea, and MiscVal, we again see slimmer tails in the Q-Q plot, and the predicted value versus sale price plot appears more compressed with fewer outliers. The Cook's D scale changed significantly, from 3 to .4. The residual histogram again shows more bars in the middle. Finally, the residuals by regressors are less fan-shaped and appear more random and compressed, with fewer outliers for TotalFlrSF, similar to GrLivArea. MiscVal has fewer outliers, with the remaining ones closer to the line.
Overall, while the Adjusted R-Square and F-Values have decreased, the graphical plots with the cleaned data appear to be in better shape. While the alterations are not major, all of the small differences in each of the plots suggest that the relationships between the variables have improved. There is still room for improvement, but the minor step of removing the data beyond the 99% and 1% marks has made some positive improvements. Thus, the process was worthwhile, though we still have issues to address.
Step 7: Influential Points
Using a DFFITS threshold of .69 and removing the observations that exceeded it, the data changes as follows:
Before the DFFITS removal | After the DFFITS removal
Thus, based on the above, eight observations were removed using the .69 threshold.
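DFFITS combines an observation's studentized residual with its leverage. A self-contained sketch on synthetic data is below; the 2*sqrt(p/n) rule of thumb used here is a common convention and differs from the .69 threshold chosen above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic regression with two planted high-leverage, high-residual points
n = 200
x = rng.normal(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, n)
x[:2], y[:2] = [6.0, -6.0], [40.0, -40.0]

X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverages (hat diagonal)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
p = X.shape[1]

# Externally studentized residuals, then DFFITS_i = t_i * sqrt(h_i / (1 - h_i))
sse = resid @ resid
s2_loo = (sse - resid**2 / (1 - h)) / (n - p - 1)  # leave-one-out variance
t = resid / np.sqrt(s2_loo * (1 - h))
dffits = t * np.sqrt(h / (1 - h))

threshold = 2 * np.sqrt(p / n)
print((np.abs(dffits) > threshold).sum())  # includes the two planted points
```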
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model | 21 | 7.27839E12 | 3.4659E11 | 172.43 | <.0001
Error | 2334 | 4.691511E12 | 2010073476
Corrected Total | 2355 | 1.19699E13

Root MSE | 44834 | R-Square | 0.6081
Dependent Mean | 178179 | Adj R-Sq | 0.6045
Coeff Var | 25.16229
Additionally, the ODS output is as follows:
Thus, by removing the additional eight observations, we can see how the Predicted Value vs. Residual plot has fewer outliers and the scale has changed from 600,000 to 400,000. Additionally, in the Q-Q plot the outliers at the bottom of the plot have disappeared. Similarly, the Predicted Value vs. SalePrice plot also shows a change in scale from 600,000 to 400,000, with the observations now more compressed around the 45-degree line. The Cook's D plot scale has changed from .15 to a maximum of .025 for a single value, with most values less than .015. We also see the Adjusted R-Square increase to .6045 from .5576, with an F-Value of 172.43.
Overall, the removal of the additional eight points has again made a substantial improvement to the model, making it more linear.
Step 8: Conclusion
Overall, transforming variables and detecting and deleting outliers are two effective ways to create a better-fitting model. Naturally, understanding the goals of the model is key in determining whether these techniques should be employed and to what degree. Additionally, their use is more of an art than a science. If the model already meets the needs of its purpose, then perhaps these methods are not required; however, if the model is not appropriate, then utilizing these techniques to make the model more linear is warranted.
Employing a process to determine which observations are outliers requiring evaluation can be easily accomplished, as outlined within this report. While we have been heavy-handed in the removal of outliers, in reality each outlier needs to be evaluated to see whether it is valid or an error. If an observation is an error, it should be removed; however, removing legitimate observations will distort the model, so they should be retained.
If done correctly, the transformation of variables and deletion of outliers will benefit the model by creating a better, more appropriate model that is more linear. However, it does require additional processes and a certain finesse to determine the appropriate threshold values. As easily observed from the above discussion, the choice of threshold can drastically affect the number of observations involved and needs to be carefully selected and employed.
In summary, in Part A of this report we appended new variables, log_SalePrice and log_TotalFlrSF. We then used these new variables and paired them in four different methods: TotalFlrSF to predict SalePrice, log_TotalFlrSF to predict SalePrice, TotalFlrSF to Predict log_SalePrice and log_TotalFlrSF to predict log_SalePrice. In each combination of variables, an equation and interpretation of the coefficients was provided along with an analysis of the SAS generated Output Delivery System (ODS). Additional metrics such as Adjusted R-Square, F-Value were included in determining the fit of the model. Initially, we found that the variable pair of log_SalePrice and log_TotalFlrSF was the ‘best’ fitting model based on an in-depth analysis of both ODS plots and Goodness-of-Fit data. However, we also correlated TotalFlrSF with log_SalePrice and evaluated how well the observations fit the line.
Next, we evaluated the conditions where the log transformation improved the fit and expanded the analysis to include a square-root transformation. Again, an analysis of the ODS plots was provided along with Goodness-of-Fit measures such as the Adjusted R-Square, which reflected positive improvements over log_SalePrice. The differences in the residual and fit plots were minor, but the sqrt_SalePrice still showed some improvements over the log_SalePrice.
In the second part of the report, we focused on outliers. We chose the 99% and 1% levels as our basis for determining which observations to remove, based on the distribution and probability plot of SalePrice. As mentioned above, in reality we would not be this heavy-handed in their removal; instead, each observation would be evaluated to determine whether it is an error or legitimate. However, for the purposes of understanding the impact outliers can have on our model, we simply removed all of the values beyond this threshold. After removing the selected outliers, we had a dataset of 2,870 observations, which we used to compare GrLivArea alone; TotalFlrSF and GrLivArea; and TotalFlrSF, GrLivArea, and MiscVal, to determine whether the removal of the outliers helped the models become more linear. In each case, the Adjusted R-Square and F-Values decreased, but the plots improved, reflecting positive changes.
Finally, we undertook another method of removing outliers based on DFFITS, which reflects how influential an observation is within a regression. Based on our threshold of .69, we removed eight observations, reducing our dataset to 2,862. While the number of observations removed is minimal, the changes in the various ODS plots were easy to spot, and the Adjusted R-Square improved to .6045. Thus, removing these influential observations clearly had an impact on making the model more linear.
Next steps could include an evaluation of which regressors have the largest influence on the regression equation and confirmation that the chosen variables are logical. Additionally, obtaining additional data to validate the model would help in understanding how well the model performs in predicting sale prices. In some cases, data is split so that one portion is used to create the model and the remainder to determine how predictive it is. While there are pros and cons to these methods, in our study of the Ames housing data we did not employ this technique; hence, additional data would be helpful in determining the performance of our model.