Variable Transformations: Continuous & Categorical
The Ames, Iowa housing data set has been used to develop various regression models that estimate the sale price of a house from numerous variables, ranging from highly correlated continuous variables to categorical variables with low correlations. In this assessment, variable transformations and comparisons of Y versus log(Y) are conducted, along with a review of the statistical and ODS outputs used to analyze and interpret the results. Additionally, outliers are identified and a determination is made on whether they should be kept or removed, along with the rationale behind the process. Finally, the ways variable transformations and outlier deletion impact the modeling process and the results are discussed, along with potential next steps.
Part A.1: Appending New Variables
As D. Montgomery explains in Introduction to Linear Regression Analysis (2012), there are a few major assumptions in regression analysis, including:
• The relationship between the response y and the regressors is linear, at least approximately
• The error term has zero mean
• The error term has constant variance
• The errors are uncorrelated
• The errors are normally distributed
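These assumptions can be checked numerically once a model is fit. A minimal Python sketch on simulated data (the report itself uses SAS; the coefficients and noise level below are illustrative, not estimates from the Ames data):

```python
import numpy as np

# Simulate a house-price relationship that satisfies the assumptions by
# construction: linear mean, independent normal errors with constant variance.
rng = np.random.default_rng(0)
x = rng.uniform(500, 4000, 200)                    # hypothetical floor areas
y = 11406 + 113.3 * x + rng.normal(0, 5000, 200)   # linear truth plus noise

slope, intercept = np.polyfit(x, y, 1)             # ordinary least squares
residuals = y - (intercept + slope * x)

# With an intercept in the model, OLS forces the residuals to have zero mean.
mean_resid = residuals.mean()
```

Constant variance and normality would then be assessed from a residual-versus-fitted plot and a Q-Q plot, exactly the ODS panels discussed throughout this report.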
A small subset of the PROC PRINT output showing the new transformed variables log_SalePrice and log_TotalFlrSF is shown below:
Obs   TotalFullBath   TotalHalfBath   TotalBath   log_SalePrice   log_TotalFlrSF
  1               3               2           5         13.5345          8.37008
  2               4               1           5         13.5211          8.40649
  3               4               1           5         13.3455          8.19616
  4               3               1           4         13.3294          7.81197
  5               3               1           4         13.3239          7.76811
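The appended variables are natural-log transforms (SAS's log() function is base e). A quick Python sketch, with a hypothetical price/area pair chosen to roughly reproduce observation 1 above:

```python
import math

# Hypothetical observation, back-transformed from obs 1's logged values.
sale_price, total_flr_sf = 755000, 4316

log_sale_price = math.log(sale_price)      # ≈ 13.5345
log_total_flr_sf = math.log(total_flr_sf)  # ≈ 8.3703
```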
Part A: Task 2
Fit four models based on pairwise combinations of the newly appended variables, including the variable transformations.
Model A: TotalFlrSF to Predict SalePrice
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   9.518383E12      9.518383E12   3037.86   <.0001
Error           2928   9.174155E12      3133249517
Corrected Total 2929   1.869254E13

Root MSE          55975   R-Square   0.5092
Dependent Mean   180796   Adj R-Sq   0.5090
Coeff Var      30.96054

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1   11406                3242.59761          3.52     0.0004
TotalFlrSF    1   113.30303               2.05569         55.12     <.0001
Equation & Coefficient Interpretation
Utilizing the standard equation SalePrice = β0 + β1x + ε, our equation becomes:
• SalePrice = $11,406 + 113.30303 × TotalFlrSF
This reflects that each one-unit increase in TotalFlrSF results in an average increase in the mean sale price of $113.30. At a TotalFlrSF of zero, the model still predicts a SalePrice of $11,406. As we are evaluating houses, a zero floor area would only be plausible if, for example, a house was not livable and is considered a 'tear down', where the buyer would spend the time and money required to build a new house. However, this would be outside the norm and would require a different equation to price such sales.
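As a worked example of the Model A equation (the 2,000 sq ft figure is a hypothetical house, not a data point):

```python
# Model A coefficients from the parameter estimates above.
intercept, slope = 11406, 113.30303
predicted_price = intercept + slope * 2000   # ≈ $238,012 for 2,000 sq ft
```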
Model B: log_TotalFlrSF to Predict SalePrice
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   9.16712E12       9.16712E12    2817.86   <.0001
Error           2928   9.525417E12      3253216313
Corrected Total 2929   1.869254E13

Root MSE          57037   R-Square   0.4904
Dependent Mean   180796   Adj R-Sq   0.4902
Coeff Var      31.54769

Parameter Estimates
Variable          DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept          1   -1068768             23563             -45.36   <.0001
log_TotalFlrSF     1   172187                3243.70186        53.08   <.0001
Equation & Coefficient Interpretation
The standard equation is SalePrice = β0 + β1x + ε:
• SalePrice = −$1,068,768 + 172,187 × log_TotalFlrSF
The above equation reflects that if TotalFlrSF increases by one percent, we can expect the average SalePrice to increase by roughly $1,721.87 (172,187/100).
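The divide-by-100 reading of a level-log coefficient is an approximation; the exact change for a 1% increase in the regressor uses ln(1.01). A quick Python check:

```python
import math

b1 = 172187                           # log_TotalFlrSF coefficient from above
exact_change = b1 * math.log(1.01)    # ≈ $1,713 per 1% increase in TotalFlrSF
approx_change = b1 / 100              # ≈ $1,722, the rule-of-thumb value
```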
Model C: TotalFlrSF to Predict log_SalePrice
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   240.66674        240.66674     2865.46   <.0001
Error           2928   245.91951        0.08399
Corrected Total 2929   486.58626

Root MSE        0.28981   R-Square   0.4946
Dependent Mean 12.02097   Adj R-Sq   0.4944
Coeff Var       2.41086

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1   11.16922             0.01679           665.30   <.0001
TotalFlrSF    1   0.00056973           0.00001064         53.53   <.0001
Equation & Coefficient Interpretation
The standard equation is log_SalePrice = β0 + β1x + ε:
• log_SalePrice = 11.16922 + 0.00056973 × TotalFlrSF
The above equation reflects that each one-unit increase in TotalFlrSF is expected to increase the mean SalePrice by approximately 0.056973% (0.00056973 × 100%).
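The multiply-by-100 reading of a log-level coefficient is likewise an approximation; the exact percentage change per extra square foot is exp(b1) − 1. A quick Python check:

```python
import math

b1 = 0.00056973                         # TotalFlrSF coefficient from Model C
pct_change = (math.exp(b1) - 1) * 100   # ≈ 0.0570% per additional square foot
```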
Model D: log_TotalFlrSF to Predict log_SalePrice
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   258.85775        258.85775     3328.24   <.0001
Error           2928   227.72851        0.07778
Corrected Total 2929   486.58626

Root MSE        0.27888   R-Square   0.5320
Dependent Mean 12.02097   Adj R-Sq   0.5318
Coeff Var       2.31998

Parameter Estimates
Variable          DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept          1   5.38090              0.11521           46.70    <.0001
log_TotalFlrSF     1   0.91499              0.01586           57.69    <.0001
Equation & Coefficient Interpretation
The standard equation is:
• log_SalePrice = 5.38090 + 0.91499 × log_TotalFlrSF
The above equation reflects that a one-percent increase in TotalFlrSF is expected to increase the mean SalePrice by approximately 0.91499 percent.
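In the log-log model, the coefficient is an elasticity; the exact effect of a 1% increase in TotalFlrSF is the price ratio 1.01^b1. A quick Python check:

```python
b1 = 0.91499                          # log_TotalFlrSF coefficient from Model D
price_ratio = 1.01 ** b1              # ≈ 1.0091
pct_change = (price_ratio - 1) * 100  # ≈ 0.915% per 1% increase in floor area
```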
Summary:
Based on the above, Model D is the best-fitting model as it has the highest Adjusted R-Square value and the largest F Value. The Adjusted R-Square and F Value have been chosen as the key metrics as they provide a macro-level view of how the model fits. The Adjusted R-Square measures how close the data points are to the fitted regression line and reflects the amount of variability of the response data around its mean. Thus, in Model D, we have accounted for 53.2% of the variability but still have 46.8% unexplained.
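The Adjusted R-Square can be reproduced from the R-Square and the degrees of freedom in the ANOVA table (n − 1 = 2929, one regressor). A Python sketch for Model D:

```python
# Adjusted R-Square = 1 - (1 - R^2) * (n - 1) / (n - p - 1), for Model D.
n, p, r2 = 2930, 1, 0.5320
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # ≈ 0.5318, matching the output
```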
Evaluating the models based on the various plots, we can see that:
• The t-values in each of the models range in magnitude from 45.36 to 665.3
• The p-values are all statistically significant
• Model A & Model B: The residual plots are similar and have distinct groupings that do not appear random. Model B's residuals have more of a fan shape than Model A's; there is little difference in the Q-Q plots, but both have a heavy tail. The predicted versus actual sale price does not follow the 45-degree line.
• Model C & Model D: The residual versus predicted plots look similar but still do not appear random. The Q-Q plots are light-tailed. The predicted versus actual sale price appears to follow the 45-degree line a bit more than in Models A & B.
• Models A & C each have a few Cook's D values that exceed .20, reflecting that these models contain influential data points
• Models B & D have no values that exceed .20, reflecting that they have few influential data points
• The residuals for log_SalePrice also look the best out of all of the models, as they appear more random
• Model D follows the 45-degree line better than the other models but still has linearity issues
Concerns about the 'best' fitting model
My concern with the best-fitting model, Model D, is that while it is the best fitting based on the Adjusted R-Square and F Value, the impact of its few influential data points has not yet been analyzed. Additionally, ~47% of the variability remains unexplained, and the residuals do not appear random, showing a distinct grouping of data. The residual plot still seems to have a pattern, the Q-Q plot shows that the data does not yet fit the line, and outliers remain in the residuals. Finally, we need to step back and ensure that the transformation actually adds value, and we must not forget to check normality before applying the transformation. So, while we have improved the model, we still have additional considerations to address.
Concerns about using Variable Transformations
One concern about using variable transformations is that comparing the original data to the transformed data is difficult due to the change in scales. On the other hand, transforming variables can make continuous non-normal data approximately normal, increasing the validity of the associated statistical analysis. Personally, I prefer Model A because it is less complicated to understand, including for any consumer considering the purchase of a house. Model D is not as intuitive because of its scales, making interpretation more difficult.
How is the interpretation of the log (SalePrice) model different from the price model?
In the price model, a one-unit change in the independent variable (TotalFlrSF) changes the expected value of the dependent variable by the regression coefficient, holding all other predictors constant. In the log(SalePrice) model, the coefficient instead reflects an approximate percentage change in SalePrice; in the discussion above, I expressed the equations on a percentage basis for ease of understanding.
Conclusion
Using log_SalePrice and log_TotalFlrSF provides a different perspective and method of calculation from the prior equations. While variable transformations are not as intuitive, they do provide a better model in terms of Adjusted R-Square and F Value. Additionally, the plots of the data are more aligned with what we would like to see: the residuals look more random than in the other models, the Q-Q plot is light-tailed, the predicted versus actual sale price fits the 45-degree line better, and Cook's D shows no influential values that exceed .20. Based on the above, Model D is the best-fitting model as it has the highest Adjusted R-Square value and the largest F Value, though it accounts for only 53.2% of the variability. Comparing the models also reveals commonalities: Models A and B have distinct groupings that do not appear random; in Models C and D, the Q-Q plot is lighter-tailed; Models A and C each have a few Cook's D values that exceed .20; and Models B and D have no values that exceed .20.
Part A.3: Correlate Continuous Variables
The below table reflects the correlations of SalePrice, log_SalePrice, and sqrt_SalePrice with the continuous variables:
TotalFlrSF has the largest correlation of all three variables. This is logical, given how square footage ties into the sale price value perceived by consumers.
The above plots are very different. In SalePrice vs. TotalFlrSF the line is not linear, and the outliers contribute to this nonlinearity. In SalePrice vs. log_TotalFlrSF we have a smooth curved line that is not straight. log_SalePrice vs. TotalFlrSF is not smooth but, for a portion, seems to be linear along the 45-degree line. Finally, in TotalFlrSF vs. sqrt_SalePrice we do not have a smooth line, but it follows the 45-degree line over a larger portion of the plot than the previous plots and thus appears to be our best-fitting relationship. The outliers seem to have a large effect on how the curve changes direction, which is in line with the insights provided by the analysis and Cook's D.
For constant variance, examine the regression line: in SalePrice vs. TotalFlrSF the variance increases with the fitted values and violates the constant-variance assumption. With log_SalePrice, the variance is more constant, so the assumption is now met, whereas in the previous models it was not.
Part A.4: Log improvements
The conditions where the log transformation of the response variable (Y) can improve the model fit include when the effect of the outliers is reduced. When applied to the independent variable (in this case, TotalFlrSF), it can help create a linear relationship with the dependent variable. Variable transformation of the data, while perhaps less intuitive, assists in providing a more effective and useful model.
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   11722622         11722622      3327.03   <.0001
Error           2928   10316657         3523.44838
Corrected Total 2929   22039279

Root MSE        59.35864   R-Square   0.5319
Dependent Mean 416.26208   Adj R-Sq   0.5317
Coeff Var       14.25992

Parameter Estimates
Variable           DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept           1   30.16558             6.78295            4.45    <.0001
Sqrt_TotalFlrSF     1   10.11869             0.17543           57.68    <.0001
Based on the parameter estimates above, the equation is:
• sqrt_SalePrice = 30.16558 + 10.11869 × sqrt_TotalFlrSF
Thus, a one-unit increase in sqrt_TotalFlrSF increases the mean sqrt_SalePrice by approximately 10.12, holding all other predictors constant.
Comparison of Sqrt_SalePrice to log_SalePrice
ODS output with the variable transformations based on SQRT
Comparing sqrt_SalePrice to log_SalePrice, the Residual vs. Predicted Value and RStudent plots appear to have more of a fan shape than the log_SalePrice ODS plots. Additionally, the Q-Q plot has a different fit to the line, with sqrt_SalePrice appearing to have a heavier tail than log_SalePrice. The predicted values for sqrt_SalePrice appear more dispersed than for log_SalePrice and include more outliers. Cook's D has a very different scale, indicating that the log_SalePrice outliers have more influence than the sqrt_SalePrice outliers. We also see a small increase in the Adjusted R-Square, from 0.4944 for log_SalePrice to 0.5317 for sqrt_SalePrice. Finally, we see a larger F Value for sqrt_SalePrice, 3327.03 versus 2865.46 for log_SalePrice.
In comparing the residuals for sqrt_SalePrice versus log_SalePrice, a cluster of values still occurs. The log_SalePrice residuals appear more compressed, but both are very similar in how the outliers appear.
The sqrt_SalePrice fit plot appears more compressed, with fewer values outside the 95% prediction limits. However, it has significantly more outliers. The two are still very similar, but sqrt_SalePrice has more of a fan shape.
Log Improvements Conclusion
Utilizing variable transformations to create log_SalePrice, we have seen the impact on SalePrice and TotalFlrSF. The residual and predicted values changed to more of a fan shape when the values were transformed. Additionally, the log_SalePrice Q-Q plot fit the line significantly better with the log values than with the non-transformed values. The residuals for log_SalePrice and log_TotalFlrSF appeared much more random than in any of the other three models, and while they still do not appear completely random, as we would prefer, the difference is easy to spot. The Fit Plot also improved, with more data points remaining within the 95% prediction limits. The transformed models also had the highest Adjusted R-Square and F Values. Evaluating the continuous variables with log_SalePrice was interesting in how the regression line smoothed out when log_TotalFlrSF was used, versus other variables that appeared choppy with drastic changes in direction. A square root transformation of SalePrice was also conducted. While somewhat similar to log_SalePrice, there were some modest changes to the plots, including changes in the Residual vs. Predicted values, the tails in the Q-Q plot, and the Cook's D scale, along with a differently shaped cluster in the residuals plot and a fit plot that appears more compressed but has more outliers. Both the F Value (3327.03 versus 2865.46) and the Adjusted R-Square (0.5317 versus 0.4944) were larger for sqrt_SalePrice than for log_SalePrice.
Part B Step 5: Outliers
To assist in finding outliers, PROC UNIVARIATE with the normal and plot options was utilized:
proc univariate normal plot data=ames;
   var SalePrice;
   histogram SalePrice / normal;
run;
This allows us to see the outliers and ranges:


Based on the above values, SalePrice was evaluated using the 99% quantile value of $457,347 along with the 1% quantile value of $61,500. The outliers can easily be seen in the extreme high and low values of the distribution and probability plots, along with the quantiles table.
The two-standard-deviation rule states that roughly 95% of all observations will fall within two standard deviations of the mean. Suppose we were to apply this rule by locating all observations two or more standard deviations from the mean and removing them to minimize issues with non-normal distributions, regardless of whether the observations are legitimate. I do not believe it should be used here, as many values exceed two standard deviations from the mean. Removing all such data would result in a large loss of observations and would affect the bias, accuracy, and power of the results. Outliers that are illegitimately included in the data, i.e., errors, should be removed.
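The cost of the two-standard-deviation rule can be illustrated on simulated prices: with a roughly normal sample, about 5% of observations would be flagged. A Python sketch (the mean and SD are loosely matched to the report, not computed from the Ames data):

```python
import numpy as np

rng = np.random.default_rng(1)
prices = rng.normal(180796, 79887, 2930)   # simulated, not the actual data

z = (prices - prices.mean()) / prices.std()
flagged = int((np.abs(z) > 2).sum())       # observations beyond two SDs
share = flagged / prices.size              # close to 0.05 for normal data
```

For a sample of 2,930 houses, that rule would discard on the order of 130 observations even when every one of them is legitimate.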
Based on the above-described quantile values of 1% and 99%, a code was assigned to the observations with the lowest and highest values of SalePrice:


After applying the codes and removing the observations that met the above criteria, we can see the results. The outlier_def codes are as follows:
• Code 1 reflects SalePrice <= 61500
• Code 2 reflects SalePrice > 61500 & SalePrice < 457347
• Code 3 reflects SalePrice >= 457347
Using these extreme-value cutoffs, 60 observations were found (30 at each tail), and I was confident in removing them as they are a small percentage of the data.
outlier_def   Frequency   Percent   Cumulative Frequency   Cumulative Percent
          1          30      1.02                     30                 1.02
          2        2870     97.95                   2900                98.98
          3          30      1.02                   2930               100.00
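The coding scheme above can be sketched in a few lines; a Python version with hypothetical prices (the cutoffs are the report's 1%/99% values):

```python
import numpy as np

low, high = 61500, 457347                                  # 1% and 99% cutoffs
prices = np.array([50000, 61500, 160000, 457347, 500000])  # illustrative only

# Code 1: low tail (<= low); Code 3: high tail (>= high); Code 2: kept.
codes = np.where(prices <= low, 1, np.where(prices >= high, 3, 2))
kept = prices[codes == 2]   # only code-2 observations remain in the data
```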
(Distribution plots for outlier_def=1 and outlier_def=3 omitted.)
Quantiles for Normal Distribution
Percent    Observed    Estimated
   1.0     61500.0      -5048.18
   5.0     87500.0      49394.14
  10.0    105250.0      78417.14
  25.0    129500.0     126913.30
  50.0    160000.0     180796.06
  75.0    213500.0     234678.82
  90.0    281356.5     283174.98
  95.0    335000.0     312197.98
  99.0    457347.0     366640.30
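The "Estimated" column evaluates the fitted normal distribution at each percent. A Python sketch using the table's mean; the standard deviation (~79,887) is an assumption backed out from the spacing of the estimated quantiles, since PROC UNIVARIATE's sigma estimate is not shown here:

```python
from statistics import NormalDist

# Fitted normal: mu from the 50% estimated quantile; sigma recovered
# (approximately) from the other estimated quantiles in the table.
dist = NormalDist(mu=180796.06, sigma=79887)

est_99 = dist.inv_cdf(0.99)   # ≈ 366,640, matching the table
est_01 = dist.inv_cdf(0.01)   # ≈ -5,048: the normal fit is poor in the low tail
```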
Outliers Conclusion
As reflected above, the determination of which observations to remove was based on the SalePrice values at the 99% and 1% levels. In reality, we would not simply delete these values but would flag them as observations to be investigated and evaluated as to whether they are errors or legitimate data points. Observations that are errors would be removed, and legitimate observations would remain intact.
For the purposes of understanding the impact outliers can have on our model, we simply removed all of the values beyond this threshold. After removing the selected outliers, we compared three models (GrLivArea alone; TotalFlrSF and GrLivArea; and TotalFlrSF, GrLivArea, and MiscVal) to determine whether the removal of the outliers helped the model become more linear. In each case, the Adjusted R-Square and F Values decreased, but the plots improved, reflecting positive changes.
Step 6: Cleaned Data
With the outliers removed, we now have a new quantiles table and distribution plot for SalePrice.
Quantiles (Definition 5)
Level        Quantile
100% Max       455000
99%            405000
95%            320000
90%            275500
75% Q3         212900
50% Median     160000
25% Q1         130000
10%            107950
5%              91000
1%              75000
0% Min          62383
Going back to the analysis on GrLivArea and applying the same data analysis, this time with the cleaned data set, we have the following results for the highest-ranked variables:
Number in Model   R-Square   Adjusted R-Square       C(p)   Variables in Model
              1     0.4503              0.4501   574.7237   GrLivArea
              1     0.2179              0.2176   1815.424   MasVnrArea
              1     0.1274              0.1270   2298.385   YearBuilt
              1     0.1270              0.1267   2300.429   HouseAge
              1     0.0681              0.0677   2615.237   YearRemodel
              1     0.0636              0.0632   2638.890   TotalBath
              1     0.0608              0.0604   2654.258   GarageArea
Comparison to the Full Data Set
Evaluating the same variables as used with the full data set, starting with the variable GrLivArea:
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   30406                3352.95617        8.61     <.0001
GrLivArea    1   99.03272                2.25147       43.99     <.0001

Root MSE          59250   R-Square   0.4503
Dependent Mean   178239   Adj R-Sq   0.4501
Coeff Var      29.70718
Equation & Interpret each Coefficient
The new equation for GrLivArea is SalePrice = $30,406 + $99.03272 × GrLivArea, which reflects that for each one-unit increase in GrLivArea, the average SalePrice increases by $99.03. This seems reasonable and logical.
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2   6.604169E12      3.302084E12   1258.76   <.0001
Error           2867   7.520959E12      2623285390
Corrected Total 2869   1.412513E13

Root MSE          51218   R-Square   0.4675
Dependent Mean   178437   Adj R-Sq   0.4672
Coeff Var      28.70369

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept     1   13708                3958.59881        3.46     0.0005     0
TotalFlrSF    1   14.03133             1.91693           7.32     <.0001     1.01991
GrLivArea     1   96.26726             2.00021          48.13     <.0001     1.01991
Equation & Interpret each Coefficient
The new equation is:
• SalePrice = $13,708 + $14.03 × TotalFlrSF + $96.27 × GrLivArea
For each one-unit increase in TotalFlrSF, the SalePrice increases by $14.03, and for each one-unit increase in GrLivArea it increases by $96.27; both seem reasonable and logical. This equation differs from the full-data version, where the GrLivArea coefficient was negative.
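As a worked prediction from the two-regressor equation (the 1,500 and 2,000 sq ft inputs are hypothetical):

```python
# Coefficients from the cleaned-data TotalFlrSF + GrLivArea model above.
b0, b_flr, b_liv = 13708, 14.03, 96.27
predicted = b0 + b_flr * 1500 + b_liv * 2000   # ≈ $227,293
```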
Variables:
The three variables are TotalFlrSF, GrLivArea and MiscVal.
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3   6.607801E12      2.2026E12     839.75    <.0001
Error           2866   7.517327E12      2622933420
Corrected Total 2869   1.412513E13

Root MSE          51215   R-Square   0.4678
Dependent Mean   178437   Adj R-Sq   0.4672
Coeff Var      28.70176

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept     1   13634                3958.83629        3.44     0.0006     0
TotalFlrSF    1   14.21735             1.92331           7.39     <.0001     1.02684
GrLivArea     1   96.19807             2.00094          48.08     <.0001     1.02079
MiscVal       1   -2.16107             1.83648          -1.18     0.2394     1.00713
Equation:
• SalePrice = $13,634 + $14.21735 × TotalFlrSF + $96.19807 × GrLivArea − $2.16107 × MiscVal
For each one-unit increase in TotalFlrSF, the SalePrice increases by $14.22, a large decrease from the $186.44 estimate when the full dataset was utilized. Additionally, the $72.40 per-unit decrease in SalePrice for GrLivArea with the full dataset has changed to a $96.20 per-unit increase. Finally, SalePrice declines by $2.16 per unit increase in MiscVal, a smaller decline than the $9.15 decrease with the full dataset. MiscVal has little impact on SalePrice, so this value is logical.
Summary Table:

Cleaned Data                                                                        Adj R-Square   F Value   P Value
SalePrice = $30,406 + $99.03 × GrLivArea                                            0.4501         1934.76   <.0001
SalePrice = $13,708 + $14.03 × TotalFlrSF + $96.27 × GrLivArea                      0.4675         1258.76   <.0001
SalePrice = $13,634 + $14.22 × TotalFlrSF + $96.20 × GrLivArea − $2.16 × MiscVal    0.4672          839.75   <.0001

Full Data
SalePrice = $13,290 + $111.69 × GrLivArea                                           0.4994         2922.59   <.0001
SalePrice = $11,688 + $185.03 × TotalFlrSF − $71.69 × GrLivArea                     0.5106         1528.94   <.0001
SalePrice = $11,105 + $186.44 × TotalFlrSF − $72.40 × GrLivArea − $9.15 × MiscVal   0.5146         1036.17   <.0001
Analysis
GrLivArea: As we can see above, the goodness-of-fit decreased in each model, both in terms of the Adjusted R-Square and the F Value. The GrLivArea plots using the cleaned data still look quite similar to those from the full data set. The residuals still have a fan shape, reflecting that the constant-variance assumption is violated. The Q-Q plot has smoother tails but still departs from the line and still has outliers. Cook's D is similar, though its range has increased. The residuals are more compressed with fewer outliers, and the fit plot also has fewer outliers but is still fan shaped; overall, the plots look very similar, with minimal changes.
TotalFlrSF and GrLivArea show more noticeable changes, with some minor slimming in the tails of the Q-Q plot and fewer top outliers in the predicted versus actual SalePrice, with a bit more compression. Cook's D looks very similar, with only two points > .20. The residual histogram appears quite different, with more bars in the middle. Finally, the residuals by regressors have less of a funnel shape, appear more random, and have fewer outliers.
With TotalFlrSF, GrLivArea, and MiscVal, we again see slimmer tails in the Q-Q plot, and the predicted versus actual sale price appears more compressed with fewer outliers. The Cook's D scale changed significantly, from 3 to .4. The residual histogram again shows more bars in the middle. Finally, the regressors are less fan shaped and appear more random and compressed, with fewer outliers for TotalFlrSF, similar to GrLivArea. MiscVal has fewer outliers, with the remaining ones closer to the line.
Overall, while the Adjusted R-Square and F Values have decreased, the graphical plots with the cleaned data appear to be in better shape. While the alterations are not major, all of the small differences in each of the plots suggest that the relationships between the variables have improved. There is still room for improvement, but the simple removal of the data beyond the 99% and 1% marks has made some positive improvements. Thus, the process was worthwhile and provided some positive improvements, but we still have issues to address.
Step 7: Influential Points
Using a threshold of .69 for the DFFITS value and removing the observations that exceeded it, the data changes as follows:
(Before and after DFFITS-removal plots omitted.)
Thus, based on the above, eight observations were removed using the .69 threshold.
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             21   7.27839E12       3.4659E11     172.43    <.0001
Error           2334   4.691511E12      2010073476
Corrected Total 2355   1.19699E13

Root MSE          44834   R-Square   0.6081
Dependent Mean   178179   Adj R-Sq   0.6045
Coeff Var      25.16229
Additionally, the ODS output is as follows:
Thus, by removing the additional eight observations, we can see that the Predicted Value vs. Residual plot has fewer outliers and its scale has changed from 600,000 to 400,000. Additionally, in the Q-Q plot, the outliers at the bottom of the plot have disappeared. Similarly, the Predicted Value vs. SalePrice plot also shows a change in scale from 600,000 to 400,000, with the observations now more compressed along the 45-degree line. The Cook's D plot scale has changed from .15 to a maximum of .025, with one value at the maximum and most values less than .015. We also see the Adjusted R-Square increase to .6045 from .5576, with an F Value of 172.43.
Overall, the removal of these additional eight points has again made a substantial improvement to the model, making it more linear.
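DFFITS itself can be computed by hand: DFFITS_i is the externally studentized residual times sqrt(h_ii / (1 − h_ii)), where h_ii is the leverage. A Python sketch on simulated data (the report's SAS run applied the .69 cutoff to the Ames models; the data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(500, 4000, n)
y = 30000 + 100 * x + rng.normal(0, 20000, n)   # simulated price relationship

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages h_ii
p = X.shape[1]

s2 = resid @ resid / (n - p)                              # full-sample MSE
s2_i = ((n - p) * s2 - resid**2 / (1 - h)) / (n - p - 1)  # leave-one-out MSE
t_ext = resid / np.sqrt(s2_i * (1 - h))                   # external studentization
dffits = t_ext * np.sqrt(h / (1 - h))

flagged = np.abs(dffits) > 0.69                 # the report's threshold
```

The leave-one-out variance is what distinguishes DFFITS from an ordinary standardized residual: each point is judged against a fit that excludes it, so a single extreme house cannot mask its own influence.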
Step 8: Conclusion
Overall, transforming variables and detecting and deleting outliers are two effective ways to create a better-fitting model. Naturally, understanding the goals of the model is key in determining whether they should be employed and to what degree. Additionally, the use of these techniques is more of an art than a science. If the model already meets the needs of its purpose, then perhaps these methods are not required. However, if the model is not adequate, then utilizing these techniques to make the model more linear is appropriate.
Employing a process to help determine which observations are outliers requiring evaluation can be easily accomplished, as outlined within this report. While we have been heavy-handed in the removal of outliers, in reality each outlier needs to be evaluated to see whether it is valid or an error. If an observation is an error, it should be removed. However, removing legitimate observations will distort the model, so they should not be removed.
If done correctly, transformation and deletion of outliers will benefit the model by creating a better, more appropriate model that is more linear. However, they require additional processes and a certain finesse to determine the appropriate threshold values. As easily observed from the above discussion, the choice of thresholds can drastically change the number of observations affected and needs to be carefully selected and employed.
In summary, in Part A of this report we appended new variables, log_SalePrice and log_TotalFlrSF. We then paired these variables in four different models: TotalFlrSF to predict SalePrice, log_TotalFlrSF to predict SalePrice, TotalFlrSF to predict log_SalePrice, and log_TotalFlrSF to predict log_SalePrice. For each combination of variables, an equation and an interpretation of the coefficients were provided, along with an analysis of the SAS-generated Output Delivery System (ODS) plots. Additional metrics such as the Adjusted R-Square and F Value were included in determining the fit of each model. We found that the variable pair of log_SalePrice and log_TotalFlrSF was the 'best' fitting model based on an in-depth analysis of both the ODS plots and the goodness-of-fit statistics. We also correlated TotalFlrSF with log_SalePrice and evaluated how well the observations fit the line.
Next, we evaluated the conditions under which the log transformation improved the fit and expanded the analysis to include a square root transformation. Again, an analysis of the ODS plots was provided along with goodness-of-fit measures such as the Adjusted R-Square, which reflected improvements for sqrt_SalePrice over log_SalePrice. We observed the differences in the residual and fit plots, which were minor but still showed some improvements for sqrt_SalePrice over log_SalePrice.
In the second part of the report, we focused on outliers. We chose the 99% and 1% levels as the basis for determining which observations to remove, based on the distribution and probability plots of SalePrice. As mentioned above, in reality we would not be this heavy-handed in their removal; instead, each observation would be evaluated to determine whether it is an error or legitimate. However, for the purposes of understanding the impact outliers can have on our model, we simply removed all of the values beyond these thresholds. After removing the selected outliers, we had a dataset of 2,870 observations, on which we compared GrLivArea alone; TotalFlrSF and GrLivArea; and TotalFlrSF, GrLivArea, and MiscVal, to determine whether the removal of the outliers helped the model become more linear. In each case, the Adjusted R-Square and F Values decreased, but the plots improved, reflecting positive changes.
Finally, we undertook another method of removing outliers based on DFFITS, a statistic that reflects how influential an observation is within a regression. Based on our threshold of .69, we removed eight observations, reducing our dataset to 2,862. While the number of observations removed is minimal, the changes in the various ODS plots were easy to spot, and the Adjusted R-Square improved to .6045. Thus, removing these influential observations clearly had an impact on making the model more linear.
Next steps could include an evaluation of which regressors have the largest influence on the regression equation and ensuring the chosen variables are logical. Additionally, obtaining additional data to validate the model would be helpful in understanding how well the model performs in predicting sale prices. In some cases, data is split so that one portion is used to build the model and the remainder to determine how predictive it is. While there are pros and cons to such methods, in our study of the Ames housing data we did not employ this technique, and hence additional data would be helpful in determining the performance of our model.