Data Variables and Analytical Models

Data Variables and Analytical Models

Before diving in to a statistical analysis of any dataset, spending the requisite time to understand the data, checking the quality and taking a look ‘under the dash’ is essential.  Below, we will examine the data variables and analytical models on a housing prices as a first step in predicting the sale price of houses in Ames, Iowa.  The dataset is comprised of approx. 45 categorical variables i.e. does the house have a fence or lot shape, approx. 20 continuous variables including living area, basement area and approximately 15 in between variables such as year sold.  The full dataset is here and the definitions on the coding can be found here

Evaluating the data variables and analytical models, we will take a processed approached at evaluating the variables with the intent to developing a model to predict sales price in Ames, Iowa.  We’ll start off by looking the variables and what they represent and evaluating the quality of the data.  A search for ‘fishy’ values and outliers will then be conducted and new variables (TotalSqFt) will created to be used throughout out model.  A evaluation of the correlation of variables will be conducted to determine linear relationships along with a plot of the highest and lowest correlated variables to provide a perspective of the different patterns.  A Locally Estimated Scatter plot Smoother (LOESS) will also be provided.  A categorical variable analysis will also be provided for comparison and again, to determine whether a linear relationship exists.     Finally, a brief summary with additional insights on the potential concerns of the data will be provided.

Sales Price

Sales prices that are out of range or fishy:

• All housing prices are greater than zero
• Low values are in line with the age of the house, square footage, overall quality rating, etc.

There is really no benefit of printing all 2,930 observations when one can sort, select, graph select different records for various purposes and review the summary of the data.
A few outliers that should be reviewed:

• A 334 sq ft home seems unusually small and could be missing a number and thus, should be 1,334 (SID 1556 / PID 911102170).
• Garage built in 2011 yet the latest sales date is 2010 (SID 398/ PID 527405130)
• Zoning has 168 non-residential records that include agricultural land, floating village and industrial sales. Agricultural land and industrial sales should likely be excluded as they do not meet our ‘housing’ criteria. Additionally, a floating village needs to be carefully considered as the sales price is likely to be very different when compared to traditional home ownership – there is no land, lot frontage, garage, etc. and the type of construction and on-going costs are likely to be very different.
• A large portion of the sales had centralair with 197 sales without.
• Exterior1 has houses with ‘Asbestos Shingles’ which could have implications on the sales price
• ‘LotFrontage’ is missing from 490 of the 2,930 records and thus, should be careful using this variable in determining price (see below for a small sample). Based on the ‘LotArea’ there should be values and are not null values except when validated such as a floating village (see above).

• ‘OverallQual’ has a maximum value of 7 with a range of 1 to 10.
• One record is missing basement data (SID 1342/ PID 903230120)
• The ‘TotalFlrSF’ and ‘SalePrice’ seems fishy for some records:
o Although the first record shown below, seems to be an error as the ‘BsmtFinSF1’ exceeds the ‘GrLivArea’ and does not add up to the ‘TotalFlrSF’
o Houseage is -1 which could be legitimate i.e. it is a pre-sale but should be confirmed especially given the SalesPrice and TotalFlrSF

Conclusion on Sale Prices

There are a variety of variables and observations that should be carefully reviewed to ensure their validity and application to the sales price model. I find the houses zoned as floating village to be a concern due to their variance from normal houses and related maintenance and/or moorage costs. No surprise, some of these floating homes are related to the missing Lot Frontage data. Additionally, houses values with Asbestos could be negatively impacted and should be noted as something to watch for especially with new environmental laws and the potential additional health hazards that are not yet known. Additional items to evaluate further for their validity have been mentioned and should be carefully considered before proceeding.

Pearson Coefficients

Produce Pearson Correlation coefficients and scatterplot matrix of the potential continuous predictor variables with the sales price.
The predictor variables with the strongest relationship with the response variable are as follows:

Comments on the predictor variables with the strongest linear relationships with the response variable Y

  • TotalFlrSF – **added** Logical result and association with sales price
  • FirstFlrSF & SecondFlrSF – Logical result and association and is related to TotalFlrSF
  • GrLivArea & TotalBsmtSF – Related to TotalFlrSF and therefore should be careful to combine multiple variables that measure the same things.
  • TotalFullBath – Logical result and association with sales price
  • Houseage – Again, a negative correlation is logical along with the result.
  • YearBuilt – Related to houseage and therefore, need to be careful not to combine variables measuring the same things
  • YearRemodel & TotRmsAbvGrd – Logical result and association with sales price
  • MsnVnrArea – Logical association with sales price
  • BsmtFinSF1 & BsmtFinSF2 – With TotalFlrSF this is a related variable
  • LowQualFinSF – Logical result and association with sales price
  • LotFrontage – Logical association with sales price
  • WoodDeckSF – Logical but would have thought other variables would have been more important
  • LotArea – Logical association with sales price but also related to LotFrontage
  • TotalHalfBath & TotalFullBath – **added** Logical association with sales price and overall ranking
  • TotalBath – **added** Logical but is associated with several other variables so need to ensure this is kept in mind with analyzing
  • TotalPorchSF – **added** Logical association with sales price and overall ranking
  • BsmtUnfSF – Logical result
  • BedroomAbvGr – I would have thought this to be a larger number but still significant
  • PoolArea – Logical result
  • MoSold & YrSold– Results are logical with the time of year not playing a significant factor in price
  • MiscValue – Results are logical

Notice about the Relationship between the numeric correlation measure and graphical relationship in the Scatterplot?

In almost all variables, there are outliers which could impact the results but in all cases, there appears to be some type of relationship and not just complete randomness.  However, outliers need to be analyzed and determined if corrective action is required as they can impact the slope and least-squares fit.

Best Predictor:

I think total square footage will be the best predictor.  It is not a particularly large city with a population around 60,000 and the city area is not all that large so neighborhoods (and in turn, housing prices) are not likely impacted by geographic issues such as bridges to cross or limited by space i.e. land is limited due to a large portion being surrounded by water and thus, buyers are more likely to focus on the size of the living area and quality of the house rather than location.  With the square footage, owners have the option of renovating to use the space differently but houses with asbestos are less likely to be renovated due to the additional costs and associated health hazards.

Worst Predictor:

I think the worst predictor will be the type of materials in the alleyway.  I think the type of materials in the alley are not a value item that people will pay more or less for.  I think they will be more focused on assessing the value of the house based on its quality.  Additionally, a lot of the homes may not even have an alleyway.


See above bullet list on the association between the variables.  There are high correlations in the set of potential predictor variables as reflected above with TotalFlrSF at .71359, GrLivArea at .70678, GarageArea at .64040, TotalBsmtSF at .63228, FirstFlrSF at .62168, TotalFullBath at .592, houseage at -.55891, YearBuilt at .55843, YearRemodel at .53297.

Correlation coefficient sufficient to make a decision regarding predictor variables?

No, because correlation does not mean causation. A strong relationship between variables does not mean they are related in any way.  The example in the readings of the relationship between mental defectives, radio licenses and first name of US Presidents is a great example (Montgomery, D.  2012.  Introduction to Linear Regression Analysis.  P. 44).

Conclusion on Pearson Coefficients

The highest correlation variables are no surprise with square footage being a priority in terms of Total, Living, Basement, and First Floor.  Garage Area, number of bathrooms and the age of the house and when it was remodeled are also important.  Unused variables or variables that may be confusing have the lowest correlation such as BsmtFinSF2 and Misc Val.  The simple statistics overview provides a nice summary and allows for a quick review of all of the variables which allows us to spot potential areas to investigate such as a house age with -1 year.  While this data may be legitimate i.e. a prebuilt house, verifying its validity along with any other outliers should be completed to ensure the best possible results.  The scatter plot matrix is a great way to view the outliers.  The predictor values all seem logical and related to some degree but need to be careful on using multiple associated variables such as house age and year built.  Naturally, just because there is a high correlation between variables it doesn’t mean causation so one needs to always be cautious of this and ensure the readers of any related outcomes understand its importance.


Scatter plot for the highest, lowest and closest correlation to .5.

Conclusion on Plots

The scatter plots for TotalFlrSF, BsmtFinSF2 and TotRmsAbvGrd are all really helpful in determining the patterns within the various variables and on the SalePrice.  It is easy to see how the correlation in TotalFlrSF is so much higher than BsmtFinSF2.  Additionally, they also help spot outliers.  The TotRmsAbvGrd line is more difficult to visualize and hence, how LOESS will be helpful to utilize.



Interested in the LOESS Scatterplots and what they show us:

We are interested in Loess because it provides smoother results that minimizes the variance of the residuals or prediction error.  It reduces boundary bias and can reduce the impact outliers has on the results.  Loess uses a formula to specify the response of a single predictor variable while the predict function retrieves the fitted values (UOregon, 2017.  Retrieved from  WE use LOESS to smooth data values where it can be difficult to see relationships due to weak interrelationships or our ability to see a line that best fits the data or when the least squares doesn’t create a line with a good fit to the data.  As we can see Total Sq Ft increases with price but hits a peak whereby it becomes less of a factor in the price but this is based on a few data points that still need to be investigated to ensure their validity.

Conclusion on LOESS

The LOESS scatterplot is really helpful in seeing interrelationships that may be difficult to see within a chart.  Additionally, it reduces the impact of the outliers which are currently reflected within the data and the associated charts.

Categorical Variables

The plots of the categorical variables:

Conclusion on Categorical Variables

Viewing a variety of histograms on data variables and analytical models and specifically, the categorical data is helpful in seeing the popularity of categories along with the frequency.  This will allow us to choose categories that are more likely to have an impact on the sales price over other variables that have lesser diversity and not likely to be able to adequately impact the sales price for instance, the type of alleyway.   However, we still need to be aware of the data, market trends and other economic issues that may be impacting any trends.  For example, perhaps a neighborhood such as North Ames had a new building development targeting social housing that was subsidized by the government. Thus, effectively reducing the actual sales price.  Naturally, we can use other variables to validate such programs but is used an example to ensure that we don’t just look at a graph with an extraordinary high category and assume all things are equal.

Sort:  Three Categorical Variables

Data Variables and Analytical Models based on three categorical values as per the histograms above.

In terms of the neighborhood, we generally increased values in neighborhoods in high demand – specifically North Ames, College Creek and Old Town.  Similarly, we see increased prices for one and two-story houses in style.  Finally, we similar increased values in the types of foundation with cinder block and poured concrete.

Conclusion on Three Categorical Variables

Beyond looking at just the histograms, reviewing the actual categorical data with the volume and related mean, standard deviation, minimum and maximum numbers it is easy to understand the validity of the results or to provide perspective on those categories that are unlikely to play a large role in the sales price i.e. a low volume with a high mean, min., max. and range of values.  It is also easy to determine where the mean of one category is significantly different from another category which may impact demand and in turn, sales price.

Graphics and Summaries

Below are a few samples of the summary statistics and charts.

Sales Price Summary Statistics



The charts and summation of data is interesting.  The mean price is $180,796 and a median of $160,000 with a standard deviation of $79,887.  Standard error mean is $1,475.  The standard deviation on the size of the lot is 7,880 sq ft which I think is quite high.  TotalFlrSF has a mean of 1,495, a median of 1,440 with a standard deviation of 503.  Mean age of a house is 36.4 years with a median of 34 and a standard deviation of 30.  Nothing really surprising jumps out of this data and is aligned with the prior analysis and discussion.

Conclusion on Graphics and Summaries

The charts in SAS are surprisingly great with a lot of useful functionality and variety.

Data Variables and Analytical Models Summary

The data variables and analytical models including the EDA do not suggest any potential difficulties or concerns in build the model.  There are multiple variables that are related to response Y, SalePrice and should assist in building a valid model that predicts housing prices.  All of the issues raised in the above discussion including outliers and variables that may need careful consideration such as asbestos can be addressed in the normal model building process.  Additionally, being aware of the associations between variables have been highlighted to help reduce related errors.  I’m not familiar enough yet with transformations to suggest that they be employed to stabilize the variance but from my prior class, believe that they might be useful to stabilize the variances.

Using SAS to review and analyze the data is really helpful and found that gaining insights can be quite easily achieved either via the sort, statistical summaries and various graphical methods.  The combination is very powerful.

Leave a Reply