Principal Components Analysis
This analysis uses a stock portfolio data set to demonstrate Principal Components Analysis (PCA) both as a dimension-reduction method and as a remedial measure for multicollinearity in Ordinary Least Squares regression. Beginning with the data, we transform the closing prices into log-returns so that the variation in the log-returns of the individual stocks can explain the variation in the log-returns of the market index. Next, correlations between the individual stocks and the market index are computed, the results are transposed into a long format, and several plots are used to visualize the correlations. The principal components are then analyzed to determine how many to keep, and the data are split into training and test sets. A regression model fit to the individual stock returns is assessed for goodness-of-fit and for the presence of multicollinearity. A second model using eight principal components is then fit, its variance inflation factors are examined, and the two models are compared to determine which is best.
Part 1: Data & Transformation of Data
Using the stock_portfolio data set, we load the data into SAS, where we can view the daily closing prices for 20 stocks and a Vanguard index fund (VV) from Jan 3, 2012 through Dec 31, 2013. New variables are created to normalize the data and calculate the daily return for each stock using the log function, for example return_AA = log(AA/lag1(AA)), where lag1 gives the prior day's closing price. The response (dependent) variable, response_VV, is the log-return of the market index, whose variation we explain using the log-returns of the individual stocks. If we did not transform the stock prices, each stock would remain at its closing dollar value, which provides no common basis for comparing daily returns, as the 'measuring stick' would differ for each stock.
Using the log-returns of each stock, we will explain the variation in the log-returns of the market index with both linear regression and principal components analysis.
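A minimal sketch of this transformation step in SAS, assuming the raw prices are in a data set named stock_portfolio and showing only a few of the 20 tickers:

data return_data;
   set stock_portfolio;
   /* daily log-return: log of today's close over the prior day's close */
   return_AA  = log(AA  / lag1(AA));
   return_BAC = log(BAC / lag1(BAC));
   return_BHI = log(BHI / lag1(BHI));
   /* ...repeat for the remaining tickers... */
   response_VV = log(VV / lag1(VV));
run;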
The loaded data set has 502 observations and 22 variables. A quick overview:
Obs | Date | AA | BAC | BHI | KO | MMM | MPC | VV | return_AA | return_BAC | return_BHI | return_CVX | return_DD | return_DOW | response_VV |
1 | 03JAN2012 | 9.23 | 5.8 | 51.02 | 35.07 | 83.49 | 33.41 | 58.18 | . | . | . | . | . | . | . |
2 | 04JAN2012 | 9.45 | 5.81 | 51.53 | 34.85 | 84.18 | 33.76 | 58.25 | 0.023556 | 0.001723 | 0.009946 | -0.001723 | 0.010906 | 0.005356558 | 0.001202439 |
3 | 05JAN2012 | 9.36 | 6.31 | 50.82 | 34.68 | 83.8 | 31.92 | 58.44 | -0.009569 | 0.082555 | -0.013874 | -0.009850 | -0.006829 | 0.006323869 | 0.003256494 |
4 | 06JAN2012 | 9.16 | 6.18 | 51.26 | 34.46 | 83.37 | 31.66 | 58.32 | -0.021599 | -0.020817 | 0.008621 | -0.007267 | -0.014234 | 0.005954368 | -.002055499 |
5 | 09JAN2012 | 9.42 | 6.27 | 51.58 | 34.46 | 83.87 | 30.96 | 58.45 | 0.027989 | 0.014458 | 0.006223 | 0.010836 | 0.008435 | -.000329870 | 0.002226600 |
Part 2: Correlation
The response variable (response_VV) is then correlated with each of the predictors using the Pearson correlation. The output shows that all correlations are highly significant, with the lowest value of 0.44350 for return_DPS and the highest of 0.76838 for return_HON. Each test evaluates whether the correlation equals zero; with every p-value below .0001, we reject that hypothesis and conclude that the correlations are not due to chance.
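A sketch of the correlation step, assuming the log-returns live in return_data; the colon suffix return_: expands to every variable whose name begins with return_:

proc corr data=return_data;
   with response_VV;   /* correlate the index return... */
   var return_: ;      /* ...against every stock's log-return */
run;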
Pearson Correlation Coefficients, N = 501, Prob > |r| under H0: Rho = 0 (all p-values < .0001)
 | return_AA | return_BAC | return_BHI | return_CVX | return_DD | return_DOW | return_DPS
response_VV | 0.63241 | 0.65019 | 0.57750 | 0.72090 | 0.68952 | 0.62645 | 0.44350
 | return_GS | return_HAL | return_HES | return_HON | return_HUN | return_JPM | return_KO
response_VV | 0.71216 | 0.59750 | 0.61080 | 0.76838 | 0.58194 | 0.65785 | 0.59980
 | return_MMM | return_MPC | return_PEP | return_SLB | return_WFC | return_XOM
response_VV | 0.76085 | 0.47312 | 0.50753 | 0.69285 | 0.73357 | 0.72111
Obs | correlation | tkr |
1 | 0.63241 | AA |
2 | 0.65019 | BAC |
3 | 0.57750 | BHI |
4 | 0.72090 | CVX |
5 | 0.68952 | DD |
6 | 0.62645 | DOW |
7 | 0.44350 | DPS |
8 | 0.71216 | GS |
9 | 0.59750 | HAL |
10 | 0.61080 | HES |
11 | 0.76838 | HON |
12 | 0.58194 | HUN |
13 | 0.65785 | JPM |
14 | 0.59980 | KO |
15 | 0.76085 | MMM |
16 | 0.47312 | MPC |
17 | 0.50753 | PEP |
18 | 0.69285 | SLB |
19 | 0.73357 | WFC |
20 | 0.72111 | XOM |
Part 3: Transposing to a Long Table with Stock Ticker & Correlation
This presents the same correlations as Part 2 but in 'long' format, with the ticker extracted as a separate variable in SAS via tkr = substr(_NAME_,8,3); here 8 is the starting position and 3 is the length of the ticker symbol. While the correlation values do not change, the long format makes it easier to scan all of the values, with the understanding that all of the p-values already meet our required significance level.
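A sketch of the transpose, assuming the single-row correlation output from PROC CORR is in a data set named wide_corr (the data set names here are assumptions):

proc transpose data=wide_corr out=long_corr(rename=(col1=correlation));
run;

data long_corr;
   set long_corr;
   /* _NAME_ holds values like 'return_AA'; the ticker begins at position 8 */
   tkr = substr(_NAME_, 8, 3);
   keep tkr correlation;
run;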
Part 4: Adding Sector Information
Similar to Part 3, we are adding information to the table; in this part, the industry sector. The SAS code is again similar to Part 3, using "input tkr $ 1-3 sector $ 4-35;". VV is the market index, whereas all other rows are ticker symbols with their sectors. Including the sector codes provides context and insight: we can now see that the lowest-correlated stock, DPS, is in the Soft Drinks sector, and scanning the list for other soft-drink tickers shows PEP with a similarly low correlation of 0.50753 and KO at 0.59980. Likewise, the highest-correlated stock noted above, HON (0.76838), is in the Manufacturing sector and is close to the other manufacturing stock, MMM, at 0.76085.
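A sketch of the sector lookup and merge, assuming the long correlation table from Part 3; only the first few datalines are shown, and the data set names are assumptions:

data sector_info;
   input tkr $ 1-3 sector $ 4-35;  /* fixed-column input, as in the report */
   datalines;
AA Industrial - Metals
BAC Banking
BHI Oil Field Services
;
run;

proc sort data=long_corr;   by tkr; run;
proc sort data=sector_info; by tkr; run;

data corr_sector;
   merge long_corr sector_info;
   by tkr;
run;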
Obs | correlation | tkr | sector |
1 | 0.63241 | AA | Industrial – Metals |
2 | 0.65019 | BAC | Banking |
3 | 0.57750 | BHI | Oil Field Services |
4 | 0.72090 | CVX | Oil Refining |
5 | 0.68952 | DD | Industrial – Chemical |
6 | 0.62645 | DOW | Industrial – Chemical |
7 | 0.44350 | DPS | Soft Drinks |
8 | 0.71216 | GS | Banking |
9 | 0.59750 | HAL | Oil Field Services |
10 | 0.61080 | HES | Oil Refining |
11 | 0.76838 | HON | Manufacturing |
12 | 0.58194 | HUN | Industrial – Chemical |
13 | 0.65785 | JPM | Banking |
14 | 0.59980 | KO | Soft Drinks |
15 | 0.76085 | MMM | Manufacturing |
16 | 0.47312 | MPC | Oil Refining |
17 | 0.50753 | PEP | Soft Drinks |
18 | 0.69285 | SLB | Oil Field Services |
19 | 0.73357 | WFC | Banking |
20 | 0.72111 | XOM | Oil Refining |
While the table above with the sector column provides much more context, the results are even easier to take in visually, with each bar showing the ticker symbol and correlation value.
However, in that vbar chart, matching the sectors to each stock ticker is not very efficient; it is easier for the reader when the sectors are grouped together.
In the grouped vbar chart, the reader can easily compare the stock tickers within each industry and get a feel for the correlations between each stock and the Vanguard index (VV). For instance, the manufacturing correlations are very similar, whereas the oil refining stocks have the widest range. Even so, it is difficult to estimate the mean correlation for each sector from the chart alone.
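A sketch of the grouped bar chart and the per-sector means, assuming the merged table from Part 4 is named corr_sector:

/* bar chart of correlations, colored by sector */
proc sgplot data=corr_sector;
   vbar tkr / response=correlation group=sector;
run;

/* mean correlation per sector, as in the table below */
proc means data=corr_sector noprint;
   class sector;
   var correlation;
   output out=sector_means(where=(_type_=1)) mean=mean_correlation;
run;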
Obs | sector | Freq | mean_correlation |
1 | Banking | 4 | 0.68844 |
2 | Industrial – Chemical | 3 | 0.63264 |
3 | Industrial – Metals | 1 | 0.63241 |
4 | Manufacturing | 2 | 0.76461 |
5 | Oil Field Services | 3 | 0.62262 |
6 | Oil Refining | 4 | 0.63148 |
7 | Soft Drinks | 3 | 0.51694 |
Thus, summarizing the sectors in both a table and a vbar chart makes it easy to compare the different sectors and, with the frequency column, to understand how many stocks fall within each sector.
As we can see, the highest mean correlation belongs to Manufacturing at 0.764 and the lowest to Soft Drinks at 0.517.
Part 5: Principal Components Analysis
Eigenvalues of the Correlation Matrix
Component | Eigenvalue | Difference | Proportion | Cumulative
1 | 9.63645075 | 8.09792128 | 0.4818 | 0.4818 |
2 | 1.53852947 | 0.19109235 | 0.0769 | 0.5587 |
3 | 1.34743712 | 0.39975791 | 0.0674 | 0.6261 |
4 | 0.94767921 | 0.15217268 | 0.0474 | 0.6735 |
5 | 0.79550653 | 0.12909860 | 0.0398 | 0.7133 |
6 | 0.66640793 | 0.10798740 | 0.0333 | 0.7466 |
7 | 0.55842052 | 0.04567198 | 0.0279 | 0.7745 |
8 | 0.51274854 | 0.01590728 | 0.0256 | 0.8002 |
9 | 0.49684126 | 0.03250822 | 0.0248 | 0.8250 |
10 | 0.46433304 | 0.03089374 | 0.0232 | 0.8482 |
11 | 0.43343929 | 0.02568332 | 0.0217 | 0.8699 |
12 | 0.40775598 | 0.05667006 | 0.0204 | 0.8903 |
13 | 0.35108592 | 0.01597897 | 0.0176 | 0.9078 |
14 | 0.33510695 | 0.03813712 | 0.0168 | 0.9246 |
15 | 0.29696984 | 0.02068234 | 0.0148 | 0.9394 |
16 | 0.27628750 | 0.01692712 | 0.0138 | 0.9532 |
17 | 0.25936037 | 0.01730228 | 0.0130 | 0.9662 |
18 | 0.24205809 | 0.02020002 | 0.0121 | 0.9783 |
19 | 0.22185807 | 0.01013445 | 0.0111 | 0.9894 |
20 | 0.21172363 | | 0.0106 | 1.0000
Creating a new data set that excludes the dependent variable (the market index), we can calculate the principal components. Ignoring the response variable and focusing on the 20 predictor return variables, we reduce the dimensionality of the data using principal components, evaluating only the return data. Because every predictor shares the return_ prefix, SAS lets us match all of the log-return variables at once, and we use these to calculate the eigenvectors. The number of eigenvectors will match the number of variables.
proc princomp data=return_data out=pca_output outstat=eigenvectors plots=scree(unpackpanel);
   var return_: ; run;
'out=' contains all of the outputs from the Principal Components Analysis, including the component scores for every observation, whereas 'outstat=' contains only specific statistics, in this case the eigenvectors. The name Prin1 needs to be read carefully, as the same term appears later attached to all observations: in the outstat data set, Prin1 holds the weights (the eigenvector) of the first principal component, with one weight per variable, and each component score is the linear combination of the variables using those weights.
The variance-explained plot shows the proportion of variance explained by each number of principal components, and is used to choose how many components to keep to reach the desired level of explained variance.
Determining the 'right' number of components to keep depends on the desired threshold for explained variation. If the threshold were 80%, eight principal components would be required; to explain 70% of the variation we would keep five principal components, and to explain 90% we would keep 13. Assuming 80% is our threshold, we will keep eight principal components.
Evaluating whether there are clusters in the data, a review of the principal component scores reflects that they are uncorrelated with one another. Thus, the principal component scores can be used as another set of variables without any multicollinearity issues, as there is no correlation between them.
Graphically, we can view the first two principal component eigenvectors to evaluate whether there is any relationship within the data, usually by reviewing the plot quadrant by quadrant.
In this plot, there is a cluster of stock tickers in the first quadrant from the soft drink industry; these are heavily weighted on Prin2 but lightly weighted on Prin1. The other industries sit mostly toward the fourth quadrant, weighted more heavily on Prin1 than the soft drink stocks. It is no surprise that the manufacturing tickers plot close together.
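A sketch of how such a plot could be produced from the outstat= data set, where eigenvector rows carry _TYPE_='SCORE' and _NAME_ values Prin1, Prin2, and so on:

/* flip the Prin1 and Prin2 weight rows to one row per return variable */
proc transpose data=eigenvectors(where=(_TYPE_='SCORE' and _NAME_ in ('Prin1','Prin2')))
               out=loadings;
   id _NAME_;
run;

/* scatter the weights, labeling each point with its variable name */
proc sgplot data=loadings;
   scatter x=Prin1 y=Prin2 / datalabel=_NAME_;
   refline 0 / axis=x;
   refline 0 / axis=y;
run;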
Part 6: Regression Model using Training and Test Data Sets
We now build regression models with the market index return as the response, using both the original variables and the principal components, to determine which model predicts better: the principal components or the raw data. The data are split 70/30 into training and test sets to allow cross-validation; the accompanying table reflects the observations used in the training data set.
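A sketch of the 70/30 split, assuming the component scores are in pca_output; the data set name model_data, the train flag, and the seed are assumptions:

data model_data;
   set pca_output;
   /* flag roughly 70% of observations for training */
   if ranuni(20130) < 0.70 then train = 1;
   else train = 0;
run;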
Part 7: Build a Regression Model
Using the Training Data Set:
Using the training data to build the model and the test data to evaluate it, the diagnostics for the regression model on the raw predictor variables are summarized below.
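A sketch of the fit, assuming the model_data split above; the VIF option requests the variance inflation factors shown in the parameter estimates table:

proc reg data=model_data plots=diagnostics;
   where train = 1;                      /* fit on the training rows only */
   model response_VV = return_: / vif;   /* all 20 raw log-returns */
run;
quit;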
The output reflects that 20 predictors are used, with an Adjusted R-Square of .8919, which is good. The F value of 140.04 will be used to compare against the other models, and the p-value is still within our threshold at <.0001.
The parameter estimates:
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| | Variance Inflation
Intercept | 1 | 0.00008640 | 0.00014092 | 0.61 | 0.5403 | 0 |
return_AA | 1 | 0.01769 | 0.01317 | 1.34 | 0.1802 | 2.11490 |
return_BAC | 1 | 0.03198 | 0.01165 | 2.75 | 0.0064 | 3.10927 |
return_BHI | 1 | -0.00111 | 0.01323 | -0.08 | 0.9333 | 2.62997 |
return_CVX | 1 | 0.04907 | 0.02536 | 1.93 | 0.0539 | 3.07524 |
return_DD | 1 | 0.04674 | 0.02037 | 2.29 | 0.0224 | 2.51406 |
return_DOW | 1 | 0.03642 | 0.01162 | 3.14 | 0.0019 | 1.88893 |
return_DPS | 1 | 0.03670 | 0.01679 | 2.19 | 0.0295 | 1.54768 |
return_GS | 1 | 0.04849 | 0.01555 | 3.12 | 0.0020 | 3.10450 |
return_HAL | 1 | 0.00948 | 0.01466 | 0.65 | 0.5184 | 3.08758 |
return_HES | 1 | 0.00359 | 0.01092 | 0.33 | 0.7425 | 2.10199 |
return_HON | 1 | 0.12213 | 0.01924 | 6.35 | <.0001 | 2.73505 |
return_HUN | 1 | 0.02712 | 0.00836 | 3.24 | 0.0013 | 1.79852 |
return_JPM | 1 | 0.00902 | 0.01708 | 0.53 | 0.5979 | 3.36439 |
return_KO | 1 | 0.07903 | 0.02226 | 3.55 | 0.0004 | 1.93633 |
return_MMM | 1 | 0.09796 | 0.02646 | 3.70 | 0.0003 | 2.98277 |
return_MPC | 1 | 0.01673 | 0.00809 | 2.07 | 0.0394 | 1.32999 |
return_PEP | 1 | 0.02911 | 0.02231 | 1.30 | 0.1929 | 1.68825 |
return_SLB | 1 | 0.03776 | 0.01709 | 2.21 | 0.0279 | 3.13690 |
return_WFC | 1 | 0.07587 | 0.01848 | 4.10 | <.0001 | 2.59492 |
return_XOM | 1 | 0.05467 | 0.02697 | 2.03 | 0.0435 | 2.98393 |
Reviewing the table, the overall F value is highly significant while several of the individual t-values are not, which raises the possibility of multicollinearity. We therefore evaluate the Variance Inflation Factor (VIF), which measures the dependencies among the regressors. VIF values that exceed 5 or 10 indicate that the associated regression coefficients are poorly estimated due to multicollinearity. With our highest value at 3.36439, well below those cutoffs, we do not have a multicollinearity issue, although some of the p-values remain non-significant.
As the diagnostic plots show, the data fit the line well in the Q-Q plot, Cook's D values are low with a maximum of .15, and the residuals are scattered in a random manner.
Using ALL of the Data:
Using all of the data, the diagnostics for the second model are summarized below.
Similar to the first regression model, we still have 20 regressors. Comparing the goodness-of-fit statistics using all of the data, the Adjusted R-Square is very similar at .8768 (vs. .8919), with an increase in the F value to 178.96 (vs. 140.04).
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| | Variance Inflation
Intercept | 1 | 0.00009953 | 0.00012133 | 0.82 | 0.4124 | 0 |
return_AA | 1 | 0.01538 | 0.01040 | 1.48 | 0.1399 | 2.02627 |
return_BAC | 1 | 0.02723 | 0.00970 | 2.81 | 0.0052 | 2.68654 |
return_BHI | 1 | 0.01604 | 0.01161 | 1.38 | 0.1678 | 2.65137 |
return_CVX | 1 | 0.05742 | 0.02068 | 2.78 | 0.0057 | 2.92019 |
return_DD | 1 | 0.01003 | 0.01625 | 0.62 | 0.5371 | 2.44149 |
return_DOW | 1 | 0.03600 | 0.01069 | 3.37 | 0.0008 | 1.96589 |
return_DPS | 1 | 0.05659 | 0.01493 | 3.79 | 0.0002 | 1.52563 |
return_GS | 1 | 0.03434 | 0.01358 | 2.53 | 0.0118 | 3.19781 |
return_HAL | 1 | -0.00198 | 0.01210 | -0.16 | 0.8703 | 2.91992 |
return_HES | 1 | 0.00439 | 0.00969 | 0.45 | 0.6504 | 2.09606 |
return_HON | 1 | 0.10707 | 0.01608 | 6.66 | <.0001 | 2.45588 |
return_HUN | 1 | 0.02867 | 0.00722 | 3.97 | <.0001 | 1.74391 |
return_JPM | 1 | 0.02224 | 0.01329 | 1.67 | 0.0948 | 2.87596 |
return_KO | 1 | 0.09425 | 0.01847 | 5.10 | <.0001 | 1.98165 |
return_MMM | 1 | 0.10928 | 0.02202 | 4.96 | <.0001 | 2.68494 |
return_MPC | 1 | 0.01079 | 0.00702 | 1.54 | 0.1251 | 1.37671 |
return_PEP | 1 | 0.02092 | 0.02034 | 1.03 | 0.3043 | 1.72066 |
return_SLB | 1 | 0.04851 | 0.01453 | 3.34 | 0.0009 | 3.25898 |
return_WFC | 1 | 0.07738 | 0.01580 | 4.90 | <.0001 | 2.53263 |
return_XOM | 1 | 0.05797 | 0.02301 | 2.52 | 0.0121 | 2.94983 |
As reflected in the table above, the maximum VIF value is 3.19781, a decrease from 3.36439, and again some of the p-values are not significant. Thus, this regression model appears to perform very similarly to the training model.
Evaluating the diagnostic plots, everything is very similar to the first regression model on the training data, with a minor change in the scale of the Cook's D plot (from .15 to .125) and a more compressed scale on the Q-Q plot.
Overall, the diagnostic plots look good. Based on the above, this model does not have multicollinearity issues. For ease of comparison, the mean square error and mean absolute error will be discussed in part 8.
Part 8: Build a Regression Model Using Principal Components
Principal component analysis is used to reduce the dimensionality of a data set consisting of a large number of interrelated variables while retaining as much of the variation in the data set as possible. By transforming to a new set of uncorrelated variables, the principal components are ordered so that the first few retain most of the variation present in all of the original variables (Jolliffe, I.T., 2010. Principal Component Analysis, 2nd edition). We now use the top eight principal components to build the model on the training and test data and determine which model fits better.
Training Dataset:
Fitting a regression model to the eight selected principal components on the training data, the diagnostics are summarized below.
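A sketch of the principal components regression, reusing the assumed model_data split:

proc reg data=model_data plots=diagnostics;
   where train = 1;
   model response_VV = Prin1-Prin8 / vif;  /* first eight component scores */
run;
quit;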
Comparing this output to the prior models, the Adjusted R-Square is similar to the other models (.8919 and .8768), while the F value increases to 337.13 (from 140.04 and 178.96 in the prior two models, which used all 20 regressors).
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| | Variance Inflation
Intercept | 1 | 0.00075978 | 0.00014045 | 5.41 | <.0001 | 0 |
Prin1 | 1 | 0.00231 | 0.00004519 | 51.05 | <.0001 | 1.00527 |
Prin2 | 1 | 0.00032245 | 0.00011425 | 2.82 | 0.0051 | 1.00868 |
Prin3 | 1 | 0.00070635 | 0.00012322 | 5.73 | <.0001 | 1.00861 |
Prin4 | 1 | 0.00030481 | 0.00014536 | 2.10 | 0.0368 | 1.00636 |
Prin5 | 1 | -0.00017356 | 0.00015516 | -1.12 | 0.2641 | 1.00297 |
Prin6 | 1 | 0.00000315 | 0.00017108 | 0.02 | 0.9853 | 1.00766 |
Prin7 | 1 | -0.00010331 | 0.00018604 | -0.56 | 0.5791 | 1.02315 |
Prin8 | 1 | -0.00040760 | 0.00020293 | -2.01 | 0.0454 | 1.02271 |
From the table above, we see that some of the p-values/predictors are not significant; in particular, Prin5 through Prin7 are not significant, and Prin8 is only marginally significant (p = .0454).
The diagnostic plots are very similar to the prior model.
Test/Full Dataset:
Similar to the above, but now applying the eight principal components to the full data set:
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| | Variance Inflation
Intercept | 1 | 0.00075200 | 0.00012218 | 6.15 | <.0001 | 0 |
Prin1 | 1 | 0.00227 | 0.00003940 | 57.55 | <.0001 | 1.00000 |
Prin2 | 1 | 0.00045862 | 0.00009860 | 4.65 | <.0001 | 1.00000 |
Prin3 | 1 | 0.00050372 | 0.00010536 | 4.78 | <.0001 | 1.00000 |
Prin4 | 1 | 0.00015525 | 0.00012563 | 1.24 | 0.2171 | 1.00000 |
Prin5 | 1 | -0.00018175 | 0.00013713 | -1.33 | 0.1856 | 1.00000 |
Prin6 | 1 | 0.00008231 | 0.00014982 | 0.55 | 0.5830 | 1.00000 |
Prin7 | 1 | -0.00003961 | 0.00016367 | -0.24 | 0.8088 | 1.00000 |
Prin8 | 1 | -0.00035470 | 0.00017080 | -2.08 | 0.0383 | 1.00000 |
Compared with the training-data model, the first three principal components are now significant at <.0001. The VIF values are all exactly 1 because the principal components are uncorrelated (correlation of 0), and 1 is the minimum possible VIF. A VIF of 1 is ideal, reflecting that the regressors are completely unrelated.
The diagnostic plots are also very similar: the residual plot has a few outliers, the Q-Q plot fits the line, and Cook's D is small with a scale maximum of 0.15.
Overall, there are only a few minor differences between the principal components model and the full-data fit, and we see no red flags in the diagnostic plots. The Adjusted R-Squares are very similar and the VIFs are exactly one. Based on this diagnostic information and the plots, we have no reason to believe there are any multicollinearity issues.
Validation: Prediction Grades
Model Summaries:
Evaluating the two models, Model 1 using all of the variables and Model 2 using only eight principal components, we can make several different comparisons.
Model 1 based on All Variables | Model 2 based on Eight Principal Components
[Side-by-side model summaries: each model evaluated on the test data (train=0), the training data (train=1), and all of the data.]
Prediction grades:
The prediction grades have three levels: Grade 1 is within plus/minus 10% error, Grade 2 is within plus/minus 15% error, and Grade 3 covers errors beyond plus/minus 15%. Using the prediction grades shown in the summary tables, Model 1 is within plus/minus 15% error 12.2% of the time on the test data, 14.2% of the time on the training data, and 14.94% of the time on all of the data. Conversely, Model 2 is within plus/minus 15% error 14.63% of the time on the test set, 13.02% of the time on the training set, and 13.75% of the time on all of the data.
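A sketch of how the grades could be tabulated, assuming a data set scored that holds the actuals (response_VV) and a model's predictions; scored and yhat are hypothetical names:

data graded;
   set scored;
   if response_VV ne 0;  /* guard against division by zero */
   abs_pct_error = abs((yhat - response_VV) / response_VV);
   if abs_pct_error <= 0.10 then grade = 1;       /* within +/-10% */
   else if abs_pct_error <= 0.15 then grade = 2;  /* within +/-15% */
   else grade = 3;                                /* beyond +/-15% */
run;

proc freq data=graded;
   tables train*grade / nopercent nocol;  /* grade distribution by split */
run;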
Based on the side-by-side comparison, the mean square error and mean absolute error are better in Model 1. On the test data, Model 2 has the higher within-plus/minus-15% rate, while on the training set Model 1 is better; neither difference is large (12.2% vs. 14.63% and 14.2% vs. 13.02%). On the full data, Model 1 is also slightly better on mean square error, mean absolute error, and the plus/minus 15% rate.
Thus, the models are comparable, with only a small difference of 1.19% (14.94% vs. 13.75%) in the prediction grades on the full data; the full, test, and training results are all in the same ballpark. Whether this is satisfactory depends on our business requirements. If the requirement were to be within plus/minus 15% error at least 10% of the time, we have met the desired threshold; if the requirement were 20% of the time, we have not, and further models would need to be developed. If the plus/minus 15% error performance is acceptable for the business, then we will accept this model.
Mean Square Error and Mean Absolute Error:
The mean square error and mean absolute error for Model 1 and Model 2 are as follows:
Obs | train | MSE_1 | MAE_1 | MSE_2 | MAE_2 |
1 | 0 | .000009306 | .002144904 | .000009677 | .002179249 |
2 | 1 | .000005994 | .001902032 | .000006410 | .001975239 |
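A sketch of how these statistics could be computed, assuming a data set scored that carries both models' predictions; yhat_1 and yhat_2 are hypothetical names:

proc sql;
   create table fit_stats as
   select train,
          mean((response_VV - yhat_1)**2) as MSE_1,
          mean(abs(response_VV - yhat_1)) as MAE_1,
          mean((response_VV - yhat_2)**2) as MSE_2,
          mean(abs(response_VV - yhat_2)) as MAE_2
   from scored
   group by train;
quit;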
As we can see, the differences between the two models in mean square error are very small on both the test and training data, and similarly for the mean absolute error. Thus, the difference in accuracy is very small and not a concern. Referring to the summary table, the change from Model 1 to Model 2 is likewise small and not significant.
Thus, whether we are using the training, test or full data, all of the results are comparable with no large swings between any of the goodness-of-fit indicators.
Based on the above analysis, Model 2 is the better model: the eight-principal-component model performs essentially as well as the full-data model with 20 variables. Given the insignificant difference in accuracy, the smaller number of parameters, perhaps intertwined with lower costs of obtaining the data, makes better business sense. Additionally, Model 2 is simpler, which is preferred.
Conclusion
In conclusion, the stock portfolio data set contained the daily closing prices for 20 stocks and a Vanguard index fund from Jan 3, 2012 to Dec 31, 2013. These values were transformed into log-returns for the purpose of explaining the variation in the log-returns of the stocks and the market index. Correlations between the individual stocks and the market index were computed, with various data table formats and plots used to visualize the correlations across stocks and industry sectors. The principal components were analyzed to determine which components to keep, and those components were used in a regression model. The full data set was compared and contrasted with a training/test split, and a fitted regression model was analyzed for goodness-of-fit and for the existence of multicollinearity issues. Finally, a fitted regression model using eight principal components and an analysis of its variance inflation factors were discussed, leading to the final determination that Model 2, the principal components model, was the best model due to its simplicity.
Taking a step back and looking at the forest instead of the trees: we transformed the stock prices into log-returns for ease of comparison, used the stock portfolio data to create a fitted model with all of the regressors, and analyzed that model for multicollinearity. We then applied principal components, choosing only eight components to explain 80% of the variation. Decreasing the number of predictors from 20 to 8 simplified the model with minimal loss in the goodness-of-fit statistics. Whether eight components is the right number depends on the business requirements and on the plus/minus 15% error-rate threshold desired by the company. If the threshold is not met, further models will need to be developed; if the threshold is met, then no further work is required.