Principal Components Analysis
Utilizing a stock portfolio data set, we apply Principal Components Analysis both as a dimension-reduction method and as a remedial measure for multicollinearity in Ordinary Least Squares regression. Beginning with the data, we transform the closing prices into log-returns so that the variation in the log-returns of the individual stocks can be used to explain the variation in the log-returns of the market index. Next, correlations between the individual stocks and the market index are computed, the data tables are transposed into a long format, and various plots are shown to visualize the correlations. We then analyze the principal components, decide which components to keep based on our requirements, and apply them in a regression model fit to data split into a training and a test set. We assess the goodness-of-fit of a regression model on the individual stocks and determine whether multicollinearity issues exist. Similarly, using eight principal components, we fit a regression model and analyze the variance inflation factors, leading to a final determination of the best model.
Part 1: Data & Transformation of Data
Using the dataset stock_portfolio, we load the data into SAS, where we can view the daily closing price for 20 stocks and the Vanguard index fund (VV) from Jan 3, 2012 to Dec 31, 2013. New variables are created to normalize the data and calculate the daily return for each stock using the log function, for example return_AA = log(AA/lag1(AA)), with the lag being the one-day difference between closing prices. The response (dependent) variable, response_VV, is the log-return of the market index, whose variation we explain using the log-returns of the individual stocks. If we did not transform the stock prices, the stocks would be compared at their closing dollar values, which provides no basis for comparing daily returns, as the 'measuring stick' would be different for each stock.
Using the log-returns of each stock, we will explain the variation in the log-returns of the market index using both linear regression and principal components analysis.
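The log-return transformation can be sketched in Python (the report's own code is SAS; the five prices below are the AA column from the data preview, and the variable names are illustrative):

```python
import math

# Daily closing prices for one stock (the AA column from the data preview).
prices = [9.23, 9.45, 9.36, 9.16, 9.42]

# Log return for day t is log(price_t / price_{t-1}); the first day has no
# lagged price, so its return is missing (SAS prints "." for it).
log_returns = [None] + [
    math.log(curr / prev) for prev, curr in zip(prices, prices[1:])
]

for r in log_returns:
    print("." if r is None else round(r, 6))
```

The second value, 0.023556, matches return_AA for 04JAN2012 in the data preview.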
The results show 502 observations and 22 variables. A quick overview:
Obs  Date  AA  BAC  BHI  KO  MMM  MPC  VV  return_AA  return_BAC  return_BHI  return_CVX  return_DD  return_DOW  response_VV
1  03JAN2012  9.23  5.80  51.02  35.07  83.49  33.41  58.18  .  .  .  .  .  .  .
2  04JAN2012  9.45  5.81  51.53  34.85  84.18  33.76  58.25  0.023556  0.001723  0.009946  0.001723  0.010906  0.005356558  0.001202439
3  05JAN2012  9.36  6.31  50.82  34.68  83.80  31.92  58.44  -0.009569  0.082555  -0.013874  0.009850  0.006829  0.006323869  0.003256494
4  06JAN2012  9.16  6.18  51.26  34.46  83.37  31.66  58.32  -0.021599  -0.020817  0.008621  0.007267  0.014234  0.005954368  -.002055499
5  09JAN2012  9.42  6.27  51.58  34.46  83.87  30.96  58.45  0.027989  0.014458  0.006223  0.010836  0.008435  -.000329870  0.002226600
Part 2: Correlation
The response variable (response_VV) is then paired with each of the predictors to calculate Pearson correlations. The output reflects that all correlations are highly significant, with the lowest value of 0.44350 for return_DPS and the highest of 0.76838 for return_HON. Each test examines whether the correlation equals zero; the p-values of <.0001 lead us to reject that hypothesis. Thus, at the .0001 level, we can be confident the observed correlations are not due to chance.
Pearson Correlation Coefficients, N = 501
Prob > |r| under H0: Rho=0 (all p-values <.0001)

return_AA  return_BAC  return_BHI  return_CVX  return_DD  return_DOW  return_DPS
response_VV  0.63241  0.65019  0.57750  0.72090  0.68952  0.62645  0.44350

return_GS  return_HAL  return_HES  return_HON  return_HUN  return_JPM  return_KO
response_VV  0.71216  0.59750  0.61080  0.76838  0.58194  0.65785  0.59980

return_MMM  return_MPC  return_PEP  return_SLB  return_WFC  return_XOM
response_VV  0.76085  0.47312  0.50753  0.69285  0.73357  0.72111
Obs  correlation  tkr 
1  0.63241  AA 
2  0.65019  BAC 
3  0.57750  BHI 
4  0.72090  CVX 
5  0.68952  DD 
6  0.62645  DOW 
7  0.44350  DPS 
8  0.71216  GS 
9  0.59750  HAL 
10  0.61080  HES 
11  0.76838  HON 
12  0.58194  HUN 
13  0.65785  JPM 
14  0.59980  KO 
15  0.76085  MMM 
16  0.47312  MPC 
17  0.50753  PEP 
18  0.69285  SLB 
19  0.73357  WFC 
20  0.72111  XOM 
Part 3: Transposing to a Long Table with Stock Ticker & Correlation
Similar to Part 2, the correlations are presented here in a different, 'long' format, with the ticker extracted as a separate variable in SAS: tkr = substr(_NAME_, 8, 3); where 8 is the starting position and three characters are taken for the ticker symbol. While the correlation values do not change, it is easier to examine all of the values in this layout, with the understanding that all of the p-values meet our required significance level.
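The wide-to-long reshape can be sketched in Python; the dictionary below is a hypothetical stand-in for the transposed SAS output:

```python
# Hypothetical stand-in for one transposed row of the proc corr output,
# keyed by SAS variable name.
wide = {"return_AA": 0.63241, "return_BAC": 0.65019, "return_BHI": 0.57750}

# tkr = substr(_NAME_, 8, 3): SAS positions are 1-based, so starting at
# position 8 for 3 characters is the slice [7:10] in Python.
long_rows = [{"tkr": name[7:10], "correlation": corr} for name, corr in wide.items()]

for row in long_rows:
    print(row["tkr"], row["correlation"])
```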
Part 4: Adding Sector Information
Similar to Part 3, we are adding information to the table; in this part, we are adding the industry sector. The SAS code is again similar to Part 3, with “input tkr $ 1-3 sector $ 4-35;”. VV is the market index, whereas all other rows are ticker symbols with their sectors. Including the sector codes provides context and insight into the different stocks: we can now see that the lowest-correlated stock, DPS, is in the Soft Drinks sector, and scanning the list for other tickers in that sector shows PEP with a similar correlation of .50753 and KO at .59980. Similarly, the highest-correlated stock, as noted above, is HON (correlation of .76838) in the Manufacturing sector, close to the other manufacturing stock, MMM, with a correlation of .76085.
Obs  correlation  tkr  sector 
1  0.63241  AA  Industrial – Metals 
2  0.65019  BAC  Banking 
3  0.57750  BHI  Oil Field Services 
4  0.72090  CVX  Oil Refining 
5  0.68952  DD  Industrial – Chemical 
6  0.62645  DOW  Industrial – Chemical 
7  0.44350  DPS  Soft Drinks 
8  0.71216  GS  Banking 
9  0.59750  HAL  Oil Field Services 
10  0.61080  HES  Oil Refining 
11  0.76838  HON  Manufacturing 
12  0.58194  HUN  Industrial – Chemical 
13  0.65785  JPM  Banking 
14  0.59980  KO  Soft Drinks 
15  0.76085  MMM  Manufacturing 
16  0.47312  MPC  Oil Refining 
17  0.50753  PEP  Soft Drinks 
18  0.69285  SLB  Oil Field Services 
19  0.73357  WFC  Banking 
20  0.72111  XOM  Oil Refining 
While the above table with the sector provides much more context, viewing the results visually is even easier, showing each ticker symbol and its value.
However, in the above vbar chart, matching the sectors to each stock ticker is not very efficient. Thus, it would be easier for the reader to view the information with the sectors grouped together.
In the above vbar chart, the reader can easily compare the different stock tickers within each industry and get a feel for the correlations between each stock and the Vanguard index (VV). For instance, the manufacturing correlations are very similar, whereas the oil refining stocks have the largest range. However, from the vbar chart alone, it is difficult to judge the mean correlation for each sector.
Obs  sector  Freq  mean_correlation
1  Banking  4  0.68844 
2  Industrial – Chemical  3  0.63264 
3  Industrial – Metals  1  0.63241 
4  Manufacturing  2  0.76461 
5  Oil Field Services  3  0.62262 
6  Oil Refining  4  0.63148 
7  Soft Drinks  3  0.51694 
Thus, presenting the sectors in both a table and a vbar chart makes it easy to visualize and compare them, and the frequency column shows how many stocks are within each sector.
As we can see, the highest mean correlation is Manufacturing at .764 and the lowest is Soft Drinks at .516.
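The per-sector frequencies and mean correlations can be reproduced with a simple group-and-average, sketched here in Python using a few rows from the long table:

```python
from collections import defaultdict

# A few (sector, correlation) rows from the long table above.
stocks = {
    "MMM": ("Manufacturing", 0.76085),
    "HON": ("Manufacturing", 0.76838),
    "DPS": ("Soft Drinks", 0.44350),
    "PEP": ("Soft Drinks", 0.50753),
    "KO":  ("Soft Drinks", 0.59980),
}

# Group the correlations by sector, then report frequency and mean.
by_sector = defaultdict(list)
for tkr, (sector, corr) in stocks.items():
    by_sector[sector].append(corr)

for sector, corrs in sorted(by_sector.items()):
    print(sector, len(corrs), round(sum(corrs) / len(corrs), 5))
```

The Soft Drinks mean of 0.51694 matches the sector summary table.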
Part 5: Principal Components Analysis
Eigenvalues of the Correlation Matrix  
Eigenvalue  Difference  Proportion  Cumulative  
1  9.63645075  8.09792128  0.4818  0.4818 
2  1.53852947  0.19109235  0.0769  0.5587 
3  1.34743712  0.39975791  0.0674  0.6261 
4  0.94767921  0.15217268  0.0474  0.6735 
5  0.79550653  0.12909860  0.0398  0.7133 
6  0.66640793  0.10798740  0.0333  0.7466 
7  0.55842052  0.04567198  0.0279  0.7745 
8  0.51274854  0.01590728  0.0256  0.8002 
9  0.49684126  0.03250822  0.0248  0.8250 
10  0.46433304  0.03089374  0.0232  0.8482 
11  0.43343929  0.02568332  0.0217  0.8699 
12  0.40775598  0.05667006  0.0204  0.8903 
13  0.35108592  0.01597897  0.0176  0.9078 
14  0.33510695  0.03813712  0.0168  0.9246 
15  0.29696984  0.02068234  0.0148  0.9394 
16  0.27628750  0.01692712  0.0138  0.9532 
17  0.25936037  0.01730228  0.0130  0.9662 
18  0.24205809  0.02020002  0.0121  0.9783 
19  0.22185807  0.01013445  0.0111  0.9894 
20  0.21172363  0.0106  1.0000 
Creating a new dataset that excludes the dependent variable (the market index), we can calculate the principal components. Ignoring the response variable and focusing on the 20 predictor return values, we reduce the dimensionality of the 20 variables using principal components, evaluating only the return data. SAS allows us to quickly match all of our log-return variables, which we then use to calculate the eigenvectors. The number of eigenvectors will match the number of variables.
proc princomp data=return_data out=pca_output outstat=eigenvectors plots=scree(unpackpanel);
‘out=’ will contain all of the outputs from the Principal Components Analysis, whereas ‘outstat=’ will contain only specific statistics, in this case the eigenvectors. Prin1 needs to be read carefully, as the same name is used later for the component scores of all observations.
In the outstat output, Prin1 holds the weights of the first principal component, with one weight per variable. The first principal component is the linear combination of the standardized returns using these weights, and the eigenvalues give each component's share of the total variance along with the cumulative total.
The Variance Explained plot shows how much of the variance is explained by each number of principal components, which helps in choosing how many components to keep to reach the desired level of variance.
Determining the 'right' number of components to keep depends on the desired threshold for explained variation. If the threshold were 80%, eight principal components would be required (cumulative proportion 0.8002). To explain roughly 70% of the variation, we would keep five components (four explain 67.4%, five explain 71.3%), and to explain 90% we would keep 13. Assuming 80% is our threshold, we will keep eight principal components.
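The component counts can be checked directly from the eigenvalue table. The sketch below hard-codes the 20 reported eigenvalues (for a correlation matrix of 20 variables, they sum to 20) and confirms that eight components reach the 80% threshold and 13 reach 90%:

```python
# Eigenvalues reported in the table above.
eigenvalues = [
    9.63645075, 1.53852947, 1.34743712, 0.94767921, 0.79550653,
    0.66640793, 0.55842052, 0.51274854, 0.49684126, 0.46433304,
    0.43343929, 0.40775598, 0.35108592, 0.33510695, 0.29696984,
    0.27628750, 0.25936037, 0.24205809, 0.22185807, 0.21172363,
]
total = sum(eigenvalues)

def components_needed(threshold):
    """Smallest number of leading components whose cumulative proportion
    of total variance reaches the threshold."""
    cumulative = 0.0
    for k, eigenvalue in enumerate(eigenvalues, start=1):
        cumulative += eigenvalue
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)

print(components_needed(0.80))  # → 8  (cumulative 0.8002)
print(components_needed(0.90))  # → 13 (cumulative 0.9078)
```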
Evaluating whether there are clusters in the data, a review of the principal component scores reflects that they are completely uncorrelated with each other. Thus, the principal component scores can be used as a new set of variables without any multicollinearity issues, since there is no correlation among them.
Graphically, we can plot the first two principal component vectors to evaluate whether there are relationships within the data. This is usually accomplished by reviewing the graph quadrant by quadrant.
In that plot, there is a cluster of stock tickers in the first quadrant from the soft drink industry, heavily weighted on Prin2 but lightly weighted on Prin1. The remaining industries sit mostly in the fourth quadrant, weighted more heavily on Prin1 than the soft drinks. It is no surprise that the manufacturing tickers appear close together.
Part 6: Regression Model using Training and Test Data Sets
We build regression models using both the original variables and the principal components, with the market index as the response, to determine which model is better: the principal components or the raw data. The data are split 70/30 into a training set and a test set to create cross-validation data. The table on the right reflects the observations used in the training data set.
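A 70/30 split of this kind might be sketched as follows (hypothetical seed and a ranuni-style random assignment; the report's actual split is done in SAS):

```python
import random

random.seed(123)  # hypothetical seed, so the sketch is reproducible

observations = list(range(1, 502))  # the 501 usable daily-return rows

# Assign each observation to the training set with probability 0.70,
# mirroring a ranuni-style "train = (uniform < 0.70)" flag in SAS.
train = [obs for obs in observations if random.random() < 0.70]
test = [obs for obs in observations if obs not in train]

print(len(train), len(test))
```

The realized split varies slightly around 70/30 because the assignment is random per row.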
Part 7: Build a Regression Model
Using the Training Data Set:
We use the training data to build the model and the test data to evaluate it. For the regression model using the raw predictor variables, the diagnostic summary is:


The output reflects that 20 predictors are being used, with an Adjusted R-Square of .8919, which is good. The F value of 140.04 will be used to compare against the other models, and the p-value remains within our threshold at <.0001.
The parameter estimates:
Parameter Estimates
Variable  DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept  1  0.00008640  0.00014092  0.61  0.5403  0 
return_AA  1  0.01769  0.01317  1.34  0.1802  2.11490 
return_BAC  1  0.03198  0.01165  2.75  0.0064  3.10927 
return_BHI  1  0.00111  0.01323  0.08  0.9333  2.62997 
return_CVX  1  0.04907  0.02536  1.93  0.0539  3.07524 
return_DD  1  0.04674  0.02037  2.29  0.0224  2.51406 
return_DOW  1  0.03642  0.01162  3.14  0.0019  1.88893 
return_DPS  1  0.03670  0.01679  2.19  0.0295  1.54768 
return_GS  1  0.04849  0.01555  3.12  0.0020  3.10450 
return_HAL  1  0.00948  0.01466  0.65  0.5184  3.08758 
return_HES  1  0.00359  0.01092  0.33  0.7425  2.10199 
return_HON  1  0.12213  0.01924  6.35  <.0001  2.73505 
return_HUN  1  0.02712  0.00836  3.24  0.0013  1.79852 
return_JPM  1  0.00902  0.01708  0.53  0.5979  3.36439 
return_KO  1  0.07903  0.02226  3.55  0.0004  1.93633 
return_MMM  1  0.09796  0.02646  3.70  0.0003  2.98277 
return_MPC  1  0.01673  0.00809  2.07  0.0394  1.32999 
return_PEP  1  0.02911  0.02231  1.30  0.1929  1.68825 
return_SLB  1  0.03776  0.01709  2.21  0.0279  3.13690 
return_WFC  1  0.07587  0.01848  4.10  <.0001  2.59492 
return_XOM  1  0.05467  0.02697  2.03  0.0435  2.98393 
Reviewing the table, the F value is highly significant while several individual t-values are not, which can be a sign of multicollinearity. We therefore evaluate the Variance Inflation Factor (VIF), which measures the dependencies among the regressors. VIF values exceeding 5 or 10 indicate that the associated regression coefficients are poorly estimated due to multicollinearity. With a maximum value of 3.36439, well below those cutoffs, we do not have a multicollinearity problem, although some of the p-values are not significant.
As we can see above, the data fit the line well in the Q-Q plot, Cook's D has low values with a maximum of .15, and the residuals are scattered in a random manner.
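The VIF rule of thumb can be illustrated with a minimal sketch. The two-predictor case, where the general formula VIF_j = 1/(1 - R_j^2) reduces to a single pairwise correlation, uses hypothetical values rather than the report's data:

```python
# In general, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the others. With exactly two predictors this reduces
# to 1 / (1 - r^2) for their pairwise correlation r (hypothetical values).
def vif_two_predictors(r):
    return 1.0 / (1.0 - r ** 2)

# r = 0.84 already pushes VIF near 3.4, about the 3.36 maximum seen above.
print(round(vif_two_predictors(0.84), 2))  # → 3.4
# Uncorrelated predictors give the minimum possible VIF of 1.
print(vif_two_predictors(0.0))  # → 1.0
```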
Using ALL of the Data:
Using all of the data, the diagnostic information on the second model is as follows:


Similar to the first regression model, we still have 20 regressors. Comparing the goodness-of-fit statistics using all of the data, the Adjusted R-Square is very similar at .8768 (vs. .8919), with an improvement in the F value to 178.96 (vs. 140.04).
Parameter Estimates
Variable  DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept  1  0.00009953  0.00012133  0.82  0.4124  0 
return_AA  1  0.01538  0.01040  1.48  0.1399  2.02627 
return_BAC  1  0.02723  0.00970  2.81  0.0052  2.68654 
return_BHI  1  0.01604  0.01161  1.38  0.1678  2.65137 
return_CVX  1  0.05742  0.02068  2.78  0.0057  2.92019 
return_DD  1  0.01003  0.01625  0.62  0.5371  2.44149 
return_DOW  1  0.03600  0.01069  3.37  0.0008  1.96589 
return_DPS  1  0.05659  0.01493  3.79  0.0002  1.52563 
return_GS  1  0.03434  0.01358  2.53  0.0118  3.19781 
return_HAL  1  0.00198  0.01210  0.16  0.8703  2.91992 
return_HES  1  0.00439  0.00969  0.45  0.6504  2.09606 
return_HON  1  0.10707  0.01608  6.66  <.0001  2.45588 
return_HUN  1  0.02867  0.00722  3.97  <.0001  1.74391 
return_JPM  1  0.02224  0.01329  1.67  0.0948  2.87596 
return_KO  1  0.09425  0.01847  5.10  <.0001  1.98165 
return_MMM  1  0.10928  0.02202  4.96  <.0001  2.68494 
return_MPC  1  0.01079  0.00702  1.54  0.1251  1.37671 
return_PEP  1  0.02092  0.02034  1.03  0.3043  1.72066 
return_SLB  1  0.04851  0.01453  3.34  0.0009  3.25898 
return_WFC  1  0.07738  0.01580  4.90  <.0001  2.53263 
return_XOM  1  0.05797  0.02301  2.52  0.0121  2.94983 
As reflected in the table above, the maximum VIF value is 3.19781, a decrease from 3.36439. Some of the p-values are again not significant. Thus, this regression model performs very similarly to the training model.
Evaluating the diagnostic plots, everything is very similar to the first regression model using the training data, with a minor change in the scale of the Cook's D plot (from .15 to .125) and a more compressed scale on the Q-Q plot.
Overall, the diagnostic plots look good. Based on the above, this model does not have multicollinearity issues. For ease of comparison, the mean square error and mean absolute error will be discussed in part 8.
Part 8: Build a Regression Model Using Principal Components
Principal component analysis is used to reduce the dimensionality of a data set consisting of a large number of interrelated variables while retaining as much of the variation in the data set as possible. By transforming to a new set of variables, the uncorrelated principal components are ordered so that the first few retain most of the variation present in all of the original variables (Jolliffe, I.T., 2010. Principal Component Analysis, 2nd edition). We use the top eight principal components to build the model on the training and test data and determine which model fits better.
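As a sketch of the principal-components-regression recipe described above (synthetic data and numpy only, not the report's SAS code; the one-factor return structure and all parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the return data: 200 days of 5 "stocks" driven by
# one common market factor (the real analysis is 501 days by 20 stocks).
market = rng.normal(0.0, 0.01, size=200)
X = market[:, None] + rng.normal(0.0, 0.004, size=(200, 5))
y = market + rng.normal(0.0, 0.002, size=200)

# Principal component scores from the standardized predictors.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
scores = Z @ eigvecs[:, order][:, :2]      # keep the top two components

# Ordinary least squares of the response on the retained scores.
A = np.column_stack([np.ones(len(y)), scores])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# The scores are uncorrelated by construction, so their correlation matrix
# is the identity and every VIF equals 1.
print(np.allclose(np.corrcoef(scores, rowvar=False), np.eye(2), atol=1e-8))
```

The orthogonality of the component scores is exactly why the VIFs in the tables below come out as 1.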
Training Dataset:
Fitting a regression model to eight selected principal components, the diagnostic information is as follows:


Comparing the above information to the prior models, the Adjusted R-Square is similar (.8919 and .8768 previously), with an increase in the F value to 337.13 (from 140.04 and 178.96 in the prior two models using all 20 regressors).
Parameter Estimates
Variable  DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept  1  0.00075978  0.00014045  5.41  <.0001  0 
Prin1  1  0.00231  0.00004519  51.05  <.0001  1.00527 
Prin2  1  0.00032245  0.00011425  2.82  0.0051  1.00868 
Prin3  1  0.00070635  0.00012322  5.73  <.0001  1.00861 
Prin4  1  0.00030481  0.00014536  2.10  0.0368  1.00636 
Prin5  1  0.00017356  0.00015516  1.12  0.2641  1.00297 
Prin6  1  0.00000315  0.00017108  0.02  0.9853  1.00766 
Prin7  1  0.00010331  0.00018604  0.56  0.5791  1.02315 
Prin8  1  0.00040760  0.00020293  2.01  0.0454  1.02271 
From the above table, we see that some of the p-values/predictors are not significant, particularly Prin5 through Prin7; Prin4 and Prin8 are only marginally significant at the .05 level.
The diagnostic plots are very similar to the prior model.
Test/Full Dataset:
Similar to the above but now applying the eight principal components to the full data set:


Parameter Estimates
Variable  DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept  1  0.00075200  0.00012218  6.15  <.0001  0 
Prin1  1  0.00227  0.00003940  57.55  <.0001  1.00000 
Prin2  1  0.00045862  0.00009860  4.65  <.0001  1.00000 
Prin3  1  0.00050372  0.00010536  4.78  <.0001  1.00000 
Prin4  1  0.00015525  0.00012563  1.24  0.2171  1.00000 
Prin5  1  0.00018175  0.00013713  1.33  0.1856  1.00000 
Prin6  1  0.00008231  0.00014982  0.55  0.5830  1.00000 
Prin7  1  0.00003961  0.00016367  0.24  0.8088  1.00000 
Prin8  1  0.00035470  0.00017080  2.08  0.0383  1.00000 
Compared to the training model, the first three predictors (Prin1 through Prin3) are clearly significant here, while Prin4, significant in the training model, no longer is. The VIF values are all ones because the principal components are independent, with zero correlation among them, so the minimum possible VIF of 1 is attained. A VIF of 1 is ideal, as it reflects regressors that are completely unrelated.
The above plots are also very similar: the residual plot has a few outliers, the Q-Q plot fits the line, and Cook's D is small with a scale maximum of 0.15.
Overall, there are only a few minor differences between the principal components model on the training data and on the full data, and we see no red flags in the diagnostic plots. The Adjusted R-Squares are very similar and the VIFs are exactly one. Based on the above diagnostic information and plots, we have no reason to believe there are any multicollinearity issues.
Validation: Prediction Grades
Model Summaries:
Evaluating the two models, model 1 using all of the variables and model 2 using only eight principal components, we can make several different comparisons.
Model 1 based on All Variables  Model 2 based on Eight Principal Components  
Model 1: Test Data Train=0  Model 2: Test Data Train=0  



Model 1: Training Data Train=1  Model 2: Training Data Train=1  



Model 1: All Data  Model 2: All Data  





Prediction grades:
The prediction grades have three levels: Grade 1 is within plus/minus 10% error, Grade 2 is within plus/minus 15% error, and Grade 3 covers everything beyond that. Using the prediction grades as shown in the summary table, model 1 on the test data is within plus/minus 15% error 12.2% of the time; on the training data, 14.2% of the time; and on all of the data, 14.94% of the time. Conversely, model 2 is within plus/minus 15% error 14.63% of the time on the test set, 13.02% on the training set, and 13.75% on all of the data.
Based on our side-by-side comparison, the mean square error and mean absolute error favor model 1. On the test data, model 2 has the higher within plus/minus 15% rate (14.63% vs. 12.2%), while on the training data, model 1 does (14.2% vs. 13.02%); neither difference is large. On the full data, model 1 also has the better mean square error, mean absolute error, and within plus/minus 15% rate.
Thus, the models are comparable, with only a small difference of 1.19% (14.94% vs. 13.75%) in the prediction grades using the full data; the test, training, and full-data results are all in the same ballpark. Depending on the business requirement, if the plus/minus 15% error performance of the model is acceptable, then we will accept this model.
Whether the above is satisfactory depends on our business requirements. If the requirement were to fall within plus/minus 15% at least 10% of the time, we have met the desired threshold; however, if the requirement were 20% of the time, we have not, and further models would need to be developed.
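The grading scheme can be sketched as a small function (hypothetical rule matching the three-level description above, with made-up prediction pairs):

```python
# Hypothetical grading rule matching the description above: Grade 1 within
# plus/minus 10% of the actual value, Grade 2 within plus/minus 15%,
# Grade 3 everything else.
def prediction_grade(actual, predicted):
    relative_error = abs(predicted - actual) / abs(actual)
    if relative_error <= 0.10:
        return 1
    if relative_error <= 0.15:
        return 2
    return 3

# Three made-up (actual, predicted) pairs, one per grade.
pairs = [(0.010, 0.0095), (0.010, 0.0112), (0.010, 0.020)]
grades = [prediction_grade(a, p) for a, p in pairs]
print(grades)  # → [1, 2, 3]

# Share of predictions within plus/minus 15% (Grade 1 or 2), the quantity
# reported in the summary tables.
print(sum(g <= 2 for g in grades) / len(grades))
```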
Mean Square Error and Mean Absolute Error:
The mean square error and mean absolute error for model 1 and model 2 are as follows:
Obs  train  MSE_1  MAE_1  MSE_2  MAE_2 
1  0  .000009306  .002144904  .000009677  .002179249 
2  1  .000005994  .001902032  .000006410  .001975239 
As we can see, there is a very small difference in the mean square error between the models on both the test and training data, and similarly for the mean absolute error. Thus, the difference in accuracy is very small and not a concern. Referring to the summary table, the change from model 1 to model 2 is likewise small and not significant.
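Both measures are simple averages over the hold-out observations; a minimal sketch with made-up numbers:

```python
# Mean square error and mean absolute error over hypothetical
# (actual, predicted) log-return pairs.
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [0.0010, -0.0020, 0.0030, 0.0005]
predicted = [0.0012, -0.0015, 0.0024, 0.0008]

print(mse(actual, predicted))
print(mae(actual, predicted))
```

MSE penalizes large misses more heavily than MAE, which is why the two can rank models differently.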
Thus, whether we use the training, test, or full data, all of the results are comparable, with no large swings in any of the goodness-of-fit indicators.
Based on the above analysis, model 2 is the better model: the eight-principal-component model performs essentially as well as the 20-variable full-data model, and given the insignificant difference in fit, the smaller number of parameters (and the potentially lower costs of attaining the data) makes better business sense. Additionally, model 2 is simpler, which is preferred.
Conclusion
In conclusion, the stock portfolio data set contained the daily closing prices for 20 stocks and a Vanguard index fund from Jan 3, 2012 to Dec 31, 2013. These values were transformed into log-returns for the purpose of explaining the variation in the log-returns of the market index. Correlations between the individual stocks and the market index were computed, with various formats of data tables and plots used to visualize the correlations across stocks and industry sectors. The principal components were analyzed to determine which components to keep, and these were used in a regression model. Models fit to the full data and to the training/test split were analyzed and contrasted, with each fitted regression model assessed for goodness-of-fit and for the existence of multicollinearity issues. Finally, using eight principal components, a fitted regression model and its variance inflation factors were discussed, leading to the final determination that the best model, due to its simplicity, was model 2, the principal components model.
Taking a step back and looking at the forest instead of the trees: we transformed the stock portfolio data into log-returns for ease of comparison, fit a model using all of the regressors, and analyzed it to determine whether multicollinearity existed. We then applied principal components, choosing only eight components to explain 80% of the variation. Decreasing the number of regressors from 20 to 8 simplified the model with minimal loss in the goodness-of-fit statistics. Whether eight components is the right number depends on the business requirements and the impact on the plus/minus 15% error-rate threshold desired by the company. If the threshold is not met, further models will need to be developed; if the threshold is met, then no further work is required.