# Principal Components Analysis

Utilizing a stock portfolio data set and the Principal Components Analysis as a method in reducing dimension and as a remedial measure for multicollinearity in Ordinary Least Squares regression.  Beginning with the data, we will transform the variables into log values to explain the variation in the log-returns of the stocks and market index.  Next, correlations between the individual stocks and the market index will be conducted followed by the transposing the data tables into a format and various plot to visualize the correlations will be shown.  An analysis on the principal components and which components to keep based on our requirements will be applied to our regression model and later, applied to the data which will be split into a training and test set.  An analysis on fitting a regression model to the individual stocks on a goodness-of-fit and determination on the existence of multicollinearity issues will be provided.  Similarly, using eight principal components, a fitted regression model and an analysis on the variance inflation factors will be discussed leading into final determination on the best model.

Part 1:  Data & Transformation of Data

Using the dataset, stock_portfolio we load the data into SAS where we can view the daily closing price for 20 stocks and indexed funds from Vanguard(VV) beginning Jan 3, 2012 to Dec 31, 2013.  New variables are created to normalize the data and calculate the daily return for each stock by using the log function for example, return_AA  = log(AA/lag1(AA)) with lag being the difference of one day between the closing prices. The response or dependent variable, ‘VV’ is the variation in the log-returns of the market index which allows us to explain the variation on the individual stock returns.  If we did transform the stock prices, the stocks would be at their closing dollar value which does not provide a basis for comparison on their daily return as the ‘measuring stick’ would be different for each stock.

Using log-returns of each stock we will explain the variation in the log-returns of the market index using both linear regression and principal components analysis.

The results show 502 observations, 22 variables.  A quick overview:

 Obs Date AA BAC BHI KO MMM MPC VV return_AA return_BAC return_BHI return_CVX return_DD return_DOW response_VV 1 03JAN2012 9.23 5.8 51.02 35.07 83.49 33.41 58.18 . . . . . . . 2 04JAN2012 9.45 5.81 51.53 34.85 84.18 33.76 58.25 0.023556 0.001723 0.009946 -0.001723 0.010906 0.005356558 0.001202439 3 05JAN2012 9.36 6.31 50.82 34.68 83.8 31.92 58.44 -0.009569 0.082555 -0.013874 -0.009850 -0.006829 0.006323869 0.003256494 4 06JAN2012 9.16 6.18 51.26 34.46 83.37 31.66 58.32 -0.021599 -0.020817 0.008621 -0.007267 -0.014234 0.005954368 -.002055499 5 09JAN2012 9.42 6.27 51.58 34.46 83.87 30.96 58.45 0.027989 0.014458 0.006223 0.010836 0.008435 -.000329870 0.002226600

## Part 2:  Correlation

The response variable (VV) is then used with all of the predictors to calculate the PearsonCorr correlation.  The output reflects that all correlations are highly significant with the lowest correlation value at .44350 on return_DPS and the highest correlation of .76838 on return_MMM.  It is testing if the correlation is equal to zero along with the p value reflecting a significance level of  <.0001 and therefore, not equal to zero.  Thus, at the .0001 level we know that the results are not accidental and that the results are not just by chance.

 Pearson Correlation Coefficients, N = 501 Prob > |r| under H0: Rho=0 return_AA return_BAC return_BHI return_CVX return_DD return_DOW return_DPS response_VV 0.63241 <.0001 0.65019 <.0001 0.57750 <.0001 0.72090 <.0001 0.68952 <.0001 0.62645 <.0001 0.44350 <.0001 return_GS return_HAL return_HES return_HON return_HUN return_JPM return_KO response_VV 0.71216 <.0001 0.59750 <.0001 0.61080 <.0001 0.76838 <.0001 0.58194 <.0001 0.65785 <.0001 0.59980 <.0001 return_MMM return_MPC return_PEP return_SLB return_WFC return_XOM response_VV 0.76085 <.0001 0.47312 <.0001 0.50753 <.0001 0.69285 <.0001 0.73357 <.0001 0.72111 <.0001

 Obs correlation tkr 1 0.63241 AA 2 0.65019 BAC 3 0.57750 BHI 4 0.72090 CVX 5 0.68952 DD 6 0.62645 DOW 7 0.44350 DPS 8 0.71216 GS 9 0.59750 HAL 10 0.61080 HES 11 0.76838 HON 12 0.58194 HUN 13 0.65785 JPM 14 0.59980 KO 15 0.76085 MMM 16 0.47312 MPC 17 0.50753 PEP 18 0.69285 SLB 19 0.73357 WFC 20 0.72111 XOM

## Part 3:  Transposing to a Long Table with Stock Ticker & Correlation

Similar to part 2 reflecting the correlation but in a different format referred to as ‘long’ format with the ticket added as a separate line in SAS:   tkr = substr(_NAME_,8,3); with the 8 is the position with three positions for the ticker symbol. While the correlation values do not change, it is easier to examine all of the values but with the understanding that all of the p values meet our required significance levels.

## Part 4:  Adding Sector Information

Similar to Part 3, we are adding information to the table.  In this part, we are adding the industry sector.  SAS code is again, similar to part 3 with “input tkr \$ 1-3 sector \$ 4-35;”.  VV is the market index whereas all other items are sectors with ticket symbols.  Including the sector codes provides context and insight into the different stocks as we can now see that the lowest correlated stock is DPS but that it is in the soft drink sector but can also scan the list for other ticker symbols within the soft drink sector and see that the ticker symbol PEP also has a similar correlation of .50753 and KO at .59980.  Similarly, the highest correlated stock as noted above was MMM (correlation of .76085) which is in the Manufacturing sector and is similar to another manufacturing stock HON with a correlation of .76838.

 Obs correlation tkr sector 1 0.63241 AA Industrial – Metals 2 0.65019 BAC Banking 3 0.57750 BHI Oil Field Services 4 0.72090 CVX Oil Refining 5 0.68952 DD Industrial – Chemical 6 0.62645 DOW Industrial – Chemical 7 0.44350 DPS Soft Drinks 8 0.71216 GS Banking 9 0.59750 HAL Oil Field Services 10 0.61080 HES Oil Refining 11 0.76838 HON Manufacturing 12 0.58194 HUN Industrial – Chemical 13 0.65785 JPM Banking 14 0.59980 KO Soft Drinks 15 0.76085 MMM Manufacturing 16 0.47312 MPC Oil Refining 17 0.50753 PEP Soft Drinks 18 0.69285 SLB Oil Field Services 19 0.73357 WFC Banking 20 0.72111 XOM Oil Refining

While the above table with the sector provides us with a lot more context, visually viewing the results is even easier showing the ticker symbol and value.

However, in the above vbar chart, matching the sectors to each stock ticker is a not very efficient.  Thus, it would be easier for the reader to view the information with the sectors being grouped together.

In the above vbar chart, the reader can easily compare the different stock tickers within each industry and get a feel for the different correlations between the stock and the Vanguard index (VV).  For instance, manufacturing correlations are very similar whereas oil refining stocks have the largest range.  However, mentally analyzing the vbar chart, it is difficult to calculate the mean correlation for each sector.

 Obs sector Freq mean_ correlation 1 Banking 4 0.68844 2 Industrial – Chemical 3 0.63264 3 Industrial – Metals 1 0.63241 4 Manufacturing 2 0.76461 5 Oil Field Services 3 0.62262 6 Oil Refining 4 0.63148 7 Soft Drinks 3 0.51694

Thus, a table on the different sectors in both a table and vbar are easy to visual and compare the different sectors and with the frequency understand how many stocks are within each sector.

As we can see the highest correlation is Manufacturing with an average of .764 and the lowest being Soft Drinks with a correlation of .516.

## Part 5:  Principal Components Analysis

 Eigenvalues of the Correlation Matrix Eigenvalue Difference Proportion Cumulative 1 9.63645075 8.09792128 0.4818 0.4818 2 1.53852947 0.19109235 0.0769 0.5587 3 1.34743712 0.39975791 0.0674 0.6261 4 0.94767921 0.15217268 0.0474 0.6735 5 0.79550653 0.12909860 0.0398 0.7133 6 0.66640793 0.10798740 0.0333 0.7466 7 0.55842052 0.04567198 0.0279 0.7745 8 0.51274854 0.01590728 0.0256 0.8002 9 0.49684126 0.03250822 0.0248 0.8250 10 0.46433304 0.03089374 0.0232 0.8482 11 0.43343929 0.02568332 0.0217 0.8699 12 0.40775598 0.05667006 0.0204 0.8903 13 0.35108592 0.01597897 0.0176 0.9078 14 0.33510695 0.03813712 0.0168 0.9246 15 0.29696984 0.02068234 0.0148 0.9394 16 0.27628750 0.01692712 0.0138 0.9532 17 0.25936037 0.01730228 0.0130 0.9662 18 0.24205809 0.02020002 0.0121 0.9783 19 0.22185807 0.01013445 0.0111 0.9894 20 0.21172363 0.0106 1.0000

Creating a new dataset by excluding the dependent variable – market index we can calculate the principle components.  Ignoring the response variables and focusing on the 20 predictor values/return values we can reduce the number of dimensions of the 20 variables using the principal component.  Evaluating only the return data.  SAS allows us to quickly match all of our log variables.  We can then use these to calculate the eigenvectors. The number of eigenvectors will match the number of variables.

proc princomp data=return_data out=pca_output outstat=eigenvectors plots=scree(unpackpanel);

‘out=’ will have all of the outputs from the Principal Components Analysis whereas ‘outstat’ will only contain specific statistics captured which in this case is eigenvectors. Prin1 needs to be carefully reviewed as the same term is used later with all observations

Prin1 is the weights of the first principal component with the size of the Eigenvalue will be the same number of variables.  The values are the variable weights of all of the variables reflecting a linear combination of the eigenvalues and the cumulative total.

Variance Explained plot shows the number of principle components to explain the and the number of principle components you want to keep to explain the desired level of variance.

Determining the ‘right’ number of components to keep depends on the desired threshold for explaining the variation.  As explained in red text, if the threshold was 80%, eight principal components would be required.  If we wanted to explain 70% of the variation, we would want to keep four principal components and if we wanted to explain 90% of the variation, we would keep 13 of the principal components.  Assuming 80% is our threshold, we will want to keep eight principal components.

Evaluating whether there are clusters of data, a review of the Principal value scores, reflects that they are completely independent of each other.  Thus, the principal component scores can be used as another set of variables without having any multicollinearity issues as there is no correlation.

Graphically, we can view the first two principle component vectors to evaluate whether there is any relationship within the data.  This is usually accomplished by reviewing the graph in a quadrant-by-quadrant analysis. Similar to the analysis above, there is a cluster of stock tickets in the first quadrant that are from the soft drink industry which are heavily weighted in Prin2 but lightly weighted in Prin1.  Prin1 is mostly in the fourth quadrant is heavily weighted all of the other industries compared to the soft drinks.   No surprise that we see the manufacturing tickers close together.

## Part 6:  Regression Model using Training and Test Data Sets

Building a regression model using the original variables and the principal components with the response variable being the index to predict which model is better – the principal component or the raw data.  Split the data into a test and training set with a 70/30 split between the training and test data.  Created a cross validation data.  The table on the right reflects the observations used in the training data set.

## Part 7:  Build a Regression Model

### Using the Training Data Set:

Using the training data and the test data to build the model. The regression model using the raw predictor variables, the diagnostic summary:

 Number of Observations Read 502 Number of Observations Used 338 Number of Observations with Missing Values 164
 Root MSE 0.00253 R-Square 0.8983 Dependent Mean 0.00061635 Adj R-Sq 0.8919 Coeff Var 410.185
 Analysis of Variance Source DF Sum of Squares Mean Square F Value Pr > F Model 20 0.01790 0.00089510 140.04 <.0001 Error 317 0.00203 0.00000639 Corrected Total 337 0.01993

The data reflects 20 predictors are being used with an Adjusted R-Square of .8919 which is good.  The F value of 140.04 will be used to compare against the other models and our p value is still within our threshold at <.0001.

The parameter estimates:

 Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > |t| Variance Inflation Intercept 1 0.00008640 0.00014092 0.61 0.5403 0 return_AA 1 0.01769 0.01317 1.34 0.1802 2.11490 return_BAC 1 0.03198 0.01165 2.75 0.0064 3.10927 return_BHI 1 -0.00111 0.01323 -0.08 0.9333 2.62997 return_CVX 1 0.04907 0.02536 1.93 0.0539 3.07524 return_DD 1 0.04674 0.02037 2.29 0.0224 2.51406 return_DOW 1 0.03642 0.01162 3.14 0.0019 1.88893 return_DPS 1 0.03670 0.01679 2.19 0.0295 1.54768 return_GS 1 0.04849 0.01555 3.12 0.0020 3.10450 return_HAL 1 0.00948 0.01466 0.65 0.5184 3.08758 return_HES 1 0.00359 0.01092 0.33 0.7425 2.10199 return_HON 1 0.12213 0.01924 6.35 <.0001 2.73505 return_HUN 1 0.02712 0.00836 3.24 0.0013 1.79852 return_JPM 1 0.00902 0.01708 0.53 0.5979 3.36439 return_KO 1 0.07903 0.02226 3.55 0.0004 1.93633 return_MMM 1 0.09796 0.02646 3.70 0.0003 2.98277 return_MPC 1 0.01673 0.00809 2.07 0.0394 1.32999 return_PEP 1 0.02911 0.02231 1.30 0.1929 1.68825 return_SLB 1 0.03776 0.01709 2.21 0.0279 3.13690 return_WFC 1 0.07587 0.01848 4.10 <.0001 2.59492 return_XOM 1 0.05467 0.02697 2.03 0.0435 2.98393

Reviewing the table, it seems weird that our F value is significant yet, the t-values are all non-significant.  Thus, there is a chance that we have multicollinearity taking place.  As such, we evaluate the Variance Inflation Factor (VIF) which measures the dependencies among the regressors.  VIFs values that exceed 5 or 10 are an indication that the associated regression coefficients are poorly estimated due to multicollinearity. With our highest value of 3.36439 which is on the low end of multicollinearity, we do not have an issue.  Some of the p values are also not significant.

As we can see above, the data fits the line well in the Q-Q Plot, Cook’s D has low values with a maximum of .15.  Residuals are scattered in a random matter.

### Using ALL of the Data:

Using all of the data, the diagnostic information on the second model is as follows:

 Analysis of Variance Source DF Sum of Squares Mean Square F Value Pr > F Model 20 0.02543 0.00127 178.96 <.0001 Error 480 0.00341 0.00000711 Corrected Total 500 0.02884
 Root MSE 0.00267 R-Square 0.8818 Dependent Mean 0.000752 Adj R-Sq 0.8768 Coeff Var 354.457

Similar to the first regression model we still have 20 regressors.  Comparing the goodness-of-fit statistics using all of the data, Adjusted R-Square is very similar with a value of .8768 (vs .8919) and an improvement in the F value to 178.96 (vs. 140.04).

 Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > |t| Variance Inflation Intercept 1 0.00009953 0.00012133 0.82 0.4124 0 return_AA 1 0.01538 0.01040 1.48 0.1399 2.02627 return_BAC 1 0.02723 0.00970 2.81 0.0052 2.68654 return_BHI 1 0.01604 0.01161 1.38 0.1678 2.65137 return_CVX 1 0.05742 0.02068 2.78 0.0057 2.92019 return_DD 1 0.01003 0.01625 0.62 0.5371 2.44149 return_DOW 1 0.03600 0.01069 3.37 0.0008 1.96589 return_DPS 1 0.05659 0.01493 3.79 0.0002 1.52563 return_GS 1 0.03434 0.01358 2.53 0.0118 3.19781 return_HAL 1 -0.00198 0.01210 -0.16 0.8703 2.91992 return_HES 1 0.00439 0.00969 0.45 0.6504 2.09606 return_HON 1 0.10707 0.01608 6.66 <.0001 2.45588 return_HUN 1 0.02867 0.00722 3.97 <.0001 1.74391 return_JPM 1 0.02224 0.01329 1.67 0.0948 2.87596 return_KO 1 0.09425 0.01847 5.10 <.0001 1.98165 return_MMM 1 0.10928 0.02202 4.96 <.0001 2.68494 return_MPC 1 0.01079 0.00702 1.54 0.1251 1.37671 return_PEP 1 0.02092 0.02034 1.03 0.3043 1.72066 return_SLB 1 0.04851 0.01453 3.34 0.0009 3.25898 return_WFC 1 0.07738 0.01580 4.90 <.0001 2.53263 return_XOM 1 0.05797 0.02301 2.52 0.0121 2.94983

As reflected in the table above, the maximum VIF value is 3.19781 which is a decrease from 3.36439. Also, some of the p values are not significant.  Thus, this regression model appears to be performing very similar to the training model.

Evaluating the diagnostic plots, everything is very similar to the analysis provided in the first regression model using the training data with a minor change in the scale of the Cook’s D plot with the scale changing from .15 to .125 and a more compressed scale on the Q-Q plot.

Overall, the diagnostic plots look good.  Based on the above, this model does not have multicollinearity issues.  For ease of comparison, the mean square error and mean absolute error will be discussed in part 8.

## Part 8:  Build a Regression Model Using Principal Components

Principal component analysis is used to reduce the dimensionality of a data set consisting of a large number of interrelated variables while trying to retain the variation in the data set.  By transforming to a new set of variables, the uncorrelated principal components are ordered so that the first few retain most of the variation present in all of the original variables (Jolliffe, I.T., 2010.  Principal Component Analysis, 2nd edition). Using the top eight principal components to build the model based on the training and test data to determine which model fits better.

### Training Dataset:

Fitting a regression model to eight selected principal components, the diagnostic information is as follows:

 Number of Observations Read 502 Number of Observations Used 338 Number of Observations with Missing Values 164
 Root MSE 0.00257 R-Square 0.8913 Dependent Mean 0.00061635 Adj R-Sq 0.8886 Coeff Var 416.365
 Analysis of Variance Source DF Sum of Squares Mean Square F Value Pr > F Model 8 0.01776 0.00222 337.13 <.0001 Error 329 0.00217 0.00000659 Corrected Total 337 0.01993

Comparing the above information to the prior models, we can see that the Adjusted R-Square is similar to the other models (.8919 and .8768) and an increase in the F Value to 337.13 (an increase from 140.04 and 178.96) in the prior two models using all 20 regressors.

 Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > |t| Variance Inflation Intercept 1 0.00075978 0.00014045 5.41 <.0001 0 Prin1 1 0.00231 0.00004519 51.05 <.0001 1.00527 Prin2 1 0.00032245 0.00011425 2.82 0.0051 1.00868 Prin3 1 0.00070635 0.00012322 5.73 <.0001 1.00861 Prin4 1 0.00030481 0.00014536 2.10 0.0368 1.00636 Prin5 1 -0.00017356 0.00015516 -1.12 0.2641 1.00297 Prin6 1 0.00000315 0.00017108 0.02 0.9853 1.00766 Prin7 1 -0.00010331 0.00018604 -0.56 0.5791 1.02315 Prin8 1 -0.00040760 0.00020293 -2.01 0.0454 1.02271

From the above table, we see some of the p values/predictors are not significant.  This is especially true in the independent variables four through eight.

The diagnostic plots are very similar to the prior model.

### Test/Full Dataset:

Similar to the above but now applying the eight principal components to the full data set:

 Number of Observations Read 502 Number of Observations Used 501 Number of Observations with Missing Values 1
 Root MSE 0.00273 R-Square 0.8724 Dependent Mean 0.000752 Adj R-Sq 0.8703 Coeff Var 363.668
 Analysis of Variance Source DF Sum of Squares Mean Square F Value Pr > F Model 8 0.02516 0.00315 420.52 <.0001 Error 492 0.00368 0.00000748 Corrected Total 500 0.02884

 Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > |t| Variance Inflation Intercept 1 0.00075200 0.00012218 6.15 <.0001 0 Prin1 1 0.00227 0.00003940 57.55 <.0001 1.00000 Prin2 1 0.00045862 0.00009860 4.65 <.0001 1.00000 Prin3 1 0.00050372 0.00010536 4.78 <.0001 1.00000 Prin4 1 0.00015525 0.00012563 1.24 0.2171 1.00000 Prin5 1 -0.00018175 0.00013713 -1.33 0.1856 1.00000 Prin6 1 0.00008231 0.00014982 0.55 0.5830 1.00000 Prin7 1 -0.00003961 0.00016367 -0.24 0.8088 1.00000 Prin8 1 -0.00035470 0.00017080 -2.08 0.0383 1.00000

In comparison to the test data, the first four p values/predictors are now significant.  The VIF values are all one’s because all the principal components are independent with a correlation of 0, therefore minimum VIF is 1.  A VIF of 1 is ideal as it reflects that the regressors are completely unrelated. Principal Components Analysis -Full Data Set with Eight Components Only

The above plots are also very similar.  The residual plot has a few outliers, the Q-Q plot fits the line and the Cook’s D is small with a scale maximum of 0.15.

Overall, there are only a few minor differences between the principal components model and the full data.  Additionally, we don’t see any red flags in the diagnostics plots.  The Adjusted R Squares are very similar and the VIF is exactly one.  Based on the above diagnostical information and plots we have no reason to believe there are any multicollinearity issues.

### Model Summaries:

Evaluating the two models, model1 using all of the variables and model2 using only eight principal components, we can make several different comparisons.

Model 1 based on All Variables Model 2 based on Eight Principal Components
Model 1:    Test Data    Train=0 Model 2:  Test Data     Train=0
 Prediction_Grade Frequency Percent Cumulative Frequency Cumulative Percent Grade 1 15 9.15 15 9.15 Grade 2 5 3.05 20 12.20 Grade 3 144 87.80 164 100.00
 Prediction_Grade Frequency Percent Cumulative Frequency Cumulative Percent Grade 1 16 9.76 16 9.76 Grade 2 8 4.88 24 14.63 Grade 3 140 85.37 164 100.00
Model 1:  Training Data      Train=1 Model 2:   Training Data    Train=1
 Prediction_Grade Frequency Percent Cumulative Frequency Cumulative Percent Grade 1 31 9.17 31 9.17 Grade 2 17 5.03 48 14.20 Grade 3 290 85.80 338 100.00
 Prediction_Grade Frequency Percent Cumulative Frequency Cumulative Percent Grade 1 31 9.17 31 9.17 Grade 2 13 3.85 44 13.02 Grade 3 294 86.98 338 100.00
Model 1:   All Data Model 2:   All Data
 Prediction_Grade Frequency Percent Cumulative Frequency Cumulative Percent Grade 1 52 10.36 52 10.36 Grade 2 23 4.58 75 14.94 Grade 3 427 85.06 502 100.00
 Prediction_Grade Frequency Percent Cumulative Frequency Cumulative Percent Grade 1 52 10.36 52 10.36 Grade 2 17 3.39 69 13.75 Grade 3 433 86.25 502 100.00
 Obs _TYPE_ _FREQ_ MSE_1 MAE_1 1 0 502 .000006807 .001949987
 Obs _TYPE_ _FREQ_ MSE_2 MAE_2 1 0 502 .000007345 .002020074

The prediction grades have three levels with Grade 1 being plus/minus 10% error, Grade 2 being plus/minus 15% error and grade three having a zero error.  Utilizing the prediction grades as shown in the summary table, we can see model 1 based on the test data has a prediction grade of 12.2% of the time within plus/minus 15% error.  Also on model 1 based on the training data, a 14.2% within plus/minus 15% error using all of the variables and finally, using all of the data in model 1, 14.94% with a plus/minus 15% error rate .   Conversely, in model two, 14.63% of the time we are within +/- 15% of error in test set and in the training set, 13.02% of the time we are within +/15% error and 13.75% using all of the data plus/minus 15% error.

Based on our side-by-side comparison, the mean square error and mean average error are performing better in model 1 and using the test data the +/- 15% error is lower but on the training set model 2 is better.  However, neither of the differences are large (12.2% v. 14.63% and 14.2% vs. 13.2%).  Additionally, based on the full data, the mean square error and mean average error are better along with the +/- 15% error.

Thus, we can see that the models are comparable with only a small change of 1.19% (14.94% vs 13.75%) in the prediction grades using the full data.  Compare full test set and training set – are they all in the same ball park as they are very close in value.   Thus, depending on the business requirement if the +/-15% error in the model is acceptable then will be accept this model.

Whether the above is satisfactory is not depend on our business requirements.  If the requirement was 20% of the time to be within +/-15% then we have met the desired threshold.  However, if the threshold is 10% then we have not met the requirement and further models need to be develop.

### Mean Square Error and Mean Absolute Error:

Evaluating the mean square error and mean absolute error for model 1 and model 2 is as follows:

 Obs train MSE_1 MAE_1 MSE_2 MAE_2 1 0 .000009306 .002144904 .000009677 .002179249 2 1 .000005994 .001902032 .000006410 .001975239

As we can see, there is a very small difference in the mean square error in both the test and training data and similarly, in the mean average error.  Thus, the difference is accuracy is very small and not a concern. Also, referring to the summary table, there is a small change in model 1 versus model 2 but is not significant.

Thus, whether we are using the training, test or full data, all of the results are comparable with no large swings between any of the goodness-of-fit indicators.

Based on the above analysis, model 2 is the better model as the eight-principal component model is performing well and is the better choice over the full data model with 20 variables.  For the insignificant difference, the number of parameters and perhaps intertwined with related costs in attaining the data makes better business sense.  Additionally, model 2 is simpler which is preferred.

# Conclusion

In conclusion, the stock portfolio data set contained the daily closing price for 20 stocks and indexed funds from Vanguard beginning Jan 3, 2012 to Dec 31, 2013.  These values were transformed into log values for the purpose of explaining the variation in the log-returns of the stocks and market index.  Correlations between the individual stocks and the market index were conducted with various formats of data tables and plots to visualize the correlations between the stocks and industry sectors. Principal components were analyzed to determine which components to keep and were used in a regression model.  Comparing the full data to the data set that was split into a training and test set was also analyzed and contrasted. Additionally, a fitted regression model was analyzed to determine its goodness-of-fit and used to determine if multicollinearity issues existed.  Finally, using eight principal components, a fitted regression model and an analysis on the variance inflation factors were discussed leading into final determination on the best model due to its simplicity which was model 2 the principal component model.

Taking a step back and looking at the forest instead of the trees, we used the stock portfolio data to create a fitted model using all of the regressors.  Later after we performed an analysis on the model to determine if multicollinearity existed and transformed the regressions into log values for ease of comparison, we applied the principal components to explain 80% of the variation by choosing only eight components.  Decreasing the number of components from 20 to 8, simplified the model with minimal loss in the goodness-of-fit statistics.   Whether eight components are the right number depends on the business requirements and the impact on the plus/minus 15% error rate threshold as desired by the company.  If the threshold is not met, further models will need to be completed and if the threshold is met, than no further work is required.