# Factor Analysis

## Introduction

Using a stock portfolio data set, we apply factor analysis to identify sectors in the stock market. The closing prices are first transformed into log-returns so that the analysis explains the variation in the log-returns of the stocks and the market index. We begin with a Principal Factor Analysis without a factor rotation and then apply a varimax rotation. Next, we use maximum likelihood estimation to estimate the common factors, again with a varimax rotation. Finally, the maximum likelihood analysis with a varimax rotation is repeated with the ‘max’ argument for the ‘priors’ option in SAS, and the outputs are analyzed and compared against the earlier results.

## Step 1:  Data & Transformation of Data

Using the dataset stock_portfolio, we can view the daily closing prices for 12 stocks and an index fund from Vanguard (VV) from Jan 3, 2012 to Dec 31, 2013. New variables are created to normalize the data by calculating the daily log-return of each stock, for example, return_AA = log(AA/lag1(AA)), where lag1 returns the closing price of the previous trading day. The response or dependent variable, ‘VV’, is the log-return of the market index, which gives us a benchmark for explaining the variation in the individual stock returns. If we did not transform the stock prices, the stocks would remain at their closing dollar values, which would not provide a basis for comparing each stock's daily return, as the ‘measuring stick’ would vary by stock.

Using log-returns of each stock we will explain the variation in the log-returns of the market index using principal components analysis.

The results show 502 observations, 14 variables.  A quick overview:

| Obs | Date | BAC | BHI | CVX | DD | VV | return_BAC | return_BHI | return_CVX | return_DD | response_VV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 03JAN2012 | 5.8 | 51.02 | 110.37 | 46.51 | 58.18 | . | . | . | . | . |
| 2 | 04JAN2012 | 5.81 | 51.53 | 110.18 | 47.02 | 58.25 | 0.001723 | 0.009946 | -0.001723 | 0.010906 | 0.001202439 |
| 3 | 05JAN2012 | 6.31 | 50.82 | 109.1 | 46.7 | 58.44 | 0.082555 | -0.013874 | -0.009850 | -0.006829 | 0.003256494 |
| 4 | 06JAN2012 | 6.18 | 51.26 | 108.31 | 46.04 | 58.32 | -0.020817 | 0.008621 | -0.007267 | -0.014234 | -.002055499 |
| 5 | 09JAN2012 | 6.27 | 51.58 | 109.49 | 46.43 | 58.45 | 0.014458 | 0.006223 | 0.010836 | 0.008435 | 0.002226600 |

A snippet of the log returns:

| Obs | return_BAC | return_BHI | return_CVX | return_DD | return_DOW | return_HAL | return_HES | return_HUN | return_JPM |
|---|---|---|---|---|---|---|---|---|---|
| 1 | . | . | . | . | . | . | . | . | . |
| 2 | 0.001723 | 0.009946 | -0.001723 | 0.010906 | 0.005357 | 0.028008 | 0.010222 | -0.008073 | -0.000858 |
| 3 | 0.082555 | -0.013874 | -0.009850 | -0.006829 | 0.006324 | -0.016074 | -0.024015 | -0.005079 | 0.020672 |
| 4 | -0.020817 | 0.008621 | -0.007267 | -0.014234 | 0.005954 | 0.012080 | -0.020699 | 0.008114 | -0.009009 |
| 5 | 0.014458 | 0.006223 | 0.010836 | 0.008435 | -0.000330 | 0.011370 | 0.008472 | -0.006079 | -0.001698 |
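The log-return transformation can be illustrated outside SAS. The following is a minimal Python sketch, not the SAS data step used for the analysis; it assumes the five BAC closing prices shown in the first table, and the names `bac_close` and `log_returns` are hypothetical.

```python
import math

# BAC closing prices for the first five trading days, copied from the table above.
bac_close = [5.80, 5.81, 6.31, 6.18, 6.27]


def log_returns(prices):
    """Return log(P_t / P_{t-1}) for each day; the first day has no prior price."""
    returns = [None]  # mirrors the missing value (".") on the first observation
    for previous, current in zip(prices, prices[1:]):
        returns.append(math.log(current / previous))
    return returns


return_bac = log_returns(bac_close)
for obs, value in enumerate(return_bac, start=1):
    print(obs, "." if value is None else f"{value:.6f}")
```

The printed values should agree with the return_BAC column to about six decimal places, which suggests that lag1 simply supplies the previous day's closing price.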

## Step 2:  Principal Factor Analysis without Rotation

Principal factor analysis is similar to principal components analysis except that it works not on the covariance matrix of the observed variables but on the reduced covariance matrix, which is obtained by subtracting a diagonal matrix whose entries are estimates of the specific variances, so that the diagonal holds the communality estimates. Additionally, factor analysis does not try to account for all of the observed variance but only for the part that is shared through the common factors. The focus of principal factor analysis is therefore on accounting for the covariances or correlations between the manifest variables rather than their variances (Everitt, B.  Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences. P. 216). Performing a principal factor analysis in SAS and letting SAS determine the number of factors to retain gives the following results:

Eigenvalues of the Reduced Correlation Matrix: Total = 6.86244298  Average = 0.57187025

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 6.04732583 | 5.16261770 | 0.8812 | 0.8812 |
| 2 | 0.88470813 | 0.52262870 | 0.1289 | 1.0101 |
| 3 | 0.36207942 | 0.05735386 | 0.0528 | 1.0629 |
| 4 | 0.30472556 | 0.29429115 | 0.0444 | 1.1073 |
| 5 | 0.01043441 | 0.06365245 | 0.0015 | 1.1088 |
| 6 | -.05321803 | 0.01517115 | -0.0078 | 1.1011 |
| 7 | -.06838918 | 0.03291807 | -0.0100 | 1.0911 |
| 8 | -.10130725 | 0.01600696 | -0.0148 | 1.0763 |
| 9 | -.11731422 | 0.00866270 | -0.0171 | 1.0593 |
| 10 | -.12597692 | 0.01040221 | -0.0184 | 1.0409 |
| 11 | -.13637913 | 0.00786652 | -0.0199 | 1.0210 |
| 12 | -.14424565 | | -0.0210 | 1.0000 |

Based on the proportion criterion and the dataset, two factors will be retained.

For reference, the prior communality estimates are the squared multiple correlations, which are usually less than one; the correlation matrix with these estimates on its diagonal is referred to as the reduced correlation matrix. Generally, if the squared multiple correlations are large, one would expect the principal factor analysis and the principal component analysis to give similar results.

As we can see, the first two eigenvalues have a cumulative proportion of 1.0101. This indicates that the variables within our model are all highly correlated with the other variables within the model.

Retaining only two factors, we can see from the scree plot that we have reached the elbow of the line, where it bends abruptly. Additionally, the Variance Explained plot has a cumulative variance explained that exceeds 100%, which seems unusual but is not an error, because the reduced correlation matrix is not necessarily positive semidefinite. As a result, some of its eigenvalues can be negative. The cumulative proportion also flattens out after the first two factors, reflecting that more than two factors are not necessary, as two factors are sufficient to support our hypothesis.
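Why a reduced correlation matrix can push the cumulative proportion above 100% is easiest to see on a toy case. The sketch below does not use the stock data; it assumes a hypothetical set of `p` variables that all share the same pairwise correlation `r`, chosen only because the squared multiple correlation and the eigenvalues then have simple closed forms.

```python
# Toy example (not the stock data): p variables with a common correlation r.
p = 4    # hypothetical number of variables
r = 0.6  # hypothetical common pairwise correlation

# Squared multiple correlation of any one variable with the other p - 1 variables.
smc = (p - 1) * r ** 2 / (1 + (p - 2) * r)

# The reduced correlation matrix has smc on the diagonal and r everywhere else.
# Such a matrix has eigenvalue smc + (p - 1) * r once and smc - r repeated p - 1 times.
eigenvalues = [smc + (p - 1) * r] + [smc - r] * (p - 1)

total = sum(eigenvalues)  # equals the trace of the reduced matrix, p * smc
cumulative = []
running = 0.0
for eigenvalue in eigenvalues:
    running += eigenvalue
    cumulative.append(running / total)

print("SMC:", round(smc, 4))
print("eigenvalues:", [round(e, 4) for e in eigenvalues])
print("cumulative proportions:", [round(c, 4) for c in cumulative])
```

Because the squared multiple correlation is smaller than the common correlation here, the trailing eigenvalues are negative, so the first factor alone accounts for more than 100% of the total and the cumulative proportion only falls back to 1 at the last eigenvalue. The same pattern appears in the SAS output above.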

For principal components, SAS usually keeps the components whose eigenvalue is greater than one, often referred to as the ‘eigenvalue-one criterion’.  Another criterion is the proportion of variance accounted for, under which any component that accounts for at least 5% or 10% of the total variance is retained.  The proportion formula is:

Proportion = Eigenvalue for the factor of interest / Total of the eigenvalues of the reduced correlation matrix.  Hence, the proportion for our second factor is .884/6.86 = 12.89%, and it is retained.
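The Proportion and Cumulative columns can be recomputed directly from the eigenvalues. The sketch below copies the twelve eigenvalues from the SAS output; the retention rule in it (keep factors until the cumulative proportion first reaches a threshold of 1.0) is an assumption made for illustration and is not read from the SAS log.

```python
# Eigenvalues of the reduced correlation matrix, copied from the SAS output above.
eigenvalues = [
    6.04732583, 0.88470813, 0.36207942, 0.30472556, 0.01043441, -0.05321803,
    -0.06838918, -0.10130725, -0.11731422, -0.12597692, -0.13637913, -0.14424565,
]

total = sum(eigenvalues)
proportions = [eigenvalue / total for eigenvalue in eigenvalues]

cumulative = []
running = 0.0
for proportion in proportions:
    running += proportion
    cumulative.append(running)

# Assumed rule: retain factors until the cumulative proportion reaches the threshold.
threshold = 1.0
n_retained = next(i + 1 for i, c in enumerate(cumulative) if c >= threshold)

for i, (ev, prop, cum) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"{i:2d} {ev:12.8f} {prop:8.4f} {cum:8.4f}")
print("factors retained under the assumed rule:", n_retained)
```

Under that assumed threshold the rule stops at two factors, which agrees with the number SAS retained.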

Factor Pattern

| Variable | Factor1 | Factor2 |
|---|---|---|
| return_BAC | 0.68475 | 0.36021 |
| return_BHI | 0.69984 | -0.39498 |
| return_CVX | 0.77402 | -0.10833 |
| return_DD | 0.71605 | 0.16703 |
| return_DOW | 0.64548 | 0.19801 |
| return_HAL | 0.72630 | -0.38221 |
| return_HES | 0.70361 | -0.15709 |
| return_HUN | 0.58030 | 0.18186 |
| return_JPM | 0.67874 | 0.34813 |
| return_SLB | 0.79382 | -0.30815 |
| return_WFC | 0.72445 | 0.30517 |
| return_XOM | 0.76500 | -0.08361 |

### Unrotated Factor Pattern

As shown in the initial Factor Pattern plot, the unrotated factors form one big cluster of variables, with some having positive Factor 2 loadings (quadrant 1) and others negative ones (quadrant 4).  Factor 1 accounts for 87.24% of the variance explained by the two factors and Factor 2 for 12.76%, making Factor 1 the dominant factor.  Ideally, the variance explained would be balanced between the factors.  The above also reflects that every variable loads highly on Factor 1, with little variation among the loadings, while the loadings on Factor 2 are low.
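The 87.24% and 12.76% shares can be reproduced from the unrotated loadings, since the variance explained by a factor is the sum of its squared loadings. The sketch below copies the Factor Pattern table; the variable names are illustrative only.

```python
# Unrotated loadings (Factor1, Factor2), copied from the Factor Pattern table.
loadings = {
    "return_BAC": (0.68475, 0.36021),
    "return_BHI": (0.69984, -0.39498),
    "return_CVX": (0.77402, -0.10833),
    "return_DD": (0.71605, 0.16703),
    "return_DOW": (0.64548, 0.19801),
    "return_HAL": (0.72630, -0.38221),
    "return_HES": (0.70361, -0.15709),
    "return_HUN": (0.58030, 0.18186),
    "return_JPM": (0.67874, 0.34813),
    "return_SLB": (0.79382, -0.30815),
    "return_WFC": (0.72445, 0.30517),
    "return_XOM": (0.76500, -0.08361),
}

# Variance explained by each factor = sum of squared loadings in that column.
variance_explained = [
    sum(pair[j] ** 2 for pair in loadings.values()) for j in range(2)
]
shares = [v / sum(variance_explained) for v in variance_explained]

print("variance explained:", [round(v, 4) for v in variance_explained])
print("share of common variance:", [f"{s:.2%}" for s in shares])
```

Up to rounding in the printed loadings, the two sums match the first two eigenvalues of the reduced correlation matrix, and their ratio gives the shares quoted above.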

### Factor Pattern

Evaluating the factor solution, the largest loading on factor 1 is return_SLB and the smallest is return_HUN; on factor 2 the largest is return_BAC and the smallest is the negative loading for return_BHI.

The variance explained by each factor is as follows:

| Factor1 | Factor2 |
|---|---|
| 6.0473258 | 0.8847081 |

As per R.J. Rummel (2017), a simple factor structure has a number of characteristics to evaluate, including:

• Each variable is identified with one or a small proportion of the factors, so that each factor accounts for the variation of a distinct group of variables.
• The number of variables that have a high loading on a factor is minimized. The rotation tries to define a small number of distinct clusters.
• The model should be simplified: if simpler factors can be used, the principle of parsimony favours them, with the factors grouping different sets of variables.
• The goal is to generalize the factor results. The unrotated factor solution depends on all of the variables, whereas the rotated solution is adjusted so that the factors will be invariant of the variables selected – that is, the factor solution will delineate the same clusters of relationships regardless of the extraneous variables included in the analysis.
• In an orthogonal simple structure rotation, the more correlated the clusters are, the more difficult it is to rotate the factors so that the clusters can be discriminated. Thus, simple structure can only be approximated but not achieved.

Based on the above, the common factors do not exhibit a simple factor structure.

## Step 3:  Varimax Rotation to the Principal Factor Analysis

One type of rotation is the varimax rotation, which is an orthogonal rotation, meaning it results in uncorrelated factors.  Varimax rotation maximizes the variance of the squared loadings within each column, rather than each row, of the factor pattern.  Applying the varimax rotation to the principal factor analysis postmultiplies the unrotated loading matrix by an orthogonal transformation matrix.  Two factors are again retained.  The output is as follows:

Orthogonal Transformation Matrix

| | 1 | 2 |
|---|---|---|
| 1 | 0.70781 | 0.70640 |
| 2 | 0.70640 | -0.70781 |

Variance Explained by Each Factor

| Factor1 | Factor2 |
|---|---|
| 3.4711423 | 3.4608916 |

Rotated Factor Pattern

| Variable | Factor1 | Factor2 |
|---|---|---|
| return_BAC | 0.73912 | 0.22875 |
| return_BHI | 0.21634 | 0.77394 |
| return_CVX | 0.47133 | 0.62344 |
| return_DD | 0.62482 | 0.38759 |
| return_DOW | 0.59675 | 0.31582 |
| return_HAL | 0.24408 | 0.78359 |
| return_HES | 0.38705 | 0.60822 |
| return_HUN | 0.53921 | 0.28120 |
| return_JPM | 0.72634 | 0.23305 |
| return_SLB | 0.34419 | 0.77886 |
| return_WFC | 0.72835 | 0.29575 |
| return_XOM | 0.48241 | 0.59958 |

Final Communality Estimates: Total = 6.932034

| Variable | Communality |
|---|---|
| return_BAC | 0.59863104 |
| return_BHI | 0.64577915 |
| return_CVX | 0.61083713 |
| return_DD | 0.54062043 |
| return_DOW | 0.45584934 |
| return_HAL | 0.67359085 |
| return_HES | 0.51974549 |
| return_HUN | 0.36982204 |
| return_JPM | 0.58188382 |
| return_SLB | 0.72509913 |
| return_WFC | 0.61795857 |
| return_XOM | 0.59221697 |
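The relationship between the unrotated and rotated patterns can be checked by hand: each rotated row is the unrotated row postmultiplied by the orthogonal transformation matrix. The sketch below does this for three of the stocks, using values copied from the SAS output; it only applies the reported matrix and does not derive it.

```python
# Orthogonal transformation matrix reported by SAS (rows: unrotated factor 1 and 2).
T = [
    [0.70781, 0.70640],
    [0.70640, -0.70781],
]

# Unrotated loadings for three stocks, copied from the Step 2 Factor Pattern.
unrotated = {
    "return_BAC": (0.68475, 0.36021),
    "return_HAL": (0.72630, -0.38221),
    "return_SLB": (0.79382, -0.30815),
}


def postmultiply(row, matrix):
    """Return the 1x2 row vector times the 2x2 matrix."""
    x, y = row
    return (
        x * matrix[0][0] + y * matrix[1][0],
        x * matrix[0][1] + y * matrix[1][1],
    )


rotated = {name: postmultiply(row, T) for name, row in unrotated.items()}

for name, (f1, f2) in rotated.items():
    communality = f1 ** 2 + f2 ** 2  # unchanged by an orthogonal rotation
    print(f"{name}: {f1:.5f} {f2:.5f} communality={communality:.5f}")
```

The results should agree with the Rotated Factor Pattern and the Final Communality Estimates to roughly four decimal places, the small differences coming from rounding in the printed matrix.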

The rotated factor pattern can be interpreted as a factor loadings matrix of correlations between the variables and the factors; for example, return_BAC has a high correlation of .73912 with factor 1 and a low correlation of .22875 with factor 2.  As we can see from the rotated factor pattern, the transformation leads to large loadings for return_WFC, return_JPM, return_BAC and return_DD on the first factor and, similarly, large loadings for return_HAL, return_SLB and return_BHI on factor 2.  When compared to the loadings without a factor rotation, the values have shifted, as expected from the rotation.  The variables whose loadings changed the most are return_BHI, return_HAL, return_HES, return_SLB, return_CVX and return_XOM.

As per the simple structure characteristics listed above, the rotation gives us a simple structure, as the stock clusters relate to industry sectors, with factor 1 containing a cluster of banking stocks and factor 2 a cluster of oil stocks.  The rotated solution is also significantly easier to interpret: all of the variables now lie in the first quadrant with positive loadings on both factors instead of a mix of positive and negative values, and the clusters naturally follow the industry sectors.
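To illustrate how a varimax rotation moves the loadings toward such a pattern, the sketch below applies the closed-form two-factor rotation angle to the unrotated loadings from Step 2. It is a simplified illustration, not the SAS implementation: it uses the raw criterion without Kaiser normalization, so the rotated values, column order and signs may differ from the SAS output. The function names are hypothetical.

```python
import math

# Unrotated loadings (Factor1, Factor2) from Step 2.
loadings = [
    (0.68475, 0.36021), (0.69984, -0.39498), (0.77402, -0.10833),
    (0.71605, 0.16703), (0.64548, 0.19801), (0.72630, -0.38221),
    (0.70361, -0.15709), (0.58030, 0.18186), (0.67874, 0.34813),
    (0.79382, -0.30815), (0.72445, 0.30517), (0.76500, -0.08361),
]


def varimax_criterion(rows):
    """Sum over the two factors of the variance of the squared loadings."""
    p = len(rows)
    total = 0.0
    for j in range(2):
        squared = [row[j] ** 2 for row in rows]
        mean = sum(squared) / p
        total += sum((s - mean) ** 2 for s in squared) / p
    return total


def varimax_angle(rows):
    """Angle that maximizes the raw varimax criterion for a pair of factors."""
    p = len(rows)
    u = [x * x - y * y for x, y in rows]
    v = [2 * x * y for x, y in rows]
    a, b = sum(u), sum(v)
    c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    d = 2 * sum(ui * vi for ui, vi in zip(u, v))
    return math.atan2(d - 2 * a * b / p, c - (a * a - b * b) / p) / 4


def rotate(rows, phi):
    """Apply a planar rotation by phi to every (Factor1, Factor2) pair."""
    cos_phi, sin_phi = math.cos(phi), math.sin(phi)
    return [(x * cos_phi + y * sin_phi, -x * sin_phi + y * cos_phi) for x, y in rows]


phi = varimax_angle(loadings)
rotated = rotate(loadings, phi)

print("rotation angle (degrees):", round(math.degrees(phi), 2))
print("criterion before:", round(varimax_criterion(loadings), 5))
print("criterion after: ", round(varimax_criterion(rotated), 5))
```

The criterion rises after the rotation while each stock's communality stays the same, which is the sense in which varimax spreads the loadings within each column without changing how much variance the factors explain.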

The path diagram, as shown above, is a graphical representation of the observed variables (shown in rectangular boxes) and the unobserved factors (shown in ovals), along with the associated loadings.  The arrow above the rectangle of each observed variable reflects its uniqueness, which is the amount of variance in the stock that is not explained by the factors; it is simply 1 minus the explained variance.  For example, the uniqueness of return_HAL is 1 - .67359 (from the Final Communality Estimates) = .33.  Factors 1 and 2 are independent, as represented by the double-sided arrows and the value of 1.  As you can see, not every factor has an arrow to every observed variable.
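The uniqueness calculation for return_HAL extends to every stock. The sketch below copies the Final Communality Estimates and subtracts each from one; the dictionary names are illustrative.

```python
# Final communality estimates, copied from the Step 3 output.
communality = {
    "return_BAC": 0.59863104, "return_BHI": 0.64577915, "return_CVX": 0.61083713,
    "return_DD": 0.54062043, "return_DOW": 0.45584934, "return_HAL": 0.67359085,
    "return_HES": 0.51974549, "return_HUN": 0.36982204, "return_JPM": 0.58188382,
    "return_SLB": 0.72509913, "return_WFC": 0.61795857, "return_XOM": 0.59221697,
}

# Uniqueness = 1 - communality: the share of variance the factors leave unexplained.
uniqueness = {name: 1.0 - h2 for name, h2 in communality.items()}

for name, u2 in sorted(uniqueness.items(), key=lambda item: item[1]):
    print(f"{name}: {u2:.2f}")
```

Sorted this way, return_SLB has the smallest uniqueness and return_HUN the largest, so the two factors explain the least about return_HUN.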

## Step 4:  Maximum Likelihood Factor Analysis with Varimax Rotation

Maximum likelihood estimation assumes that the observed variables follow a multivariate normal distribution and estimates the factor loadings and specific variances by iteratively maximizing the likelihood until a convergence criterion is satisfied; the rotation is applied only after the unrotated solution has converged.  Because the method is likelihood based, goodness-of-fit criteria can be applied to judge whether the retained number of factors is adequate.  Using maximum likelihood estimation to estimate the common factors with the varimax rotation, the same number of factors as in the Principal Factor Analysis is retained: two.  Similar to before, the prior communality estimates are:

Prior Communality Estimates: SMC

| Variable | Prior communality |
|---|---|
| return_BAC | 0.58577906 |
| return_BHI | 0.61046627 |
| return_CVX | 0.64179539 |
| return_DD | 0.54681402 |
| return_DOW | 0.47197670 |
| return_HAL | 0.64986770 |
| return_HES | 0.49976057 |
| return_HUN | 0.39225125 |
| return_JPM | 0.58034671 |
| return_SLB | 0.68269067 |
| return_WFC | 0.57372531 |
| return_XOM | 0.62696933 |

The eigenvalues are:

Preliminary Eigenvalues: Total = 16.9350893  Average = 1.41125745

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 14.9446192 | 12.7338755 | 0.8825 | 0.8825 |
| 2 | 2.2107436 | 1.3691513 | 0.1305 | 1.0130 |
| 3 | 0.8415924 | 0.1303533 | 0.0497 | 1.0627 |
| 4 | 0.7112391 | 0.6898126 | 0.0420 | 1.1047 |
| 5 | 0.0214265 | 0.1469738 | 0.0013 | 1.1060 |
| 6 | -0.1255473 | 0.0709849 | -0.0074 | 1.0986 |
| 7 | -0.1965322 | 0.0239638 | -0.0116 | 1.0869 |
| 8 | -0.2204960 | 0.0345938 | -0.0130 | 1.0739 |
| 9 | -0.2550898 | 0.0286557 | -0.0151 | 1.0589 |
| 10 | -0.2837455 | 0.0587253 | -0.0168 | 1.0421 |
| 11 | -0.3424708 | 0.0281790 | -0.0202 | 1.0219 |
| 12 | -0.3706498 | | -0.0219 | 1.0000 |
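The preliminary eigenvalues are much larger than those in Step 2 because they appear to come from a weighted version of the reduced correlation matrix, in which each communality is divided by the corresponding uniqueness. That reading is an inference from the numbers rather than something stated in the output; the sketch below checks it by recomputing the reported total and average from the prior communality estimates.

```python
# Prior communality estimates (SMC), copied from the table above.
prior_communality = [
    0.58577906, 0.61046627, 0.64179539, 0.54681402, 0.47197670, 0.64986770,
    0.49976057, 0.39225125, 0.58034671, 0.68269067, 0.57372531, 0.62696933,
]

# Assumed weighting: communality divided by uniqueness, h^2 / (1 - h^2).
weighted_diagonal = [h2 / (1.0 - h2) for h2 in prior_communality]

total = sum(weighted_diagonal)  # trace, hence the sum of the eigenvalues
average = total / len(weighted_diagonal)

print("total:  ", round(total, 4))
print("average:", round(average, 4))
```

The recomputed total and average agree with the reported Total = 16.9350893 and Average = 1.41125745, which supports the weighted interpretation.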

### Goodness-of-Fit Statistics

The ML factor analysis provides much more statistical information on the null and alternative hypotheses than the Principal Factor Analysis.  Unlike the principal factor analysis with varimax rotation, the maximum likelihood factor analysis with varimax rotation provides hypothesis tests from both a macro perspective (whether there are any common factors) and a micro perspective (whether the retained number of factors is sufficient), along with goodness-of-fit statistics:

Significance Tests Based on 501 Observations

| Test | DF | Chi-Square | Pr > ChiSq |
|---|---|---|---|
| H0: No common factors; HA: At least one common factor | 66 | 3656.2617 | <.0001 |
| H0: 2 Factors are sufficient; HA: More factors are needed | 43 | 319.3192 | <.0001 |

| Statistic | Value |
|---|---|
| Chi-Square without Bartlett’s Correction | 323.307 |
| Akaike’s Information Criterion | 237.307 |
| Schwarz’s Bayesian Criterion | 55.9926 |
| Tucker and Lewis’s Reliability Coefficient | 0.88187 |

Squared Canonical Correlations

| Factor1 | Factor2 |
|---|---|
| 0.94176593 | 0.73146692 |

As reflected above, we reject both null hypotheses: there is at least one common factor, and the test indicates that more than two factors are needed.  Additionally, we can view Akaike’s Information Criterion (AIC) and Schwarz’s Bayesian Criterion (BIC) and compare them across models, where the model with the lowest value is preferred (BIC and AIC penalize the model based on the number of parameters in the model).
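The reported AIC and BIC values can be reproduced from the chi-square without Bartlett's correction, the degrees of freedom and the number of observations. The formulas in the sketch below are inferred from the reported numbers and should be read as a consistency check under that assumption, not as a statement of how SAS documents them; the three-factor inputs come from the comparison later in Step 5.

```python
import math


def aic(chi_square, df):
    """Assumed form: chi-square minus twice the degrees of freedom."""
    return chi_square - 2 * df


def sbc(chi_square, df, n_obs):
    """Assumed form: chi-square minus log(n) times the degrees of freedom."""
    return chi_square - math.log(n_obs) * df


n_obs = 501

# Two-factor model: chi-square without Bartlett's correction and its DF.
aic_2, sbc_2 = aic(323.307, 43), sbc(323.307, 43, n_obs)

# Three-factor model, reported later in Step 5.
aic_3, sbc_3 = aic(158.15, 33), sbc(158.15, 33, n_obs)

print("2 factors:", round(aic_2, 3), round(sbc_2, 3))
print("3 factors:", round(aic_3, 3), round(sbc_3, 3))
```

Both criteria are lower for the three-factor model, which is consistent with the significance test's indication that two factors are not enough.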

Eigenvalues of the Weighted Reduced Correlation Matrix: Total = 18.8960127  Average = 1.57466772

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 16.1720778 | 13.4481419 | 0.8558 | 0.8558 |
| 2 | 2.7239360 | 1.9046476 | 0.1442 | 1.0000 |
| 3 | 0.8192884 | 0.2494707 | 0.0434 | 1.0434 |
| 4 | 0.5698176 | 0.4658764 | 0.0302 | 1.0735 |
| 5 | 0.1039412 | 0.1141060 | 0.0055 | 1.0790 |
| 6 | -0.0101647 | 0.0295222 | -0.0005 | 1.0785 |
| 7 | -0.0396869 | 0.0990329 | -0.0021 | 1.0764 |
| 8 | -0.1387198 | 0.1324675 | -0.0073 | 1.0690 |
| 9 | -0.2711873 | 0.0110888 | -0.0144 | 1.0547 |
| 10 | -0.2822761 | 0.0481722 | -0.0149 | 1.0397 |
| 11 | -0.3304483 | 0.0901170 | -0.0175 | 1.0223 |
| 12 | -0.4205653 | | -0.0223 | 1.0000 |

Rotated Factor Pattern

| Variable | Factor1 | Factor2 |
|---|---|---|
| return_BAC | 0.76122 | 0.21969 |
| return_BHI | 0.21664 | 0.79932 |
| return_CVX | 0.49806 | 0.57530 |
| return_DD | 0.59542 | 0.38748 |
| return_DOW | 0.56395 | 0.31884 |
| return_HAL | 0.24256 | 0.80907 |
| return_HES | 0.40289 | 0.59153 |
| return_HUN | 0.50588 | 0.29457 |
| return_JPM | 0.75054 | 0.22277 |
| return_SLB | 0.35223 | 0.79376 |
| return_WFC | 0.75994 | 0.27534 |
| return_XOM | 0.51113 | 0.55362 |

Variance Explained by Each Factor

| Factor | Weighted | Unweighted |
|---|---|---|
| Factor1 | 8.7156851 | 3.55022275 |
| Factor2 | 10.1803287 | 3.42320994 |

SAS suggests the same number of common factors regardless of whether the maximum likelihood method or the principal factor analysis is used.  Additionally, the differences between the two factor solutions are very minor, and the maximum likelihood method provides goodness-of-fit criteria that can be easily assessed and compared.

From above, the eigenvalues decrease quickly, reflecting that the factors after the first two are small and likely not important. The variance explained by each factor is balanced, which is also similar to before.  Additionally, the share of unweighted variance explained by each rotated factor changes negligibly, with factor 1 increasing from 50.07% to 50.91% and factor 2 decreasing from 49.93% to 49.09%.

## Step 5:  Maximum Likelihood Factor Analysis with Varimax Rotation and MAX argument for the PRIORS option

Unlike Step 4, using the MAX argument with the PRIORS option provides two different outputs – one with five factors and the second with four factors.

### Five Factors:

Combining the factor loadings with the sector information and highlighting the largest loadings on each factor, the grouping by sector becomes very apparent.  However, factor 5 has no clear meaning based on its loadings: it does not appear to group any of the variables, its highest loading is .2490, which is not significant, and its loadings are a mix of positive and negative values. Because it has no clearly high or low loadings, it is difficult to interpret.

The variance explained by each factor is:

Variance Explained by Each Factor

| Factor | Weighted | Unweighted |
|---|---|---|
| Factor1 | 9.48177257 | 2.55119512 |
| Factor2 | 6.95572063 | 2.08400430 |
| Factor3 | 5.26449075 | 1.82173920 |
| Factor4 | 5.80237050 | 1.59069819 |
| Factor5 | 0.31984016 | 0.12246466 |

The low weighted value of factor 5 is not surprising based on the above table, and thus factor 5 is likely not important.

For interest's sake, each of the factors and weightings can be viewed graphically:

As we can see, Factor 5 does not have a cluster of very high or low factor loadings making it difficult to interpret.

### Four Factors

Evaluating the factor pattern by industry, we again see a grouping similar to that of the five factors, except that we now have only four factors and the fifth (which was meaningless) has been eliminated.

As we would expect, the weightings are not significantly different, except that factor 5 has been eliminated.

Variance Explained by Each Factor

| Factor | Weighted | Unweighted |
|---|---|---|
| Factor1 | 9.39289839 | 2.57196785 |
| Factor2 | 6.69316710 | 2.05723775 |
| Factor3 | 5.10330088 | 1.83137584 |
| Factor4 | 5.71876627 | 1.56282992 |

Comparing the four factors to the five factors, the four-factor solution is significantly easier to interpret without factor 5.

For both the five-factor and the four-factor models, SAS reports that the convergence criterion has been satisfied, so both models are valid factor analyses.

Significance Tests Based on 501 Observations

| Test | DF | Chi-Square | Pr > ChiSq |
|---|---|---|---|
| H0: No common factors; HA: At least one common factor | 66 | 3656.2617 | <.0001 |
| H0: 4 Factors are sufficient; HA: More factors are needed | 24 | 21.2978 | 0.6211 |

Significance Tests Based on 501 Observations

| Test | DF | Chi-Square | Pr > ChiSq |
|---|---|---|---|
| H0: No common factors; HA: At least one common factor | 66 | 3656.2617 | <.0001 |
| H0: 5 Factors are sufficient; HA: More factors are needed | 16 | 10.9169 | 0.8146 |

Based on the above, we reject the null hypothesis of no common factors but fail to reject the null hypothesis that four (or five) factors are sufficient.
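The DF column in these tests follows from the number of variables and the number of factors. The sketch below uses the standard degrees-of-freedom formula for the likelihood ratio test of a k-factor model and reproduces the values reported for two to five factors; the function names are illustrative.

```python
def df_k_factors(n_variables, k):
    """Degrees of freedom for testing that k common factors are sufficient."""
    return ((n_variables - k) ** 2 - (n_variables + k)) // 2


def df_no_common_factors(n_variables):
    """Degrees of freedom for testing that there are no common factors."""
    return n_variables * (n_variables - 1) // 2


n_variables = 12  # the twelve stock return series

df_by_k = {k: df_k_factors(n_variables, k) for k in (2, 3, 4, 5)}
print(df_by_k)                            # {2: 43, 3: 33, 4: 24, 5: 16}
print(df_no_common_factors(n_variables))  # 66
```

Each additional factor uses up degrees of freedom, which is why the five-factor test has only 16 left.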

By forcing SAS to use only four factors, we have better loadings and increased interpretability. Thus, the four factors produce a valid factor analysis: the proportion of variance accounted for has been maximized with fewer factors, the variables loading on each factor share a conceptual meaning tied to the industry sector and, finally, the rotated factors reflect a simple structure.

This suggests that the common factor estimation is sensitive to the prior estimates of the communalities, and that having only a small amount of data increases the sensitivity of the estimates and weights (SAS/ETS(R) 13.1 User’s Guide. The ENTROPY Procedure, 10 Apr. 2014. Web. 14 May 2017).   Having additional data available for estimation would reduce this sensitivity.

### Three Factors

For comparison, forcing SAS to use three factors produces the following results:

Significance Tests Based on 501 Observations

| Test | DF | Chi-Square | Pr > ChiSq |
|---|---|---|---|
| H0: No common factors; HA: At least one common factor | 66 | 3656.2617 | <.0001 |
| H0: 3 Factors are sufficient; HA: More factors are needed | 33 | 155.9884 | <.0001 |

| Statistic | Value |
|---|---|
| Chi-Square without Bartlett’s Correction | 158.15 |
| Akaike’s Information Criterion | 92.1498 |
| Schwarz’s Bayesian Criterion | -46.9982 |
| Tucker and Lewis’s Reliability Coefficient | 0.93149 |

Based on the p-value, the null hypothesis is rejected, so three factors are not sufficient and more factors are needed.
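The sufficiency tests for two, three, four and five factors can be lined up to find the smallest model that is not rejected. The sketch below copies the reported p-values; the 0.05 significance level is an assumption, since the text does not state one, and the "<.0001" entries are represented by their upper bound.

```python
ALPHA = 0.05  # assumed significance level

# p-values for H0: "k factors are sufficient"; "<.0001" is entered as 0.0001.
p_values = {2: 0.0001, 3: 0.0001, 4: 0.6211, 5: 0.8146}

decisions = {
    k: "reject H0" if p < ALPHA else "fail to reject H0"
    for k, p in p_values.items()
}
smallest_sufficient = min(k for k, p in p_values.items() if p >= ALPHA)

for k in sorted(decisions):
    print(f"{k} factors: {decisions[k]}")
print("smallest number of factors not rejected:", smallest_sufficient)
```

Under that assumed level, four is the smallest number of factors for which sufficiency is not rejected, in line with the preference for the four-factor solution.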

# Conclusion

In conclusion, using a stock portfolio data set and factor analysis to identify sectors in the stock market, we transformed the closing prices into log-returns to explain the variation in the log-returns of the stocks and the market index.  Initially, a Principal Factor Analysis was conducted without a factor rotation, which resulted in SAS retaining two factors.  While the cumulative proportion of the eigenvalues exceeded 100%, which was not an error, scree and variance explained plots were provided to visualize the results, along with an initial factor pattern plot that was difficult to interpret due to the high loadings on factor 1, the mixed signs and the lack of identifiable clusters.  Characteristics of a simple factor structure were outlined to conclude that the unrotated principal factor solution did not exhibit a simple factor structure.

Next, a varimax rotation was applied and, while the number of factors selected by SAS remained the same at two, the loadings shifted and the rotated factor pattern was easier to interpret, with the clusters relating to the different sectors.  Additionally, a path diagram was provided for a different perspective on the explained variance and uniqueness.  A maximum likelihood estimation was also provided which, again, resulted in retaining two factors but with additional information on the goodness-of-fit statistics, including Akaike’s Information Criterion (AIC) and Schwarz’s Bayesian Criterion (BIC), that could be used as a basis of comparison.  When compared to the principal factor analysis with varimax rotation, the changes in the loadings and in the proportions of variance explained were negligible.

Finally, a maximum likelihood analysis with a varimax rotation was again conducted but with the ‘max’ argument for the ‘priors’ option in SAS.  This resulted in five-factor and four-factor solutions.  The five-factor solution was interesting, as Factor 5 did not have the typical loading values and, unlike the other factors, did not seem to relate to a sector.  For visual purposes, various plots of the different factors, including Factor 5, were provided to see how it weighted against the other factors.  The four-factor solution was similar to the five-factor solution but eliminated Factor 5, making it easy to match each of the four factors to one of four industry sectors.  The goodness-of-fit statistics were provided for both the five-factor and four-factor models for ease of comparison: from the macro perspective, the null hypothesis of no common factors was rejected, while from the micro perspective we failed to reject the null hypothesis that 4 or 5 factors are sufficient.  Finally, for comparison purposes, a three-factor model was provided, where the p-value of <.0001 led us to reject the null hypothesis that three factors are sufficient.

# References

Everitt, B.S. (2010). Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences.  CRC Press.  USA.