Factor Analysis to Identify Sectors

Introduction

Using a stock portfolio data set, we apply factor analysis to identify sectors in the stock market. The closing prices are first transformed into log-returns, and the analysis then explains the variation in the log-returns of the stocks and the market index.  We begin with a principal factor analysis, first without a factor rotation and then with a varimax rotation.  Next, we use maximum likelihood estimation to estimate the common factors, again with a varimax rotation.  Finally, we repeat the maximum likelihood analysis with a varimax rotation but select the ‘max’ argument for the ‘priors’ option in SAS, and we analyze and compare the outputs.

Step 1:  Data & Transformation of Data

The dataset stock_portfolio contains the daily closing prices of 12 stocks and of an index fund from Vanguard (VV) from Jan 3, 2012 to Dec 31, 2013.  New variables are created to put the series on a common scale by calculating the daily log-return of each stock, for example return_AA = log(AA/lag1(AA)), where lag1(AA) is the closing price on the previous trading day. The response or dependent variable, response_VV, is the log-return of the market index, whose variation we relate to the individual stock returns.  If we did not transform the stock prices, each stock would be measured at its closing dollar value, which would not provide a basis for comparing daily returns because the ‘measuring stick’ would vary by stock.
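As a minimal sketch of the transformation, the Python snippet below recomputes the first few BAC log-returns from the closing prices listed in the overview table. Python is used here only for illustration (the analysis itself was run in SAS), and the helper name log_returns is hypothetical.

```python
import math

def log_returns(prices):
    """Return log(p_t / p_{t-1}) for each day; the first day has no lag, so it is None."""
    returns = [None]
    for previous, current in zip(prices, prices[1:]):
        returns.append(math.log(current / previous))
    return returns

# BAC closing prices for 03JAN2012 to 09JAN2012, taken from the overview table.
bac = [5.8, 5.81, 6.31, 6.18, 6.27]
return_bac = log_returns(bac)

# Print in the same style as the SAS listing, with "." for the missing first value.
for value in return_bac:
    print("." if value is None else f"{value:.6f}")
```

The first observation is missing because there is no previous closing price, which matches the row of dots in the SAS listing.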

Using log-returns of each stock we will explain the variation in the log-returns of the market index using principal components analysis.

The results show 502 observations, 14 variables.  A quick overview:

Obs Date BAC BHI CVX DD VV return_BAC return_BHI return_CVX return_DD response_VV
1 03JAN2012 5.8 51.02 110.37 46.51 58.18 . . . . .
2 04JAN2012 5.81 51.53 110.18 47.02 58.25 0.001723 0.009946 -0.001723 0.010906 0.001202439
3 05JAN2012 6.31 50.82 109.1 46.7 58.44 0.082555 -0.013874 -0.009850 -0.006829 0.003256494
4 06JAN2012 6.18 51.26 108.31 46.04 58.32 -0.020817 0.008621 -0.007267 -0.014234 -.002055499
5 09JAN2012 6.27 51.58 109.49 46.43 58.45 0.014458 0.006223 0.010836 0.008435 0.002226600

A snippet of the log returns

Obs return_BAC return_BHI return_CVX return_DD return_DOW return_HAL return_HES return_HUN return_JPM
1 . . . . . . . . .
2 0.001723 0.009946 -0.001723 0.010906 0.005357 0.028008 0.010222 -0.008073 -0.000858
3 0.082555 -0.013874 -0.009850 -0.006829 0.006324 -0.016074 -0.024015 -0.005079 0.020672
4 -0.020817 0.008621 -0.007267 -0.014234 0.005954 0.012080 -0.020699 0.008114 -0.009009
5 0.014458 0.006223 0.010836 0.008435 -0.000330 0.011370 0.008472 -0.006079 -0.001698

Step 2:  Principal Factor Analysis without Rotation

Principal factor analysis is similar to principal components analysis, except that it does not work from the covariance matrix of the observed variables but from the reduced covariance matrix, which is the covariance matrix with its diagonal replaced by estimates of the communalities (that is, with the estimated specific variances subtracted from the diagonal).  Additionally, factor analysis does not try to account for all of the observed variance but only for the part that is shared through the common factors.  The focus in principal factor analysis is on accounting for the covariances or correlations between the manifest variables rather than their variances (Everitt, B.  Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences. P. 216).  Performing a principal factor analysis in SAS and letting SAS choose the number of factors to retain gives the following results.

Eigenvalues of the Reduced Correlation Matrix: Total = 6.86244298  Average = 0.57187025
      Eigenvalue  Difference  Proportion  Cumulative
 1    6.04732583  5.16261770      0.8812      0.8812
 2    0.88470813  0.52262870      0.1289      1.0101
 3    0.36207942  0.05735386      0.0528      1.0629
 4    0.30472556  0.29429115      0.0444      1.1073
 5    0.01043441  0.06365245      0.0015      1.1088
 6    -.05321803  0.01517115     -0.0078      1.1011
 7    -.06838918  0.03291807     -0.0100      1.0911
 8    -.10130725  0.01600696     -0.0148      1.0763
 9    -.11731422  0.00866270     -0.0171      1.0593
10    -.12597692  0.01040221     -0.0184      1.0409
11    -.13637913  0.00786652     -0.0199      1.0210
12    -.14424565                 -0.0210      1.0000

Based on the proportion criterion, two factors will be retained.

For reference, the prior communality estimates are the squared multiple correlations, which are usually less than one; the correlation matrix with these estimates on its diagonal is referred to as the reduced correlation matrix.  Generally, if the squared multiple correlations are large, one would expect the principal factor analysis and the principal component analysis to give similar results.

As we can see, the first two factors have a cumulative proportion of 1.0101, so together they account for all of the common variance.  This indicates that the variables within our model are all highly correlated with the other variables in the model.

[Figure: Factor Analysis Scree and Variance Explained]

Retaining only two factors, we can see from the scree plot that we have reached the elbow of the line, where it bends abruptly.  Additionally, the Variance Explained plot shows a cumulative variance explained that exceeds 100%. This seems unusual but is not an error: the reduced correlation matrix is not necessarily positive semidefinite, so some of its eigenvalues can be negative, as seen in the eigenvalue table above.  The cumulative proportion also flattens out after the first two factors, reflecting that more than two factors are not necessary to describe the common variance.

SAS usually keeps the components whose eigenvalue is greater than one, often referred to as the ‘eigenvalue-one criterion’.  However, the proportion of variance accounted for can also be evaluated, retaining any component that accounts for at least 5% or 10% of the total variance.  The proportion formula is:

Proportion = Eigenvalue for the component of interest / Total of the eigenvalues of the reduced correlation matrix.

Hence, our second component accounts for 0.884/6.86 = 12.89% and is retained.
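The Proportion and Cumulative columns of the eigenvalue table can be recomputed directly from the eigenvalues. The sketch below is plain Python for illustration only; the values are copied from the SAS output, and the variable names are chosen here rather than taken from SAS.

```python
# Eigenvalues of the reduced correlation matrix, copied from the SAS output.
eigenvalues = [6.04732583, 0.88470813, 0.36207942, 0.30472556, 0.01043441,
               -0.05321803, -0.06838918, -0.10130725, -0.11731422,
               -0.12597692, -0.13637913, -0.14424565]

total = sum(eigenvalues)
proportions = [value / total for value in eigenvalues]

# Running sum of the proportions; it can exceed 1 because some eigenvalues are negative.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

for number, (share, cum) in enumerate(zip(proportions, cumulative), start=1):
    print(f"{number:2d} {share:8.4f} {cum:8.4f}")
```

Because the negative eigenvalues pull the total below the sum of the positive ones, the cumulative proportion rises above 1 before returning to exactly 1 at the last eigenvalue.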

Factor Pattern
  Factor1 Factor2
return_BAC 0.68475 0.36021
return_BHI 0.69984 -0.39498
return_CVX 0.77402 -0.10833
return_DD 0.71605 0.16703
return_DOW 0.64548 0.19801
return_HAL 0.72630 -0.38221
return_HES 0.70361 -0.15709
return_HUN 0.58030 0.18186
return_JPM 0.67874 0.34813
return_SLB 0.79382 -0.30815
return_WFC 0.72445 0.30517
return_XOM 0.76500 -0.08361

[Figure: Unrotated Factor Pattern]

[Figure: Factor Analysis Path Diagram]

As shown in the initial Factor Pattern plot, the unrotated factors form one big cluster of variables, with some having positive Factor 2 loadings (quadrant 1) and others negative ones (quadrant 4).  Factor 1 accounts for 87.24% of the variance explained by the two factors and Factor 2 for 12.76%, making Factor 1 the dominant factor.  Ideally, the variance explained would be more balanced between the factors.  The above also reflects that every variable loads highly on Factor 1, with little variation between the loadings, while the loadings on Factor 2 are low.

Evaluating the factor solution, the largest loading on Factor 1 is return_SLB and the smallest is return_HUN; on Factor 2 the largest is return_BAC and the smallest is the negative loading of return_BHI.

The variance explained by each factor is as follows:

Variance Explained by Each Factor
Factor1 Factor2
6.0473258 0.8847081
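The variance explained by each factor can be recovered from the unrotated Factor Pattern table as the sum of the squared loadings in each column. The following Python sketch, with loadings copied from the SAS output and names chosen for illustration, also reproduces the 87.24% and 12.76% split between the two factors.

```python
# Unrotated factor pattern (Factor1, Factor2), copied from the SAS output.
loadings = {
    "return_BAC": (0.68475, 0.36021),
    "return_BHI": (0.69984, -0.39498),
    "return_CVX": (0.77402, -0.10833),
    "return_DD": (0.71605, 0.16703),
    "return_DOW": (0.64548, 0.19801),
    "return_HAL": (0.72630, -0.38221),
    "return_HES": (0.70361, -0.15709),
    "return_HUN": (0.58030, 0.18186),
    "return_JPM": (0.67874, 0.34813),
    "return_SLB": (0.79382, -0.30815),
    "return_WFC": (0.72445, 0.30517),
    "return_XOM": (0.76500, -0.08361),
}

# Variance explained by a factor = sum of the squared loadings in its column.
variance = [sum(row[k] ** 2 for row in loadings.values()) for k in range(2)]

# Share of the retained (two-factor) variance carried by each factor.
share = [v / sum(variance) for v in variance]

print([round(v, 4) for v in variance])
print([round(100 * s, 2) for s in share])
```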

As per R.J. Rummel (2017), a simple factor structure has a number of characteristics to evaluate, including:

  • Each variable is identified with one or a small proportion of the factors, so that each factor accounts for the variation of a distinct group of variables.
  • The number of variables that load highly on a factor is minimized; a rotation tries to define a small number of distinct clusters.
  • The model should be as simple as possible: following the principle of parsimony, if simpler factors can describe the data, with different factors involving different sets of variables, they should be preferred.
  • The goal is to generalize the factor results. The unrotated factor solution depends on all of the variables, whereas the rotated solution is adjusted so that the factors are invariant to the variables selected; that is, the factor solution will delineate the same clusters of relationships regardless of the extraneous variables included in the analysis.
  • In an orthogonal simple structure rotation, the more correlated the clusters are, the more difficult it is to rotate the factors so that the clusters can be discriminated. Thus, simple structure can only be approximated but not achieved.

Based on the above, the common factors do not exhibit a simple factor structure.

Step 3:  Varimax Rotation to the Principal Factor Analysis

One type of rotation is the varimax rotation, which is an orthogonal rotation, meaning that it results in uncorrelated factors.  Varimax rotation maximizes the variance of the squared loadings within each column, instead of each row, of the factor pattern.  Applying the varimax rotation to the principal factor analysis postmultiplies the loading pattern matrix by an orthogonal transformation matrix.  Two factors are again retained.  The output is as follows:

Orthogonal Transformation Matrix
  1 2
1 0.70781 0.70640
2 0.70640 -0.70781

Variance Explained by Each Factor
Factor1 Factor2
3.4711423 3.4608916

Rotated Factor Pattern
  Factor1 Factor2
return_BAC 0.73912 0.22875
return_BHI 0.21634 0.77394
return_CVX 0.47133 0.62344
return_DD 0.62482 0.38759
return_DOW 0.59675 0.31582
return_HAL 0.24408 0.78359
return_HES 0.38705 0.60822
return_HUN 0.53921 0.28120
return_JPM 0.72634 0.23305
return_SLB 0.34419 0.77886
return_WFC 0.72835 0.29575
return_XOM 0.48241 0.59958

Final Communality Estimates: Total = 6.932034
return_BAC 0.59863104
return_BHI 0.64577915
return_CVX 0.61083713
return_DD 0.54062043
return_DOW 0.45584934
return_HAL 0.67359085
return_HES 0.51974549
return_HUN 0.36982204
return_JPM 0.58188382
return_SLB 0.72509913
return_WFC 0.61795857
return_XOM 0.59221697

The rotated factor pattern can be interpreted as the factor loadings matrix, whose entries are the correlations between the variables and the factors; for example, return_BAC has a high correlation of .73912 with factor 1 and a low correlation of .22875 with factor 2.   As we can see from the rotated factor pattern, the transformation leads to large loadings for return_BAC, return_WFC, return_JPM and return_DD on the first factor.  Similarly, there are large loadings for return_HAL, return_SLB and return_BHI on factor 2.  Additionally, when compared to the loadings without a factor rotation, the values have shifted, as expected from the rotation.
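The link between the unrotated and rotated patterns can be checked by hand: each row of unrotated loadings, postmultiplied by the Orthogonal Transformation Matrix, gives the corresponding row of the Rotated Factor Pattern. The Python sketch below does this for three stocks; the function name rotate is a hypothetical helper, and the numbers are copied from the SAS output.

```python
# Orthogonal transformation matrix T from the SAS output.
T = [[0.70781, 0.70640],
     [0.70640, -0.70781]]

# Unrotated loadings (Factor1, Factor2) for three of the twelve stocks.
unrotated = {
    "return_BAC": (0.68475, 0.36021),
    "return_BHI": (0.69984, -0.39498),
    "return_HAL": (0.72630, -0.38221),
}

def rotate(row, t):
    """Postmultiply one row of loadings by the 2x2 transformation matrix."""
    return tuple(sum(row[i] * t[i][j] for i in range(2)) for j in range(2))

rotated = {name: rotate(row, T) for name, row in unrotated.items()}

# T is orthogonal, so T'T should be close to the identity matrix.
gram = [[sum(T[k][i] * T[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]

for name, (f1, f2) in rotated.items():
    print(f"{name} {f1:.5f} {f2:.5f}")
```

Small differences in the last decimal place are expected, since the inputs are rounded to five digits.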

[Figure: Factor Analysis]

The variables whose loadings changed the most in the rotation are return_BHI, return_HAL, return_HES, return_SLB, return_CVX and return_XOM.  As per the simple structure characteristics listed above, the rotation gives us a simple structure, as the stock clusters relate to industry sectors, with factor 1 containing a cluster of banking stocks and factor 2 a cluster of oil stocks.  Additionally, the rotated solution is significantly easier to interpret: all of the variables lie in the first quadrant, with positive loadings on both factors instead of a mix of positive and negative values, and the clusters correspond naturally to industry sectors.

The path diagram, as shown above, is a graphical representation of the observed variables (shown in rectangular boxes) and the unobserved factors (shown in ovals), along with the associated loadings.  The arrow above the rectangle of each observed variable reflects the uniqueness, which is the share of the variance in the stock return that is not explained by the factors; it is simple to calculate as 1 minus the explained variance.  For example, for return_HAL it is 1 - .67359 (from the Final Communality Estimates) = .33.  Factors 1 and 2 are independent, as represented by the double-sided arrow and the value of 1.  As you can see, not every factor has an arrow to every observed variable.
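The communality and uniqueness figures can be reproduced from the rotated loadings: the communality of a stock is the sum of its squared loadings across the factors, and the uniqueness is one minus that. A small Python sketch, using three rows of the Rotated Factor Pattern and illustrative variable names:

```python
# Rotated loadings (Factor1, Factor2) for three stocks, from the Rotated Factor Pattern.
rotated = {
    "return_HAL": (0.24408, 0.78359),
    "return_BAC": (0.73912, 0.22875),
    "return_HUN": (0.53921, 0.28120),
}

# Communality: variance of the stock return explained by the common factors.
communality = {name: sum(l ** 2 for l in row) for name, row in rotated.items()}

# Uniqueness: the remaining, stock-specific share of the variance.
uniqueness = {name: 1.0 - h2 for name, h2 in communality.items()}

for name in rotated:
    print(f"{name} communality={communality[name]:.5f} uniqueness={uniqueness[name]:.2f}")
```

An orthogonal rotation leaves the communalities unchanged, so the same values would result from the unrotated loadings.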

Step 4:  Maximum Likelihood Factor Analysis with Varimax Rotation

The maximum likelihood approach estimates the factor loadings and specific variances by maximizing the likelihood of the observed data under the assumption of multivariate normality. The estimates are refined iteratively until a convergence criterion is met, and the resulting solution is then rotated.  Using maximum likelihood estimation to estimate the common factors with the varimax rotation, the same number of factors as in the principal factor analysis is retained: two.  Similar to before, the prior communality estimates are:

Prior Communality Estimates: SMC
return_BAC return_BHI return_CVX return_DD return_DOW return_HAL return_HES return_HUN return_JPM return_SLB return_WFC return_XOM
0.58577906 0.61046627 0.64179539 0.54681402 0.47197670 0.64986770 0.49976057 0.39225125 0.58034671 0.68269067 0.57372531 0.62696933

The eigenvalues are:

Preliminary Eigenvalues: Total = 16.9350893  Average = 1.41125745
      Eigenvalue  Difference  Proportion  Cumulative
 1    14.9446192  12.7338755      0.8825      0.8825
 2     2.2107436   1.3691513      0.1305      1.0130
 3     0.8415924   0.1303533      0.0497      1.0627
 4     0.7112391   0.6898126      0.0420      1.1047
 5     0.0214265   0.1469738      0.0013      1.1060
 6    -0.1255473   0.0709849     -0.0074      1.0986
 7    -0.1965322   0.0239638     -0.0116      1.0869
 8    -0.2204960   0.0345938     -0.0130      1.0739
 9    -0.2550898   0.0286557     -0.0151      1.0589
10    -0.2837455   0.0587253     -0.0168      1.0421
11    -0.3424708   0.0281790     -0.0202      1.0219
12    -0.3706498                 -0.0219      1.0000

Goodness-of-Fit Statistics

The ML factor analysis provides a lot more statistical information on the null and alternative hypotheses than the principal factor analysis.  Unlike the principal factor analysis with varimax rotation, the maximum likelihood factor analysis with varimax rotation provides hypothesis tests from both a macro perspective (are there any common factors?) and a micro perspective (is the chosen number of factors sufficient?), along with goodness-of-fit statistics:

Significance Tests Based on 501 Observations
Test DF Chi-Square Pr > ChiSq
H0: No common factors 66 3656.2617 <.0001
HA: At least one common factor
H0: 2 Factors are sufficient 43 319.3192 <.0001
HA: More factors are needed

Chi-Square without Bartlett’s Correction 323.30664
Akaike’s Information Criterion 237.30664
Schwarz’s Bayesian Criterion 55.99257
Tucker and Lewis’s Reliability Coefficient 0.88187

Squared Canonical Correlations
Factor1 Factor2
0.94176593 0.73146692

As reflected above, we reject both null hypotheses: there is at least one common factor, and the test suggests that more than two factors are needed.  Additionally, we can view Akaike’s Information Criterion (AIC) and Schwarz’s Bayesian Criterion (BIC), which can be compared across models, with the model with the lowest value being preferred (both AIC and BIC penalize the model based on the number of parameters in the model).
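The reported AIC and Schwarz values are consistent with penalizing the uncorrected chi-square by the degrees of freedom of the test. The Python sketch below states that relationship as an assumption (it is inferred from the numbers in the output rather than quoted from the SAS documentation) and applies it to the two-factor model, and to the three-factor model reported later in this article. The function name fit_criteria is hypothetical.

```python
import math

def fit_criteria(chi_square, df, n_obs):
    """AIC and Schwarz criterion, assumed to be chi-square minus a penalty on df."""
    aic = chi_square - 2 * df
    sbc = chi_square - df * math.log(n_obs)
    return aic, sbc

# Two-factor model: chi-square without Bartlett's correction, DF and N from the output.
aic_2, sbc_2 = fit_criteria(323.30664, 43, 501)

# Three-factor model, using the values reported in the Three Factors section.
aic_3, sbc_3 = fit_criteria(158.14977, 33, 501)

print(round(aic_2, 5), round(sbc_2, 5))
print(round(aic_3, 5), round(sbc_3, 5))
```

Both criteria are lower for the three-factor model, which agrees with the test's indication that two factors are not sufficient.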

Eigenvalues of the Weighted Reduced Correlation Matrix: Total = 18.8960127  Average = 1.57466772
      Eigenvalue  Difference  Proportion  Cumulative
 1    16.1720778  13.4481419      0.8558      0.8558
 2     2.7239360   1.9046476      0.1442      1.0000
 3     0.8192884   0.2494707      0.0434      1.0434
 4     0.5698176   0.4658764      0.0302      1.0735
 5     0.1039412   0.1141060      0.0055      1.0790
 6    -0.0101647   0.0295222     -0.0005      1.0785
 7    -0.0396869   0.0990329     -0.0021      1.0764
 8    -0.1387198   0.1324675     -0.0073      1.0690
 9    -0.2711873   0.0110888     -0.0144      1.0547
10    -0.2822761   0.0481722     -0.0149      1.0397
11    -0.3304483   0.0901170     -0.0175      1.0223
12    -0.4205653                 -0.0223      1.0000

Rotated Factor Pattern
  Factor1 Factor2
return_BAC 0.76122 0.21969
return_BHI 0.21664 0.79932
return_CVX 0.49806 0.57530
return_DD 0.59542 0.38748
return_DOW 0.56395 0.31884
return_HAL 0.24256 0.80907
return_HES 0.40289 0.59153
return_HUN 0.50588 0.29457
return_JPM 0.75054 0.22277
return_SLB 0.35223 0.79376
return_WFC 0.75994 0.27534
return_XOM 0.51113 0.55362

Variance Explained by Each Factor
Factor Weighted Unweighted
Factor1 8.7156851 3.55022275
Factor2 10.1803287 3.42320994

The same number of common factors is suggested by SAS regardless of whether the maximum likelihood method or the principal factor analysis is used.  Additionally, the differences between the two factor solutions are very minor, and the maximum likelihood output adds goodness-of-fit criteria that can be easily assessed and compared.

From above, the eigenvalues decrease quickly, reflecting that the factors after the first two are small and likely not important. The variance explained by each factor is balanced, which is also similar to before.  Additionally, as seen above, the share of the unweighted variance explained by each rotated factor changes negligibly, with factor 1 increasing from 50.07% to 50.91% and factor 2 decreasing from 49.93% to 49.09%.

Step 5:  Maximum Likelihood Factor Analysis with Varimax Rotation and MAX argument for the PRIORS option

Unlike Step 4, using the MAX argument with the PRIORS option provides two different outputs – one with five factors and the second with four factors.

Five Factors:

Combining the factor loadings with the sector information, and highlighting the largest loadings on each factor, makes the grouping by sector very apparent.  However, factor 5 has no clear meaning based on its loadings: it does not appear to group any of the stocks, its highest loading is .2490, which is small, and it shows a mix of positive and negative values. With no particularly high or low factor loadings, it is difficult to interpret.

[Figure: Factor Analysis, Five Factors]

The variance explained by each factor is:

Variance Explained by Each Factor
Factor Weighted Unweighted
Factor1 9.48177257 2.55119512
Factor2 6.95572063 2.08400430
Factor3 5.26449075 1.82173920
Factor4 5.80237050 1.59069819
Factor5 0.31984016 0.12246466

The low weighted value of factor 5 is not surprising based on the above table, and the factor is thus not likely to be important.

For interest's sake, each of the factors and weightings can be viewed graphically:

As we can see, Factor 5 does not have a cluster of very high or low factor loadings making it difficult to interpret.

Four Factors

Evaluating the four-factor pattern by industry, we again see groupings similar to those of the five-factor solution, except that the fifth factor (which was meaningless) has been eliminated.

[Figure: Factor Analysis, Four Factors]

As we would expect, the weightings are not significantly different, except that factor 5 has been eliminated.

Variance Explained by Each Factor
Factor Weighted Unweighted
Factor1 9.39289839 2.57196785
Factor2 6.69316710 2.05723775
Factor3 5.10330088 1.83137584
Factor4 5.71876627 1.56282992

Comparing the four factors to the five factors, the four factors are significantly easier to interpret without factor 5.

For both the five-factor and the four-factor models, SAS reports that the convergence criterion has been satisfied, so both are valid factor analyses.

Significance Tests Based on 501 Observations
Test DF Chi-Square Pr > ChiSq
H0: No common factors 66 3656.2617 <.0001
HA: At least one common factor
H0: 4 Factors are sufficient 24 21.2978 0.6211
HA: More factors are needed

Significance Tests Based on 501 Observations
Test DF Chi-Square Pr > ChiSq
H0: No common factors 66 3656.2617 <.0001
HA: At least one common factor
H0: 5 Factors are sufficient 16 10.9169 0.8146
HA: More factors are needed

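Based on the above, we reject the null hypothesis of no common factors in both models, but we fail to reject the null hypotheses that four factors (p = 0.6211) and five factors (p = 0.8146) are sufficient.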
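The DF column in these tests follows the usual degrees-of-freedom count for a maximum likelihood factor model with p variables and q factors, ((p - q)^2 - (p + q)) / 2. The Python sketch below applies it to the 12 stock returns; the function name ml_factor_df is chosen for illustration, and the formula is the standard textbook expression rather than something quoted from the SAS output.

```python
def ml_factor_df(n_variables, n_factors):
    """Degrees of freedom of the test that n_factors common factors are sufficient."""
    p, q = n_variables, n_factors
    # (p - q)^2 and (p + q) always have the same parity, so the division is exact.
    return ((p - q) ** 2 - (p + q)) // 2

# 12 stock returns; q = 0 corresponds to the "no common factors" test.
df_by_factors = {q: ml_factor_df(12, q) for q in (0, 2, 3, 4, 5)}
print(df_by_factors)
```

The results match the DF values of 66, 43, 33, 24 and 16 shown in the significance tests for zero, two, three, four and five factors.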

By forcing SAS to use only four factors, we have better loadings and increased interpretability. Thus, the four factors produce a valid factor analysis: the proportion of variance accounted for has been maximized with fewer factors, the variables loading on each factor share a conceptual meaning tied to the industry sector and, finally, the rotated factors reflect a simple structure.

This suggests that the common factor estimation is sensitive to the prior estimates of the communalities, and that having only a small amount of data increases the sensitivity of the estimates and weights (SAS/ETS(R) 13.1 User’s Guide. The ENTROPY Procedure, 10 Apr. 2014. Web. 14 May 2017).   Having additional data available for estimation would reduce this sensitivity.

Three Factors

For comparison, forcing SAS to use three factors produces the following results:

Significance Tests Based on 501 Observations
Test DF Chi-Square Pr > ChiSq
H0: No common factors 66 3656.2617 <.0001
HA: At least one common factor
H0: 3 Factors are sufficient 33 155.9884 <.0001
HA: More factors are needed

Chi-Square without Bartlett’s Correction 158.14977
Akaike’s Information Criterion 92.14977
Schwarz’s Bayesian Criterion -46.99823
Tucker and Lewis’s Reliability Coefficient 0.93149

Based on the p-value, the null hypothesis is rejected, which means that three factors are not sufficient.

Conclusion

In conclusion, using a stock portfolio data set and factor analysis to identify sectors in the stock market, we transformed the closing prices into log-returns to explain the variation in the returns of the stocks and the market index.  Initially, a principal factor analysis was conducted without a factor rotation, which resulted in SAS retaining two factors.  The cumulative proportion of variance explained exceeded 100%, which was not an error, and a scree plot and a variance explained plot were provided to visualize the results, along with an initial factor pattern plot that was difficult to interpret due to the dominance of factor 1, the mixed signs of the loadings and the lack of identifiable clusters.  Characteristics of a simple factor structure were outlined to conclude that the unrotated principal factor analysis did not have a simple factor structure.

Next, a varimax rotation was applied and, while the number of factors selected by SAS remained the same at two, the loadings shifted and the rotated factor pattern was easier to interpret, with the clusters relating to the different sectors.  Additionally, a path diagram was provided for a different perspective on the explained variance and uniqueness.  A maximum likelihood estimation was also performed which, again, resulted in retaining two factors but with additional goodness-of-fit statistics, including Akaike’s Information Criterion (AIC) and Schwarz’s Bayesian Criterion (BIC), that could be used as a basis of comparison.  When compared to the principal factor analysis with varimax rotation, the changes in the loadings and in the variance explained by each factor were negligible.

Finally, a maximum likelihood analysis with a varimax rotation was again conducted, but with the ‘max’ argument for the ‘priors’ option in SAS.  This resulted in five-factor and four-factor solutions.  The five-factor solution was interesting because Factor 5 did not have the typical loading values and, unlike the other factors, did not seem to relate to a sector.  For visual purposes, various plots were provided for the different factors, including Factor 5, to see how it compared with the other factors.  The four-factor solution was similar to the five-factor solution but eliminated Factor 5, making it easy to match each of the four factors to an industry sector.  The goodness-of-fit statistics were provided for both solutions for ease of comparison: from the macro perspective, the null hypothesis of no common factors was rejected, while from the micro perspective we failed to reject the null hypotheses that 4 or 5 factors are sufficient.  Finally, for comparison purposes, a three-factor solution was provided, where the p-value of <.0001 meant that the null hypothesis that three factors are sufficient was rejected.

 

References

Everitt, B.S. (2010). Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences.  CRC Press.  USA.

Rummel, R.J. (2017).  Understanding Factor Analysis.  Retrieved from https://www.hawaii.edu/powerkills/UFA.HTM

SAS/ETS(R) 13.1 User’s Guide. The ENTROPY Procedure, 10 Apr. 2014. Web. 14 May 2017.  Retrieved from:  http://support.sas.com/documentation/cdl/en/etsug/66840/HTML/default/viewer.htm#etsug_entropy_gettingstarted03.htm
