Factor Analysis to Identify Sectors
Factor Analysis
Introduction
Using a stock portfolio data set, we apply factor analysis to identify sectors in the stock market. We first transform the closing prices into log-returns so that we can explain the variation in the log-returns of the stocks and the market index. We begin with a Principal Factor Analysis without a factor rotation and then apply a varimax rotation. Next, we use maximum likelihood estimation to estimate the common factors, again with a varimax rotation. Finally, we repeat the maximum likelihood analysis with a varimax rotation but select the ‘max’ argument for the ‘priors’ option in SAS, and then analyze and compare the outputs.
Step 1: Data & Transformation of Data
The dataset, stock_portfolio, contains the daily closing prices for 12 stocks and a Vanguard index fund (VV) from Jan 3, 2012 to Dec 31, 2013. New variables are created to normalize the data by calculating the daily log-return for each stock, for example return_AA = log(AA/lag1(AA)), where lag1 returns the closing price from the previous trading day. The response or dependent variable, response_VV, is the log-return of the market index, whose variation we want to explain with the individual stock returns. If we did not transform the stock prices, the stocks would be at their closing dollar values, which would not provide a basis for comparing each stock's daily return as the ‘measuring stick’ would vary by stock.
Using log-returns of each stock we will explain the variation in the log-returns of the market index using principal components analysis.
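The log-return transformation can be sketched in a few lines of Python. Python is our choice for illustration only (the analysis itself was run in SAS), and the helper name log_returns is hypothetical. The prices are the first five BAC closing prices listed in the data overview:

```python
import math

# First five BAC closing prices (03JAN2012 to 09JAN2012) from the data overview.
bac_close = [5.80, 5.81, 6.31, 6.18, 6.27]


def log_returns(prices):
    """Return log(P_t / P_{t-1}) for every day after the first.

    The first day has no previous price, which is why the SAS output
    shows a missing value (.) for observation 1.
    """
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


return_bac = log_returns(bac_close)
for value in return_bac:
    print(f"{value:.6f}")
```

The printed values can be compared against the return_BAC column for observations 2 to 5.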
The results show 502 observations, 14 variables. A quick overview:
Obs | Date | BAC | BHI | CVX | DD | VV | return_BAC | return_BHI | return_CVX | return_DD | response_VV |
1 | 03JAN2012 | 5.8 | 51.02 | 110.37 | 46.51 | 58.18 | . | . | . | . | . |
2 | 04JAN2012 | 5.81 | 51.53 | 110.18 | 47.02 | 58.25 | 0.001723 | 0.009946 | -0.001723 | 0.010906 | 0.001202439 |
3 | 05JAN2012 | 6.31 | 50.82 | 109.1 | 46.7 | 58.44 | 0.082555 | -0.013874 | -0.009850 | -0.006829 | 0.003256494 |
4 | 06JAN2012 | 6.18 | 51.26 | 108.31 | 46.04 | 58.32 | -0.020817 | 0.008621 | -0.007267 | -0.014234 | -.002055499 |
5 | 09JAN2012 | 6.27 | 51.58 | 109.49 | 46.43 | 58.45 | 0.014458 | 0.006223 | 0.010836 | 0.008435 | 0.002226600 |
A snippet of the log returns
Obs | return_BAC | return_BHI | return_CVX | return_DD | return_DOW | return_HAL | return_HES | return_HUN | return_JPM |
1 | . | . | . | . | . | . | . | . | . |
2 | 0.001723 | 0.009946 | -0.001723 | 0.010906 | 0.005357 | 0.028008 | 0.010222 | -0.008073 | -0.000858 |
3 | 0.082555 | -0.013874 | -0.009850 | -0.006829 | 0.006324 | -0.016074 | -0.024015 | -0.005079 | 0.020672 |
4 | -0.020817 | 0.008621 | -0.007267 | -0.014234 | 0.005954 | 0.012080 | -0.020699 | 0.008114 | -0.009009 |
5 | 0.014458 | 0.006223 | 0.010836 | 0.008435 | -0.000330 | 0.011370 | 0.008472 | -0.006079 | -0.001698 |
Step 2: Principal Factor Analysis without Rotation
Principal factor analysis is similar to principal components analysis except that it does not work from the covariance matrix of the observed variables but from the reduced covariance matrix, which is the covariance matrix minus a diagonal matrix whose entries are estimates of the specific variances, so that its diagonal holds the estimated communalities. Additionally, factor analysis does not try to account for all of the observed variance but only the part that is shared through the common factors. The focus in principal factor analysis is on accounting for the covariances or correlations between the manifest variables rather than their variances (Everitt, B. Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences. P. 216). We perform a Principal Factor Analysis in SAS and let SAS choose the number of factors to retain automatically.
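A minimal Python sketch of the idea, assuming NumPy is available and using synthetic data as a stand-in for the 12 return series. It is not the SAS procedure used in this report, and the function name principal_factor_eigen is ours. It places the squared multiple correlations on the diagonal of the correlation matrix and extracts the eigenvalues of the resulting reduced matrix:

```python
import numpy as np


def principal_factor_eigen(data):
    """Return (SMC priors, eigenvalues of the reduced correlation matrix)."""
    corr = np.corrcoef(data, rowvar=False)
    # Squared multiple correlation of each variable with all the others.
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(corr))
    reduced = corr.copy()
    np.fill_diagonal(reduced, smc)  # communality estimates replace the 1s
    eigvals = np.linalg.eigvalsh(reduced)[::-1]  # largest first
    return smc, eigvals


# Synthetic stand-in data: 6 variables driven by 2 common factors plus noise.
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 2))
weights = rng.uniform(0.4, 0.8, size=(2, 6))
data = common @ weights + 0.5 * rng.normal(size=(500, 6))

smc, eigvals = principal_factor_eigen(data)
proportion = eigvals / eigvals.sum()
print("SMC priors:  ", np.round(smc, 4))
print("Eigenvalues: ", np.round(eigvals, 4))
print("Cumulative:  ", np.round(np.cumsum(proportion), 4))
```

Because the diagonal entries are below one, the reduced matrix need not be positive definite, so some of the smaller eigenvalues can be negative and the cumulative proportion can pass 1 before coming back to it.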
Eigenvalues of the Reduced Correlation Matrix: Total = 6.86244298 Average = 0.57187025
No. | Eigenvalue | Difference | Proportion | Cumulative |
1 | 6.04732583 | 5.16261770 | 0.8812 | 0.8812 |
2 | 0.88470813 | 0.52262870 | 0.1289 | 1.0101 |
3 | 0.36207942 | 0.05735386 | 0.0528 | 1.0629 |
4 | 0.30472556 | 0.29429115 | 0.0444 | 1.1073 |
5 | 0.01043441 | 0.06365245 | 0.0015 | 1.1088 |
6 | -.05321803 | 0.01517115 | -0.0078 | 1.1011 |
7 | -.06838918 | 0.03291807 | -0.0100 | 1.0911 |
8 | -.10130725 | 0.01600696 | -0.0148 | 1.0763 |
9 | -.11731422 | 0.00866270 | -0.0171 | 1.0593 |
10 | -.12597692 | 0.01040221 | -0.0184 | 1.0409 |
11 | -.13637913 | 0.00786652 | -0.0199 | 1.0210 |
12 | -.14424565 | | -0.0210 | 1.0000 |
Based on the proportion criterion and the dataset, two factors will be retained.
For reference, the prior communality estimates are the squared multiple correlations, which are usually less than one; the correlation matrix with these estimates on its diagonal is referred to as the reduced correlation matrix. Generally, if the squared multiple correlations are large, one would expect the principal factor analysis and the principal component analysis to give similar results.
As we can see, the first two eigenvalues have a cumulative proportion of 1.0101, meaning the first two factors account for slightly more than 100% of the common variance. This indicates that the variables within our model are all highly correlated with the other variables within the model.
Retaining only two factors, we can see in the Scree plot that we have reached the elbow of the line, where it bends abruptly. Additionally, the Variance Explained plot shows a cumulative variance explained that exceeds 100%, which seems unusual but is not an error because the reduced correlation matrix is not necessarily a positive definite matrix; as a result, some of its eigenvalues can be negative. The cumulative proportion also flattens out after the first two factors, reflecting that more than two factors are not necessary as two factors are sufficient to support our hypothesis.
SAS usually chooses to keep the components whose eigenvalue is greater than one, often referred to as the ‘eigenvalue-one criterion’. However, SAS also evaluates the proportion of variance accounted for, and a component is retained if it accounts for at least 5% or 10% of the total variance. The proportion formula is:
Proportion = Eigenvalue for the component of interest / Total of the eigenvalues of the reduced correlation matrix. Hence, our second component is 0.8847/6.8624 = 12.89% and is retained.
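The Proportion and Cumulative columns can be recomputed from the Eigenvalue column alone. The short Python sketch below (for illustration only; the analysis itself was run in SAS) uses the twelve eigenvalues from the table above:

```python
# Eigenvalues of the reduced correlation matrix, copied from the table above.
eigenvalues = [
    6.04732583, 0.88470813, 0.36207942, 0.30472556,
    0.01043441, -0.05321803, -0.06838918, -0.10130725,
    -0.11731422, -0.12597692, -0.13637913, -0.14424565,
]

total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

# Running sum of the proportions gives the Cumulative column.
cumulative = []
running = 0.0
for prop in proportions:
    running += prop
    cumulative.append(running)

print(f"Total = {total:.8f}")
for i in range(3):
    print(f"Factor {i + 1}: proportion {proportions[i]:.4f}, "
          f"cumulative {cumulative[i]:.4f}")
```

The negative eigenvalues at the bottom of the table are what pull the cumulative proportion back down to 1.0000 after it has exceeded 1.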
Factor Pattern
Variable | Factor1 | Factor2 |
return_BAC | 0.68475 | 0.36021 |
return_BHI | 0.69984 | -0.39498 |
return_CVX | 0.77402 | -0.10833 |
return_DD | 0.71605 | 0.16703 |
return_DOW | 0.64548 | 0.19801 |
return_HAL | 0.72630 | -0.38221 |
return_HES | 0.70361 | -0.15709 |
return_HUN | 0.58030 | 0.18186 |
return_JPM | 0.67874 | 0.34813 |
return_SLB | 0.79382 | -0.30815 |
return_WFC | 0.72445 | 0.30517 |
return_XOM | 0.76500 | -0.08361 |
Unrotated Factor Pattern
As shown in the initial Factor Pattern plot, the unrotated factors form one big cluster of variables, with some loading positively on factor 2 (quadrant 1) and others loading negatively (quadrant 4). Factor 1 accounts for 87.24% and factor 2 for 12.76% of the variance explained by the two retained factors, making factor 1 the dominant factor. Ideally, the variance explained would be balanced across the factors. The above also reflects that every variable loads highly on factor 1, with little variation among the loadings, while the loadings on factor 2 are low.
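The 87.24% and 12.76% shares can be reproduced from the Factor Pattern table, since the variance explained by a factor is the sum of the squared loadings in its column. A short Python sketch (for illustration only; the analysis itself was run in SAS):

```python
# Unrotated Factor Pattern (Factor1, Factor2) copied from the table above.
loadings = {
    "return_BAC": (0.68475, 0.36021),
    "return_BHI": (0.69984, -0.39498),
    "return_CVX": (0.77402, -0.10833),
    "return_DD": (0.71605, 0.16703),
    "return_DOW": (0.64548, 0.19801),
    "return_HAL": (0.72630, -0.38221),
    "return_HES": (0.70361, -0.15709),
    "return_HUN": (0.58030, 0.18186),
    "return_JPM": (0.67874, 0.34813),
    "return_SLB": (0.79382, -0.30815),
    "return_WFC": (0.72445, 0.30517),
    "return_XOM": (0.76500, -0.08361),
}

# Variance explained by a factor = sum of squared loadings in its column.
var_f1 = sum(f1 ** 2 for f1, _ in loadings.values())
var_f2 = sum(f2 ** 2 for _, f2 in loadings.values())
total = var_f1 + var_f2

# Communality of a variable = sum of squared loadings in its row.
communality = {name: f1 ** 2 + f2 ** 2 for name, (f1, f2) in loadings.items()}

print(f"Factor1: {var_f1:.4f} ({var_f1 / total:.2%})")
print(f"Factor2: {var_f2:.4f} ({var_f2 / total:.2%})")
print(f"return_BAC communality: {communality['return_BAC']:.4f}")
```

The two column sums also equal the first two eigenvalues of the reduced correlation matrix, which is why the same figures appear in both tables.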
Factor Pattern
The variance explained by each factor is as follows:
Variance Explained by Each Factor
Factor1 | Factor2 |
6.0473258 | 0.8847081 |
Evaluating the factor solution, on factor 1 the largest loading is return_SLB (0.79382) and the smallest is return_HUN (0.58030); on factor 2 the largest is return_BAC (0.36021) and the smallest is the negative loading on return_BHI (-0.39498).
As per R.J. Rummel (2017), a simple factor structure has a number of characteristics to evaluate, including:
- Each variable is identified with one or a small proportion of the factors, so that the factors account for the variation of distinct groups of variables.
- The number of variables that have a high loading on a factor is minimized. Rotation tries to define a small number of distinct clusters of variables.
- The model should be simplified: if simpler factors can be used, the principle of parsimony favors them, with each factor grouping a different set of variables.
- The goal is to generalize the factor results. The unrotated factor solution depends on all of the variables, whereas the rotated solution is adjusted so that the factors are invariant of the variables selected – that is, the factor solution will delineate the same clusters of relationships regardless of the extraneous variables included in the analysis.
- In an orthogonal simple structure rotation, the more correlated the clusters are, the more difficult it is to rotate the factors so that the clusters can be discriminated. Thus, simple structure can only be approximated but not achieved.
Based on the above, the common factors do not exhibit a simple factor structure.
Step 3: Varimax Rotation to the Principal Factor Analysis
One type of rotation is the varimax rotation, which is an orthogonal rotation, meaning it results in uncorrelated factors. Varimax rotation tends to maximize the variance of the squared loadings within each column, instead of each row, of the factor pattern. Applying the varimax rotation to the principal factor analysis post-multiplies the unrotated loading matrix by an orthogonal transformation matrix. Two factors are again retained. The output is as follows:
Final Communality Estimates: Total = 6.932034
return_BAC | return_BHI | return_CVX | return_DD | return_DOW | return_HAL | return_HES |
0.59863104 | 0.64577915 | 0.61083713 | 0.54062043 | 0.45584934 | 0.67359085 | 0.51974549 |
return_HUN | return_JPM | return_SLB | return_WFC | return_XOM |
0.36982204 | 0.58188382 | 0.72509913 | 0.61795857 | 0.59221697 |
The rotated factor pattern can be interpreted as a factor loadings matrix of correlations; for example, return_BAC has a high correlation of .73912 with factor 1 and a low correlation of .22875 with factor 2. As we can see in the rotated factor pattern, the transformation leads to large loadings for return_WFC, return_JPM, return_BAC and return_DD on the first factor and, similarly, large loadings for return_HAL, return_SLB and return_BHI on factor 2. Additionally, when compared to the correlations without a factor rotation, the values have shifted, as expected from the rotation.
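For two factors, a varimax rotation reduces to choosing a single rotation angle, which has a closed form. The Python sketch below is our own illustration, not the SAS implementation: it assumes Kaiser (row) normalization, takes the unrotated loadings from the Step 2 Factor Pattern table, and uses hypothetical helper names (varimax_angle, rotate). If the assumptions hold, the rotated return_BAC loadings should land close to the .73912 and .22875 quoted above, up to the order and sign of the factors.

```python
import math

# Unrotated Factor Pattern (Factor1, Factor2) from the Step 2 table.
loadings = {
    "return_BAC": (0.68475, 0.36021),
    "return_BHI": (0.69984, -0.39498),
    "return_CVX": (0.77402, -0.10833),
    "return_DD": (0.71605, 0.16703),
    "return_DOW": (0.64548, 0.19801),
    "return_HAL": (0.72630, -0.38221),
    "return_HES": (0.70361, -0.15709),
    "return_HUN": (0.58030, 0.18186),
    "return_JPM": (0.67874, 0.34813),
    "return_SLB": (0.79382, -0.30815),
    "return_WFC": (0.72445, 0.30517),
    "return_XOM": (0.76500, -0.08361),
}


def varimax_angle(pattern):
    """Closed-form varimax angle (radians) for a two-factor pattern."""
    p = len(pattern)
    a = b = c = d = 0.0
    for x, y in pattern:
        h = math.hypot(x, y)       # square root of the communality
        xn, yn = x / h, y / h      # Kaiser-normalized row (assumption)
        u, v = xn * xn - yn * yn, 2.0 * xn * yn
        a += u
        b += v
        c += u * u - v * v
        d += 2.0 * u * v
    return 0.25 * math.atan2(d - 2.0 * a * b / p, c - (a * a - b * b) / p)


def rotate(pattern, phi):
    """Post-multiply the pattern by the orthogonal matrix for angle phi."""
    cos_p, sin_p = math.cos(phi), math.sin(phi)
    return [(x * cos_p + y * sin_p, -x * sin_p + y * cos_p) for x, y in pattern]


pattern = list(loadings.values())
phi = varimax_angle(pattern)
rotated = rotate(pattern, phi)

# The sign of a factor is arbitrary; flip any column whose loadings sum is negative.
signs = [1.0 if sum(col) >= 0 else -1.0 for col in zip(*rotated)]
rotated = [(signs[0] * x, signs[1] * y) for x, y in rotated]

print(f"rotation angle: {math.degrees(phi):.2f} degrees")
for name, (f1, f2) in zip(loadings, rotated):
    print(f"{name:11s} {f1:8.5f} {f2:8.5f}")
```

An orthogonal rotation leaves each variable's communality unchanged, so only the split of that communality between the two factors moves.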
The variables whose loadings changed most in the rotation are return_BHI, return_HAL, return_HES, return_SLB, return_CVX and return_XOM. As per the simple structure characteristics listed above, the rotation gives us a simple structure, as the stock clusters relate to the industry sector, with factor 1 containing a cluster of banking stocks and factor 2 a cluster of oil stocks. Additionally, the rotated solution is significantly easier to interpret, with all of the variables in the first quadrant with positive values instead of a mix of positive and negative values and, naturally, the clustering by industry sector.
The path diagram, as shown above, is a graphical representation of the observed variables (shown in rectangular boxes) and the unobserved variables (shown in ovals) along with the associated loadings. The arrow above the rectangle of each observed variable reflects the uniqueness, which is the amount of variance in the stock that is not explained; it is simple to calculate as 1 minus the explained variance. For example, for return_HAL it would be 1 - .67359 (from the Final Communality Estimates) = .33. Factors 1 and 2 are independent, as represented by the double-sided arrow and the value of 1. As you can see, not every factor has an arrow to every observed variable.
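The uniqueness calculation can be applied to every stock at once. A short Python sketch (for illustration only; the analysis itself was run in SAS) using the Final Communality Estimates listed above:

```python
# Final Communality Estimates copied from the table above.
communality = {
    "return_BAC": 0.59863104, "return_BHI": 0.64577915,
    "return_CVX": 0.61083713, "return_DD": 0.54062043,
    "return_DOW": 0.45584934, "return_HAL": 0.67359085,
    "return_HES": 0.51974549, "return_HUN": 0.36982204,
    "return_JPM": 0.58188382, "return_SLB": 0.72509913,
    "return_WFC": 0.61795857, "return_XOM": 0.59221697,
}

# Uniqueness = 1 - communality (the variance the common factors leave unexplained).
uniqueness = {name: 1.0 - h2 for name, h2 in communality.items()}

for name, u in sorted(uniqueness.items(), key=lambda item: item[1]):
    print(f"{name:11s} uniqueness = {u:.2f}")
print(f"Total communality = {sum(communality.values()):.6f}")
```

Sorting by uniqueness puts the stocks best explained by the two factors at the top of the listing.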
Step 4: Maximum Likelihood Factor Analysis with Varimax Rotation
In the maximum likelihood estimation approach, the expected relations between the factors and the endogenous variables are explicitly stated and goodness-of-fit criteria are applied to determine whether smaller coefficients can be removed. The process is repeated until the fit stops improving, and this initial solution is not rotated. Using maximum likelihood estimation to estimate the common factors with the varimax rotation, the same number of factors as in the Principal Factor Analysis is retained: two. Similar to before, the prior communality estimates are:
Prior Communality Estimates: SMC
return_BAC | return_BHI | return_CVX | return_DD | return_DOW | return_HAL | return_HES | return_HUN | return_JPM | return_SLB | return_WFC | return_XOM |
0.58577906 | 0.61046627 | 0.64179539 | 0.54681402 | 0.47197670 | 0.64986770 | 0.49976057 | 0.39225125 | 0.58034671 | 0.68269067 | 0.57372531 | 0.62696933 |
The eigenvalues are:
Preliminary Eigenvalues: Total = 16.9350893 Average = 1.41125745
No. | Eigenvalue | Difference | Proportion | Cumulative |
1 | 14.9446192 | 12.7338755 | 0.8825 | 0.8825 |
2 | 2.2107436 | 1.3691513 | 0.1305 | 1.0130 |
3 | 0.8415924 | 0.1303533 | 0.0497 | 1.0627 |
4 | 0.7112391 | 0.6898126 | 0.0420 | 1.1047 |
5 | 0.0214265 | 0.1469738 | 0.0013 | 1.1060 |
6 | -0.1255473 | 0.0709849 | -0.0074 | 1.0986 |
7 | -0.1965322 | 0.0239638 | -0.0116 | 1.0869 |
8 | -0.2204960 | 0.0345938 | -0.0130 | 1.0739 |
9 | -0.2550898 | 0.0286557 | -0.0151 | 1.0589 |
10 | -0.2837455 | 0.0587253 | -0.0168 | 1.0421 |
11 | -0.3424708 | 0.0281790 | -0.0202 | 1.0219 |
12 | -0.3706498 | | -0.0219 | 1.0000 |
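Both eigenvalue totals reported so far can be traced back to the prior communality estimates. The Python sketch below rests on our reading of the numbers rather than on anything stated in the SAS output shown here: we assume the Step 2 total is the sum of the SMCs (the trace of the reduced correlation matrix) and that the preliminary maximum likelihood total is the sum of SMC / (1 - SMC), that is, each communality weighted by the reciprocal of its uniqueness.

```python
# Prior Communality Estimates (SMC) copied from the table above.
smc = {
    "return_BAC": 0.58577906, "return_BHI": 0.61046627,
    "return_CVX": 0.64179539, "return_DD": 0.54681402,
    "return_DOW": 0.47197670, "return_HAL": 0.64986770,
    "return_HES": 0.49976057, "return_HUN": 0.39225125,
    "return_JPM": 0.58034671, "return_SLB": 0.68269067,
    "return_WFC": 0.57372531, "return_XOM": 0.62696933,
}

# Assumed: trace of the reduced correlation matrix (principal factor total).
trace_reduced = sum(smc.values())

# Assumed: trace of the weighted reduced matrix (preliminary ML total).
weighted_total = sum(h2 / (1.0 - h2) for h2 in smc.values())

print(f"Sum of SMC:             {trace_reduced:.8f}")
print(f"Sum of SMC / (1 - SMC): {weighted_total:.4f}")
```

The two sums can be compared with the totals of 6.86244298 and 16.9350893 in the two eigenvalue tables; the weighting explains why the preliminary eigenvalues are larger while their proportions stay almost the same.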
Goodness-of-Fit Statistics
The ML factor analysis provides much more statistical information on the null and alternative hypotheses than the Principal Factor Analysis. Unlike the principal factor analysis with varimax rotation, the maximum likelihood factor analysis with varimax rotation provides hypothesis tests from both a macro perspective (no common factors) and a micro perspective (the retained number of factors is sufficient), along with goodness-of-fit statistics.
As reflected in the SAS output, we reject both null hypotheses: that there are no common factors and that 2 factors are sufficient. Additionally, we can view and compare Akaike’s Information Criterion (AIC) and Schwarz’s Bayesian Criterion (BIC), where the model with the lowest value is preferred (both AIC and BIC penalize the model based on the number of parameters in the model).
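For the test that m factors are sufficient, the chi-square statistic has ((p - m)^2 - (p + m)) / 2 degrees of freedom for p variables, which is the standard result for the maximum likelihood factor model. A small Python sketch (for illustration only; the function name ml_factor_test_df is ours) for the factor counts considered in this analysis:

```python
def ml_factor_test_df(p, m):
    """Degrees of freedom for testing that m factors suffice for p variables."""
    return ((p - m) ** 2 - (p + m)) // 2


n_variables = 12  # the 12 stock return series
df_by_factors = {m: ml_factor_test_df(n_variables, m) for m in (2, 3, 4, 5)}

for m, df in df_by_factors.items():
    print(f"{m} factors: df = {df}")
```

Each additional factor uses up degrees of freedom, which is the same trade-off that AIC and BIC penalize when models with different numbers of factors are compared.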
The same number of common factors are suggested by SAS regardless of whether the maximum likelihood method or the principal factor analysis is used. Additionally, the factor analysis differences are very minor and the goodness-of-fit criteria are provided and can be easily assessed and compared.
From above, the eigenvalues decrease quickly, reflecting that the factors after the first two are small and likely not important. The variance explained by each factor is balanced, which is also similar to before. Additionally, the Rotated Factor Pattern weighting shows negligible change, with factor 1 increasing from 50.07% to 50.91% and factor 2 decreasing from 49.93% to 49.09%.
Step 5: Maximum Likelihood Factor Analysis with Varimax Rotation and MAX argument for the PRIORS option
Unlike Step 4, using the MAX argument with the PRIORS option provides two different outputs – one with five factors and the second with four factors.
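With PRIORS=MAX, SAS sets the prior communality estimate of each variable to its maximum absolute correlation with any other variable, instead of the squared multiple correlation used in the earlier steps. A minimal Python sketch of that rule, using a small hypothetical correlation matrix rather than the stock data (the function name max_abs_corr_priors is ours):

```python
def max_abs_corr_priors(corr):
    """Prior communality per variable: largest |correlation| with any other variable."""
    n = len(corr)
    return [
        max(abs(corr[i][j]) for j in range(n) if j != i)
        for i in range(n)
    ]


# Hypothetical 3 x 3 correlation matrix, for illustration only.
corr = [
    [1.0, 0.6, -0.3],
    [0.6, 1.0, 0.5],
    [-0.3, 0.5, 1.0],
]

priors = max_abs_corr_priors(corr)
print(priors)
```

Different priors change the diagonal of the reduced correlation matrix, which is one way the number of retained factors can differ from Step 4.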
Five Factors:
Combining the output of the different factors with the sector information, highlighting the largest loadings and relating them to the sectors, the grouping becomes very apparent. However, factor 5 does not have any clear value based on its loadings: it does not appear to group any of the variables, its highest value is .2490, which is not significant, and it shows a mix of positive and negative values. It has no clearly high or low factor loadings, making it difficult to interpret.
The variance explained by each factor is:
Variance Explained by Each Factor
Factor | Weighted | Unweighted |
Factor1 | 9.48177257 | 2.55119512 |
Factor2 | 6.95572063 | 2.08400430 |
Factor3 | 5.26449075 | 1.82173920 |
Factor4 | 5.80237050 | 1.59069819 |
Factor5 | 0.31984016 | 0.12246466 |
Based on the above table, the low weighted value of factor 5 is not surprising, and the factor is thus not likely important.
For interest's sake, each of the factors and weightings can be viewed graphically:
As we can see, Factor 5 does not have a cluster of very high or low factor loadings making it difficult to interpret.
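To put the size of factor 5 in perspective, its share of the variance explained by the five factors can be computed from the Variance Explained by Each Factor table. A short Python sketch (for illustration only; the analysis itself was run in SAS):

```python
# Variance Explained by Each Factor (five-factor solution), from the table above.
weighted = {
    "Factor1": 9.48177257, "Factor2": 6.95572063, "Factor3": 5.26449075,
    "Factor4": 5.80237050, "Factor5": 0.31984016,
}
unweighted = {
    "Factor1": 2.55119512, "Factor2": 2.08400430, "Factor3": 1.82173920,
    "Factor4": 1.59069819, "Factor5": 0.12246466,
}


def shares(variances):
    """Each factor's share of the total variance explained by all factors."""
    total = sum(variances.values())
    return {name: value / total for name, value in variances.items()}


weighted_share = shares(weighted)
unweighted_share = shares(unweighted)

for name in weighted:
    print(f"{name}: weighted {weighted_share[name]:.2%}, "
          f"unweighted {unweighted_share[name]:.2%}")
```

Factor 5 contributes only a percent or so of the explained variance on either scale, which is consistent with it being dropped in the four-factor solution.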
Four Factors
Similar to the five-factor solution, evaluating the factor pattern by industry again shows the same grouping, except that we only have four factors and the fifth (which was meaningless) has been eliminated.
As we would expect, the weightings are not significantly different except that factor 5 has been eliminated.
Variance Explained by Each Factor
Factor | Weighted | Unweighted |
Factor1 | 9.39289839 | 2.57196785 |
Factor2 | 6.69316710 | 2.05723775 |
Factor3 | 5.10330088 | 1.83137584 |
Factor4 | 5.71876627 | 1.56282992 |
Comparing the four factors to the five factors, the four-factor solution is significantly easier to interpret without factor 5.
Comparing five factors to four factors, SAS reports that the convergence criterion has been satisfied for both, so both models are valid factor analyses.
Based on the above, we fail to reject the null hypothesis.
By forcing SAS to use only four factors, we have better loadings and increased interpretability. Thus, the four factors produce a valid factor analysis: the proportion of variance accounted for has been maximized with fewer factors, the variables loading on each factor share a conceptual meaning tied to the industry sector and, finally, the rotated factors reflect a simple structure.
This suggests that there is a sensitivity of common factor estimation to the prior estimates of the communalities and the effect of only having a small amount of data which increases the sensitivity of the estimates and weights (SAS/ETS(R) 13.1 User’s Guide. The ENTROPY Procedure, 10 Apr. 2014. Web. 14 May 2017). Having additional data available for estimation would reduce this sensitivity.
Three Factors
For comparison, forcing SAS to use three factors produces the following results:
Based on the p-value, the null hypothesis that three factors are sufficient would be rejected, so three factors are not sufficient.
Conclusion
In conclusion, using a stock portfolio data set and factor analysis to identify sectors in the stock market, we transformed the closing prices into log-returns to explain the variation in the log-returns of the stocks and the market index. Initially, a Principal Factor Analysis was conducted without a factor rotation, which resulted in SAS retaining two factors. The cumulative proportion of the eigenvalues exceeded 100%, which was not an error, and a scree plot and a variance explained plot were provided to visualize the results, along with an initial factor pattern plot that was difficult to interpret due to the high loadings on factor 1, the mixed signs and the lack of identifiable clusters. Characteristics of a simple factor structure were outlined to conclude that the unrotated principal factor analysis did not have a simple factor structure.
Next, a varimax rotation was applied and, while the number of factors selected by SAS remained the same at two, the correlation values shifted and the rotated factor pattern was easier to interpret, with the clusters relating to the different sectors. Additionally, a path diagram was provided for a different perspective on the explained variance and uniqueness. A maximum likelihood estimation was also provided which, again, resulted in retaining two factors but with additional information on the goodness-of-fit statistics, including Akaike’s Information Criterion (AIC) and Schwarz’s Bayesian Criterion (BIC), that could be used as a basis of comparison. When compared to the principal factor analysis with varimax rotation, the changes in the various loadings and eigenvalues were negligible.
Finally, a maximum likelihood analysis with a varimax rotation was again conducted but with the ‘max’ argument for the ‘priors’ option in SAS. This resulted in five-factor and four-factor solutions. The five factors were interesting as Factor 5 did not have the typical loading values and, unlike the other factors, did not seem to relate to a sector. For visual purposes, various plots of the different factors, including Factor 5, were provided to see how it weighted against the other factors. The four-factor solution was similar to the five-factor solution but eliminated Factor 5, making it easy to match each of the four factors to one of four industry sectors. The Goodness-of-Fit statistics were provided for both the five-factor and four-factor solutions for ease of comparison, failing to reject the null hypothesis from both the macro perspective, with no common factors, and the micro perspective, with 4 or 5 factors being sufficient. Finally, for comparison purposes, three factors were provided, where the p-value was <.0001 and thus the null hypothesis that three factors are sufficient would be rejected.
References
Everitt, B.S. (2010). Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences. CRC Press. USA.
Rummel, R.J. (2017). Understanding Factor Analysis. Retrieved from https://www.hawaii.edu/powerkills/UFA.HTM
SAS/ETS(R) 13.1 User’s Guide. The ENTROPY Procedure, 10 Apr. 2014. Web. 14 May 2017. Retrieved from: http://support.sas.com/documentation/cdl/en/etsug/66840/HTML/default/viewer.htm#etsug_entropy_gettingstarted03.htm