# Cluster Analysis on Transformed Predictor Variables

Cluster analysis groups a set of objects so that objects in the same group are more similar to each other, in some sense, than to those in other groups.  Clusters are identified by assessing the relative distances between points, the relative homogeneity of each cluster and the degree of separation between clusters (Everitt, 2010, p. 241).

This analysis of European employment data proceeds in four steps.  First, an initial correlation analysis identifies the most strongly correlated variables, and a scatterplot matrix is examined for visible clusters.  Second, a principal component analysis is run on the nine variables, with a threshold of 80% explained variability, to determine whether the dimensionality can be reduced and how many components result.  Third, a cluster analysis determines the number of clusters within our European regions using three statistical measures: the Cubic Clustering Criterion, Pseudo F and Pseudo T-Squared.  Scatterplots of three to five clusters are provided for comparison.  Fourth, these are compared to a cluster analysis using the SAS cluster command.  Finally, the two methods and the different cluster sizes are compared to determine the preferred cluster method and size.

## Step 1: Initial Correlation Analysis

Using the european_employment dataset, the simple statistics and Pearson correlation coefficients provide some insight into the data. There are nine variables, each representing a type of industry. Each of the 30 countries belongs to one of four groups: EU for the European Union, EFTA for the European Free Trade Association, Eastern for the Eastern European nations, and Other. Other consists of four countries (Cyprus, Gibraltar, Malta and Turkey), which we could rename as the Mediterranean region. Alternatively, they could be assigned to an existing category as follows:

• Cyprus: EU
• Gibraltar: EU, as a dependent territory of the UK, which belongs to the EU
• Malta: EU
• Turkey: EFTA
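The alternative assignment listed above amounts to a small lookup table. A minimal Python sketch of that recoding follows; the analysis itself was run in SAS, and the names `RECODE_OTHER` and `recode_group` are hypothetical, introduced only for illustration.

```python
# Hypothetical recoding of the four "Other" countries, following the
# alternative assignment listed above (not part of the original SAS code).
RECODE_OTHER = {
    "Cyprus": "EU",
    "Gibraltar": "EU",  # as a dependent territory of the UK
    "Malta": "EU",
    "Turkey": "EFTA",
}


def recode_group(country: str, group: str) -> str:
    """Return the reassigned group for 'Other' countries, else the original group."""
    if group == "Other":
        return RECODE_OTHER.get(country, group)
    return group


print(recode_group("Malta", "Other"))
print(recode_group("Albania", "Eastern"))
```

Countries that are not in the "Other" group pass through unchanged, so the recoding can be applied to every row of the dataset.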

The nine variables and the industries they abbreviate are:

• AGR: Agriculture
• MIN: Mining
• MAN: Manufacturing
• PS: Power and water supply
• CON: Construction
• SER: Services
• FIN: Finance
• SPS: Social and personal services
• TC: Transport and communications

As the Pearson coefficients further below show, there are both positive and negative correlations. First, the simple statistics:

Simple Statistics

| Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
|---|---|---|---|---|---|---|
| AGR | 30 | 12.18667 | 12.30690 | 365.60000 | 0 | 55.50000 |
| MIN | 30 | 3.44667 | 8.86573 | 103.40000 | 0 | 37.30000 |
| MAN | 30 | 20.28667 | 9.45679 | 608.60000 | 0 | 38.70000 |
| PS | 30 | 0.80000 | 0.62090 | 24.00000 | 0 | 2.20000 |
| CON | 30 | 7.53000 | 2.73309 | 225.90000 | 0.60000 | 16.90000 |
| SER | 30 | 15.63667 | 5.16016 | 469.10000 | 3.30000 | 24.50000 |
| FIN | 30 | 6.65000 | 3.98668 | 199.50000 | 0 | 15.30000 |
| SPS | 30 | 26.99333 | 8.73206 | 809.80000 | 0 | 41.60000 |
| TC | 30 | 6.45333 | 1.23337 | 193.60000 | 3.00000 | 8.80000 |

The Pearson coefficients:

Pearson Correlation Coefficients, N = 30. Each cell shows r, with Prob > |r| under H0: Rho=0 in parentheses.

| | AGR | MIN | MAN | PS | CON | SER | FIN | SPS | TC |
|---|---|---|---|---|---|---|---|---|---|
| AGR | 1.00000 | 0.31607 (0.0888) | -0.25439 (0.1749) | -0.38236 (0.0370) | -0.34861 (0.0590) | -0.60471 (0.0004) | -0.17575 (0.3529) | -0.81148 (<.0001) | -0.48733 (0.0063) |
| MIN | 0.31607 (0.0888) | 1.00000 | -0.67193 (<.0001) | -0.38738 (0.0344) | -0.12902 (0.4968) | -0.40655 (0.0258) | -0.24806 (0.1863) | -0.31642 (0.0885) | 0.04470 (0.8146) |
| MAN | -0.25439 (0.1749) | -0.67193 (<.0001) | 1.00000 | 0.38789 (0.0342) | -0.03446 (0.8565) | -0.03294 (0.8628) | -0.27374 (0.1433) | 0.05028 (0.7919) | 0.24290 (0.1959) |
| PS | -0.38236 (0.0370) | -0.38738 (0.0344) | 0.38789 (0.0342) | 1.00000 | 0.16480 (0.3842) | 0.15498 (0.4135) | 0.09431 (0.6201) | 0.23774 (0.2059) | 0.10537 (0.5795) |
| CON | -0.34861 (0.0590) | -0.12902 (0.4968) | -0.03446 (0.8565) | 0.16480 (0.3842) | 1.00000 | 0.47308 (0.0083) | -0.01802 (0.9247) | 0.07201 (0.7053) | -0.05461 (0.7744) |
| SER | -0.60471 (0.0004) | -0.40655 (0.0258) | -0.03294 (0.8628) | 0.15498 (0.4135) | 0.47308 (0.0083) | 1.00000 | 0.37928 (0.0387) | 0.38798 (0.0341) | -0.08489 (0.6556) |
| FIN | -0.17575 (0.3529) | -0.24806 (0.1863) | -0.27374 (0.1433) | 0.09431 (0.6201) | -0.01802 (0.9247) | 0.37928 (0.0387) | 1.00000 | 0.16602 (0.3806) | -0.39132 (0.0325) |
| SPS | -0.81148 (<.0001) | -0.31642 (0.0885) | 0.05028 (0.7919) | 0.23774 (0.2059) | 0.07201 (0.7053) | 0.38798 (0.0341) | 0.16602 (0.3806) | 1.00000 | 0.47492 (0.0080) |
| TC | -0.48733 (0.0063) | 0.04470 (0.8146) | 0.24290 (0.1959) | 0.10537 (0.5795) | -0.05461 (0.7744) | -0.08489 (0.6556) | -0.39132 (0.0325) | 0.47492 (0.0080) | 1.00000 |

As shown above, the strongest correlation, with a p-value of <.0001, is between AGR and SPS at -.81148, followed by -.67193 between MAN and MIN.  The related matrix of scatterplots is shown below, where we are looking for clusters or groups of data points.  The only plot that seems to have two or three distinct groups of data is the SER and FIN plot.
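Ranking the coefficients by absolute value is a quick way to confirm which pairs are most strongly related. The sketch below is in Python rather than SAS and simply re-enters the upper triangle of the Pearson table above; the names `VARS`, `UPPER` and `ranked` are illustrative.

```python
# Upper triangle of the Pearson table above, row by row
# (AGR vs MIN..TC, then MIN vs MAN..TC, and so on).
VARS = ["AGR", "MIN", "MAN", "PS", "CON", "SER", "FIN", "SPS", "TC"]
UPPER = [
    [0.31607, -0.25439, -0.38236, -0.34861, -0.60471, -0.17575, -0.81148, -0.48733],
    [-0.67193, -0.38738, -0.12902, -0.40655, -0.24806, -0.31642, 0.04470],
    [0.38789, -0.03446, -0.03294, -0.27374, 0.05028, 0.24290],
    [0.16480, 0.15498, 0.09431, 0.23774, 0.10537],
    [0.47308, -0.01802, 0.07201, -0.05461],
    [0.37928, 0.38798, -0.08489],
    [0.16602, -0.39132],
    [0.47492],
]

# Expand the triangle into (variable, variable, r) triples.
pairs = [
    (VARS[i], VARS[i + 1 + j], r)
    for i, row in enumerate(UPPER)
    for j, r in enumerate(row)
]

# Strongest relationships first, regardless of sign.
ranked = sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

for a, b, r in ranked[:3]:
    print(f"{a}-{b}: {r:+.5f}")
# AGR-SPS: -0.81148
# MIN-MAN: -0.67193
# AGR-SER: -0.60471
```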

cluster analysis scatter plot

The scatterplot for SPS and AGR is shown below (left) and, for comparison, the one for MAN and MIN (below right).  The differences are very apparent.  Albania is clearly an outlier, and neither plot shows any grouping that follows the EU, EFTA, Eastern and Other groups.

Cluster Analysis Scatterplot – two variables

Cluster Analysis Scatterplot – two variables

## Step 2:  Principal Components Analysis & Reduction of Dimensionality

The eigenvectors are the axes along which a linear transformation stretches, compresses or flips the data; the eigenvalues give the factors of that stretching or compression.  Using the eigenvectors as scoring coefficients creates components whose variances equal the eigenvalues, as follows:

Eigenvectors

| | Prin1 | Prin2 | Prin3 | Prin4 | Prin5 | Prin6 | Prin7 | Prin8 | Prin9 |
|---|---|---|---|---|---|---|---|---|---|
| AGR | -.511492 | 0.023475 | -.278591 | 0.016492 | -.024038 | 0.042397 | -.163574 | 0.540409 | 0.582036 |
| MIN | -.374983 | -.000491 | 0.515052 | 0.113606 | 0.346313 | -.198574 | 0.212590 | -.448592 | 0.418818 |
| MAN | 0.246161 | -.431752 | -.502056 | 0.058270 | -.233622 | 0.030917 | 0.236015 | -.431757 | 0.447086 |
| PS | 0.316120 | -.109144 | -.293695 | 0.023245 | 0.854448 | -.206471 | -.060565 | 0.155122 | 0.030251 |
| CON | 0.221599 | 0.242471 | 0.071531 | 0.782666 | 0.062151 | 0.502636 | -.020285 | 0.030823 | 0.128656 |
| SER | 0.381536 | 0.408256 | 0.065149 | 0.169038 | -.266673 | -.672694 | 0.174839 | 0.201753 | 0.245021 |
| FIN | 0.131088 | 0.552939 | -.095654 | -.489218 | 0.131288 | 0.405935 | 0.457645 | -.027264 | 0.190758 |
| SPS | 0.428162 | -.054706 | 0.360159 | -.317243 | -.045718 | 0.158453 | -.621330 | -.041476 | 0.410315 |
| TC | 0.205071 | -.516650 | 0.412996 | -.042063 | -.022901 | 0.141898 | 0.492145 | 0.502124 | 0.060743 |
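Eigenvectors reported by a principal component analysis are normally unit length and mutually orthogonal, which is what makes the component variances equal the eigenvalues. As a rough sanity check, the Python sketch below (Python rather than SAS, with the Prin1 and Prin2 columns re-entered by hand from the table above) confirms both properties to within rounding.

```python
import math

# Prin1 and Prin2 columns of the eigenvector table, in the order
# AGR, MIN, MAN, PS, CON, SER, FIN, SPS, TC.
PRIN1 = [-0.511492, -0.374983, 0.246161, 0.316120, 0.221599,
         0.381536, 0.131088, 0.428162, 0.205071]
PRIN2 = [0.023475, -0.000491, -0.431752, -0.109144, 0.242471,
         0.408256, 0.552939, -0.054706, -0.516650]


def dot(u, v):
    """Inner product of two equal-length coefficient vectors."""
    return sum(a * b for a, b in zip(u, v))


norm1 = math.sqrt(dot(PRIN1, PRIN1))  # expected to be close to 1
norm2 = math.sqrt(dot(PRIN2, PRIN2))  # expected to be close to 1
cross = dot(PRIN1, PRIN2)             # expected to be close to 0

print(f"|Prin1| = {norm1:.4f}, |Prin2| = {norm2:.4f}, Prin1.Prin2 = {cross:.4f}")
```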

A principal component score is calculated by using a component's column of eigenvector coefficients as weights on the standardized variables.  For example, the formula for the first component (Prin1) would be:

(-.511492 x AGR) + (-.374983 x MIN) + (.246161 x MAN) + (.316120 x PS) + (.221599 x CON) + (.381536 x SER) + (.131088 x FIN) + (.428162 x SPS) + (.205071 x TC)
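A component score is a weighted sum of the standardized variables, with the weights taken from that component's column of the eigenvector table. The Python sketch below works this through for Prin1, standardizing with the means and standard deviations from the Simple Statistics table. It assumes the analysis was based on the correlation matrix, as the eigenvalue table heading indicates; the function name `prin1_score` and the sample observations are hypothetical.

```python
# Order of all lists: AGR, MIN, MAN, PS, CON, SER, FIN, SPS, TC.
MEAN = [12.18667, 3.44667, 20.28667, 0.80000, 7.53000,
        15.63667, 6.65000, 26.99333, 6.45333]
STD = [12.30690, 8.86573, 9.45679, 0.62090, 2.73309,
       5.16016, 3.98668, 8.73206, 1.23337]
PRIN1 = [-0.511492, -0.374983, 0.246161, 0.316120, 0.221599,
         0.381536, 0.131088, 0.428162, 0.205071]


def prin1_score(x):
    """Prin1 score: eigenvector weights applied to standardized values."""
    z = [(xi - m) / s for xi, m, s in zip(x, MEAN, STD)]
    return sum(w * zi for w, zi in zip(PRIN1, z))


# A hypothetical country sitting exactly at every mean scores zero.
print(prin1_score(MEAN))

# A hypothetical country one standard deviation above the mean in AGR
# only picks up just the AGR weight, roughly -0.5115.
high_agr = MEAN.copy()
high_agr[0] = MEAN[0] + STD[0]
print(round(prin1_score(high_agr), 6))
```

The negative AGR weight means that heavily agricultural economies fall on the negative side of the first component.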

The number of dimensions to keep under an 80% variability threshold can be determined from the eigenvalues of the correlation matrix, specifically the cumulative column, and from the scree plot.

Eigenvalues of the Correlation Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 3.11225795 | 1.30302071 | 0.3458 | 0.3458 |
| 2 | 1.80923724 | 0.31301704 | 0.2010 | 0.5468 |
| 3 | 1.49622020 | 0.43277636 | 0.1662 | 0.7131 |
| 4 | 1.06344384 | 0.35318631 | 0.1182 | 0.8312 |
| 5 | 0.71025753 | 0.39891874 | 0.0789 | 0.9102 |
| 6 | 0.31133879 | 0.01791787 | 0.0346 | 0.9448 |
| 7 | 0.29342091 | 0.08960446 | 0.0326 | 0.9774 |
| 8 | 0.20381645 | 0.20380935 | 0.0226 | 1.0000 |
| 9 | 0.00000710 | | 0.0000 | 1.0000 |

When the variables are highly correlated, significant dimension reduction is possible, and the first two components will often account for 80% of the variance.  However, the table above shows that there is no dominant principal component: the first two components together account for only .5468 of the variance.  To explain 80% of the variation, four components are required.

The scree plot and variance explained plots:

Cluster Analysis Scree Plot

Based on a threshold of 80%, four principal components will explain over 80% of the variability.  If the 80% threshold is too low, adding one more principal component, for a total of five, would explain over 90% of the variability.  The scree plot is an alternative: look for the elbows in the line.  A third method of determining the number of principal components is to count the eigenvalues of the correlation matrix that exceed 1.0.  As shown above, four eigenvalues (3.11, 1.81, 1.50, 1.06) exceed one, which would also result in four principal components.
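Both the cumulative-variance rule and the eigenvalue-greater-than-one rule can be reproduced directly from the eigenvalue table. The following is a small Python sketch (not the SAS used for the analysis); the function name `components_for` is illustrative.

```python
# Eigenvalues of the correlation matrix, copied from the table above.
EIGENVALUES = [3.11225795, 1.80923724, 1.49622020, 1.06344384, 0.71025753,
               0.31133879, 0.29342091, 0.20381645, 0.00000710]


def components_for(threshold, eigenvalues):
    """Smallest number of leading components whose cumulative share
    of the total variance reaches the threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        running += ev
        if running / total >= threshold:
            return k
    return len(eigenvalues)


# Count of eigenvalues exceeding 1.0 (the third rule described above).
over_one = sum(1 for ev in EIGENVALUES if ev > 1.0)

print(components_for(0.80, EIGENVALUES))  # 4
print(components_for(0.90, EIGENVALUES))  # 5
print(over_one)                           # 4
```

The eigenvalues sum to approximately nine, the number of variables, which is why each proportion in the table is simply the eigenvalue divided by nine.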

Cluster Analysis Component Patterns

While it is difficult to determine which of the components is the most horizontal, a close look shows that component 9 (a red line has been added for ease of viewing) has the shallowest valleys and hills, indicating that it is the component with a high correlation across all of the variables.

Cluster Analysis Component Scores

The scatterplot of principal component scores for the second and third components uses color to show density based on the first component, moving from blue at minimum density to pink at medium density and red at maximum density.

The plot below reflects two components accounting for 54.68% of the variability (component 1 with 34.58% and component 2 with 20.1%).  A two-dimensional scatterplot makes the nine-dimensional clusters, and how strongly the points within them are related, much easier to view than, say, a three-dimensional plot.

Cluster Analysis Component Pattern

## Step 3:  Cluster Analysis

The scatterplot of FIN and SER seems to show three clusters.

Cluster Analysis Grouping

By comparison, the scatterplot of MAN and SER appears to show four clusters.

Cluster Analysis Grouping

As the two scatterplots show, one of finance (FIN) against services (SER) and one of manufacturing (MAN) against services (SER), different projections of the data produce different clusterings, which we need to keep in mind.

The cluster analysis output from SAS shows a hierarchical clustering of the 30 countries, starting from the bottom and moving upwards.  It clusters the countries by their distance: the semi-partial R-Square reflects the loss of homogeneity due to combining groups to form a cluster, and the R-Squared measures the extent to which the clusters differ from each other.
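The bottom-up merging described above can be illustrated with a small average-linkage routine. This is a simplified Python sketch on hypothetical two-dimensional points, not a reimplementation of PROC CLUSTER: it uses plain Euclidean distance, and details such as distance squaring and tie handling may differ from what SAS does.

```python
import math
from itertools import combinations


def average_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest average pairwise distance until n_clusters remain.
    Returns clusters as sorted lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average distance over every cross-cluster pair of points.
            total = sum(
                math.dist(points[i], points[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical points: two tight groups and one distant outlier.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (30, 0)]
print(average_linkage(pts, 3))  # [[0, 1, 2], [3, 4], [5]]
```

The distant point ends up as a cluster of its own, much as Albania does in the results that follow.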

Cluster Analysis Table CCC Pseudo F Pseudo t squared

Graphically, the cubic clustering criterion, pseudo F and pseudo t-squared, with peaks noted:

Cluster Analysis Criteria for Number of Clusters

The Cubic Clustering Criterion (CCC) compares the R-Squared for a specific number of clusters with the R-Squared attained by clustering a uniformly distributed set of points.  Thus, it can be interpreted much like the R-Squared statistic: a larger value indicates that clusters are present.  The CCC above shows a local peak at three clusters and the maximum peak at five clusters.

Pseudo F is similar to R-Squared and measures how separated the clusters are, as the ratio of between-cluster variance to within-cluster variance.  The larger the value, the more significant and the better the clustering.  The graphic shows the ratio moving from 40 at three clusters to 50 at five clusters.
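The ratio behind the pseudo F can be written out in a few lines. The Python sketch below uses the common form (between-cluster sum of squares divided by k - 1, over within-cluster sum of squares divided by n - k) on hypothetical points; it is meant to illustrate the idea rather than reproduce the SAS output, and the helper names are illustrative.

```python
def sum_sq(points):
    """Sum of squared distances from each point to the centroid."""
    n = len(points)
    centroid = [sum(coord) / n for coord in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, centroid))


def pseudo_f(clusters):
    """Between-cluster variance over within-cluster variance."""
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    total = sum_sq(all_points)                  # total sum of squares
    within = sum(sum_sq(c) for c in clusters)   # pooled within-cluster SS
    between = total - within
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical data: two compact, well-separated clusters.
clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(pseudo_f(clusters))  # 50.0
```

Moving the two clusters closer together shrinks the between-cluster term and lowers the statistic, which is why larger values indicate better separation.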

The Pseudo T-Squared is an index that quantifies the difference in the ratio of between-cluster variance to within-cluster variance once clusters are merged (Simmons, 2015).  In the table, the smaller the number the better.  In the plot, the optimal number of clusters is found by reading from right to left, locating an upsurge and adding one to the cluster count at that point.  Above there are two upsurges, so the optimal number of clusters would be five (four plus one) or three (two plus one).
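The pseudo t-squared for a single merge compares the increase in within-cluster sum of squares caused by the merge with the pooled within-cluster variance of the two clusters being joined. The Python sketch below follows that commonly documented form on hypothetical points; the function names are illustrative and the data are not from the employment dataset.

```python
def sum_sq(points):
    """Sum of squared distances from each point to the centroid."""
    n = len(points)
    centroid = [sum(coord) / n for coord in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, centroid))


def pseudo_t2(cluster_k, cluster_l):
    """Pseudo t-squared for merging cluster_k and cluster_l."""
    w_k, w_l = sum_sq(cluster_k), sum_sq(cluster_l)
    w_merged = sum_sq(cluster_k + cluster_l)
    increase = w_merged - w_k - w_l  # SS added by forcing the merge
    pooled = (w_k + w_l) / (len(cluster_k) + len(cluster_l) - 2)
    return increase / pooled


near = pseudo_t2([(0, 0), (0, 2)], [(4, 0), (4, 2)])
far = pseudo_t2([(0, 0), (0, 2)], [(10, 0), (10, 2)])
print(near, far)  # 8.0 50.0
```

A large value means the merge joined two clearly distinct clusters, which is why an upsurge at a given cluster count points to stopping one step earlier, at one more cluster.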

Based on the above CCC, Pseudo F and Pseudo T-Squared, five is the most appropriate number of clusters.

The scatterplots and trees will provide some additional insight into the number of clusters starting with three clusters and moving to five clusters.

Cluster Analysis Average Distance between Cluster

Moving from left (root) to the right, R-Squared approaches one and the clusters of countries are progressively joined to form larger clusters at the various numerical levels.  For example, Germany, France, Norway, Ireland and Belgium form a cluster at approximately the .2 level.

Viewed vertically:

### Three Clusters

The three-cluster solution consists of two clusters plus Albania as its own cluster.

Cluster Analysis Three Clusters

Table of GROUP by CLUSNAME (frequency)

| GROUP | Albania | CL3 | CL6 | Total |
|---|---|---|---|---|
| EFTA | 0 | 6 | 0 | 6 |
| EU | 0 | 12 | 0 | 12 |
| Eastern | 1 | 0 | 7 | 8 |
| Other | 0 | 2 | 2 | 4 |
| Total | 1 | 20 | 9 | 30 |

### Four Clusters

The four-cluster solution is an improvement over the three clusters, with Albania again being its own distinct cluster.

Cluster Analysis Four Variables

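Table of GROUP by CLUSNAME (frequency)

| GROUP | Albania | CL4 | CL5 | CL6 | Total |
|---|---|---|---|---|---|
| EFTA | 0 | 5 | 1 | 0 | 6 |
| EU | 0 | 10 | 2 | 0 | 12 |
| Eastern | 1 | 0 | 0 | 7 | 8 |
| Other | 0 | 1 | 1 | 2 | 4 |
| Total | 1 | 16 | 4 | 9 | 30 |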

### Five Clusters

Finally, the five-cluster solution, with Albania still remaining its own cluster.

Cluster Analysis Five Variables

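Table of GROUP by CLUSNAME (frequency)

| GROUP | Albania | CL10 | CL5 | CL6 | CL7 | Total |
|---|---|---|---|---|---|---|
| EFTA | 0 | 4 | 1 | 0 | 1 | 6 |
| EU | 0 | 5 | 2 | 0 | 5 | 12 |
| Eastern | 1 | 0 | 0 | 7 | 0 | 8 |
| Other | 0 | 0 | 1 | 2 | 1 | 4 |
| Total | 1 | 9 | 4 | 9 | 7 | 30 |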

As the tables above demonstrate, the number of countries in each European group remains the same, but how those countries are distributed across clusters, and the resulting cluster totals, change with the number of clusters.  The exception is Albania, which is always its own cluster.
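The margins of the three frequency tables can be checked with a few lines of Python. The sketch below re-enters the cell counts from the tables above (the SAS output itself came from PROC FREQ) and confirms that the group totals stay fixed while the cluster totals change; the variable names are illustrative.

```python
GROUPS = ["EFTA", "EU", "Eastern", "Other"]

# Cell counts from the three-, four- and five-cluster tables above,
# one list per group, columns in the order shown in each table.
TABLES = {
    3: {"EFTA": [0, 6, 0], "EU": [0, 12, 0],
        "Eastern": [1, 0, 7], "Other": [0, 2, 2]},
    4: {"EFTA": [0, 5, 1, 0], "EU": [0, 10, 2, 0],
        "Eastern": [1, 0, 0, 7], "Other": [0, 1, 1, 2]},
    5: {"EFTA": [0, 4, 1, 0, 1], "EU": [0, 5, 2, 0, 5],
        "Eastern": [1, 0, 0, 7, 0], "Other": [0, 0, 1, 2, 1]},
}

# Row margins: countries per group, for each number of clusters.
row_totals = {k: [sum(t[g]) for g in GROUPS] for k, t in TABLES.items()}

# Column margins: countries per cluster, for each number of clusters.
col_totals = {
    k: [sum(col) for col in zip(*(t[g] for g in GROUPS))]
    for k, t in TABLES.items()
}

for k in TABLES:
    print(k, row_totals[k], col_totals[k])
# 3 [6, 12, 8, 4] [1, 20, 9]
# 4 [6, 12, 8, 4] [1, 16, 4, 9]
# 5 [6, 12, 8, 4] [1, 9, 4, 9, 7]
```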

Additionally, while the membership of each cluster changed as the number of clusters increased, the three-cluster scatterplot seemed to lack sufficient detail.  Five clusters seem to have the right amount of detail: four clusters plus a fifth for Albania.

## Step 4:  Cluster Analysis using Cluster Commands

Cluster analysis using the cluster command in SAS provides additional scatterplots and related clusters for comparison.

### Three Clusters

The three-cluster solution has two outliers, which is very different from the prior three-cluster solution obtained without the SAS cluster command.

Cluster Analysis Scatterplot

Table of GROUP by CLUSNAME (frequency)

| GROUP | Albania | CL3 | Gibralta | Total |
|---|---|---|---|---|
| EFTA | 0 | 6 | 0 | 6 |
| EU | 0 | 12 | 0 | 12 |
| Eastern | 1 | 7 | 0 | 8 |
| Other | 0 | 3 | 1 | 4 |
| Total | 1 | 28 | 1 | 30 |

### Four Clusters

Two main clusters and two outlier clusters are reflected in the scatterplot.  However, the two main clusters seem a bit disjointed.

Cluster Analysis – Four Clusters

Table of GROUP by CLUSNAME (frequency)

| GROUP | Albania | CL3 | Gibralta | Total |
|---|---|---|---|---|
| EFTA | 0 | 6 | 0 | 6 |
| EU | 0 | 12 | 0 | 12 |
| Eastern | 1 | 7 | 0 | 8 |
| Other | 0 | 3 | 1 | 4 |
| Total | 1 | 28 | 1 | 30 |

### Five Clusters

The five clusters are not as disjointed as the four clusters.

Cluster Analysis – Five Cluster

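Table of GROUP by CLUSNAME (frequency)

| GROUP | Albania | CL5 | CL6 | CL7 | Gibralta | Total |
|---|---|---|---|---|---|---|
| EFTA | 0 | 0 | 0 | 6 | 0 | 6 |
| EU | 0 | 0 | 0 | 12 | 0 | 12 |
| Eastern | 1 | 4 | 3 | 0 | 0 | 8 |
| Other | 0 | 1 | 1 | 1 | 1 | 4 |
| Total | 1 | 5 | 4 | 19 | 1 | 30 |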

### Six Clusters

Finally, the six-cluster solution seems to have too many clusters, splitting CL5 from the five-cluster scatterplot into two different clusters.

Cluster Analysis – Six Clusters

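Table of GROUP by CLUSNAME (frequency)

| GROUP | Albania | CL11 | CL24 | CL6 | CL7 | Gibralta | Total |
|---|---|---|---|---|---|---|---|
| EFTA | 0 | 0 | 0 | 0 | 6 | 0 | 6 |
| EU | 0 | 0 | 0 | 0 | 12 | 0 | 12 |
| Eastern | 1 | 3 | 1 | 3 | 0 | 0 | 8 |
| Other | 0 | 0 | 1 | 1 | 1 | 1 | 4 |
| Total | 1 | 3 | 2 | 4 | 19 | 1 | 30 |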

Based on the cluster analysis using cluster commands, the three- and four-cluster solutions don't seem to produce enough detail, as one cluster is Albania alone.  With five clusters, we have a nice balance of detail between the different geographic areas, and with six clusters we have a bit too much detail: separating Malta and Yugoslavia into a new cluster doesn't add much value.  None of the clusters are aligned with the European Union geographical areas.

Overall, the SAS cluster command's results were not very effective in determining the clusters; the clusters in step three were more effective and more logical to interpret and understand.

## Conclusion

In conclusion, the European employment data was used to perform a cluster analysis beginning with an exploratory data analysis based on 30 different countries.  A scatterplot matrix was examined to determine if there were any obvious clusters along with a correlation analysis and scatterplots of two correlated variables, agriculture (AGR) and social and personal services industries (SPS) within areas in the European Union.  While the correlations were not very strong, a scatterplot provided an effective method to view the differences.

A principal component analysis based on eigenvectors and their resulting values, along with a scree plot and a variance explained plot, was also used to determine the appropriate number of principal components.  Based on a variance explained threshold of 80%, four principal components were determined to be required.  While we would prefer to have fewer components to explain the variability, reducing the original nine variables to four is not a large reduction and, in reality, a decision to use all nine variables should be considered.

A cluster analysis of the four European regions was also carried out, using the associated statistical measures: the Cubic Clustering Criterion, Pseudo F and Pseudo T-Squared.  Scatterplots of three to five clusters were reviewed, with a determination that five clusters were the most appropriate.

Finally, a similar comparison was conducted but ranging from three to six clusters and utilizing the SAS cluster command.  As the SAS cluster command resulted in having a unique cluster for Albania and another for Gibraltar, the SAS cluster command was not very effective in producing meaningful clusters.  As such, our preferred cluster method was the earlier method, not utilizing the SAS cluster command, and resulted in five clusters as it grouped the points close together with a meaningful way to reflect the data between the countries and the clusters.

# References

Everitt, B. S. (2010).  Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences.  Boca Raton, FL:  CRC Press.

Simmons, R. (2015, June 28).  ResearchGate: “Could someone help me decide the ideal no. of clusters from the pseudo t squared graph in SAS?”.  Retrieved from Research Gate.

## SAS Code

```sas
%let path=/sscc/home/c/ctb2523/410data/;
libname orion "&path";

data temp; set orion.european_employment;
proc contents; run;

* Step 1: Produce the scatterplot matrix;
ods graphics on;
title Correlation Structure of the Raw Data;
proc corr data=temp plot=matrix(histogram nvar=all);
run; quit;
title ;
ods graphics off;

ods graphics on;
proc sgplot data=temp;
title 'Scatterplot of Raw Data';
scatter y=AGR x=SPS / datalabel=country group=group; run;
ods graphics off;

** Step 2: Principal components analysis;
ods graphics on;
title Principal Components Analysis using PROC PRINCOMP;
proc princomp data=temp out=pca_9components outstat=eigenvectors plots=all; run;
ods graphics off;

** Step 3: Average-linkage clustering on FIN and SER;
ods graphics on;
proc cluster data=temp method=average outtree=tree1 pseudo ccc plots=all;
var fin ser;
id country; run; quit;
ods graphics off;

ods graphics on;
proc sgplot data=temp;
title 'Scatterplot of FIN and SER';
scatter y=FIN x=SER / datalabel=country group=group; run;
ods graphics off;

ods graphics on;
proc sgplot data=temp;
title 'Scatterplot of MAN and SER';
scatter y=MAN x=SER / datalabel=country group=group; run;
ods graphics off;

** Step 4: Cut the tree at five, four and three clusters;
ods graphics on;
title;
proc tree data=tree1 ncl=5 out=_5_clusters; copy man ser;
run; quit;
proc tree data=tree1 ncl=4 out=_4_clusters; copy man ser;
run; quit;
proc tree data=tree1 ncl=3 out=_3_clusters; copy man ser;
run; quit;
ods graphics off;

* Merge cluster assignments with group labels and tabulate group by cluster;
%macro makeTable(treeout,group,outdata);
data tree_data;
set &treeout.(rename=(_name_=country));
run;
proc sort data=tree_data; by country; run; quit;
data group_affiliation;
set &group.(keep=group country);
run;
proc sort data=group_affiliation; by country; run; quit;
data &outdata.;
merge tree_data group_affiliation; by country;
run;
proc freq data=&outdata.;
table group*clusname / nopercent norow nocol; run;
%mend makeTable;

* Call macro function;
%makeTable(treeout=_3_clusters,group=temp,outdata=_3_clusters_with_labels);
* Plot the clusters for a visual display;
ods graphics on;
proc sgplot data=_3_clusters_with_labels;
title 'Scatterplot of Raw Data';
scatter y=man x=ser / datalabel=country group=clusname; run; quit;
ods graphics off;
%makeTable(treeout=_4_clusters,group=temp,outdata=_4_clusters_with_labels);
ods graphics on;
proc sgplot data=_4_clusters_with_labels;
title 'Scatterplot of Raw Data';
scatter y=man x=ser / datalabel=country group=clusname; run; quit;
ods graphics off;
%makeTable(treeout=_5_clusters,group=temp,outdata=_5_clusters_with_labels);
ods graphics on;
proc sgplot data=_5_clusters_with_labels;
title 'Scatterplot of Raw Data';
scatter y=man x=ser / datalabel=country group=clusname; run; quit;
ods graphics off;

*** Cluster with all variables;
ods graphics on;
proc cluster data=temp method=average outtree=tree1 pseudo ccc plots=all;
var agr min man ps con ser fin sps tc;
id country; run; quit;
ods graphics off;

ods graphics on;
proc tree data=tree1 ncl=5 out=_5_clusters;
copy agr min man ps con ser fin sps tc;
run; quit;
proc tree data=tree1 ncl=4 out=_4_clusters;
copy agr min man ps con ser fin sps tc;
run; quit;
proc tree data=tree1 ncl=3 out=_3_clusters;
copy agr min man ps con ser fin sps tc;
run; quit;
ods graphics off;

* The makeTable macro defined above is reused here;
%makeTable(treeout=_3_clusters,group=temp,outdata=_3_clusters_with_labels);
* Plot the clusters for a visual display;
ods graphics on;
proc sgplot data=_3_clusters_with_labels;
title 'Scatterplot of Raw Data';
scatter y=fin x=ser / datalabel=country group=clusname; run; quit;
ods graphics off;
%makeTable(treeout=_4_clusters,group=temp,outdata=_4_clusters_with_labels);
ods graphics on;
proc sgplot data=_4_clusters_with_labels;
title 'Scatterplot of Raw Data';
scatter y=fin x=ser / datalabel=country group=clusname; run; quit;
ods graphics off;
%makeTable(treeout=_5_clusters,group=temp,outdata=_5_clusters_with_labels);
ods graphics on;
proc sgplot data=_5_clusters_with_labels;
title 'Scatterplot of Raw Data';
scatter y=fin x=ser / datalabel=country group=clusname; run; quit;
ods graphics off;

* Using the first 2 principal components;
ods graphics on;
proc cluster data=pca_9components method=average outtree=tree3 pseudo ccc plots=all;
var prin1 prin2; id country;
run; quit;
ods graphics off;

ods graphics on;
proc tree data=tree3 ncl=6 out=_6_clusters; copy prin1 prin2;
run; quit;
proc tree data=tree3 ncl=5 out=_5_clusters; copy prin1 prin2;
run; quit;
proc tree data=tree3 ncl=4 out=_4_clusters; copy prin1 prin2;
run; quit;
proc tree data=tree3 ncl=3 out=_3_clusters; copy prin1 prin2;
run; quit;
ods graphics off;

%makeTable(treeout=_3_clusters,group=temp,outdata=_3_clusters_with_labels);
%makeTable(treeout=_4_clusters,group=temp,outdata=_4_clusters_with_labels);
%makeTable(treeout=_5_clusters,group=temp,outdata=_5_clusters_with_labels);
%makeTable(treeout=_6_clusters,group=temp,outdata=_6_clusters_with_labels);

* Plot the clusters for a visual display;
ods graphics on;
proc sgplot data=_3_clusters_with_labels;
title 'Scatterplot of Prin Data';
scatter y=prin2 x=prin1 / datalabel=country group=clusname; run; quit;
proc sgplot data=_4_clusters_with_labels;
title 'Scatterplot of Prin Data';
scatter y=prin2 x=prin1 / datalabel=country group=clusname; run; quit;
proc sgplot data=_5_clusters_with_labels;
title 'Scatterplot of Prin Data';
scatter y=prin2 x=prin1 / datalabel=country group=clusname; run; quit;
proc sgplot data=_6_clusters_with_labels;
title 'Scatterplot of Prin Data';
scatter y=prin2 x=prin1 / datalabel=country group=clusname; run; quit;
ods graphics off;
```