Source: Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.

Tutorial Solutions – Week 5 (PCA)

Question 1:

a) Can the second PC explain more variation than the first PC when performing PCA on the correlation matrix?

Solution:
No. PCA fits the first PC to explain the maximum possible variance, and each subsequent PC is then fitted to the variance that remains unexplained. Each PC explains a different component of the overall variance, so the variances explained by the PCs are additive.

b) When could you use the covariance matrix rather than the correlation matrix?

Solution:
Use the covariance matrix if the original variables are measured on similar scales with similar units. When variables are measured on different scales, the variance of one variable can overwhelm the analysis. Standardising each variable by its standard deviation (that is, using the correlation matrix) corrects for this.

c) The sum of the eigenvalues should equal what?

Solution:
The sum of the eigenvalues should equal the sum of the variances (the diagonal elements) in the covariance matrix, or the sum of the diagonal of the correlation matrix, which equals p, the number of variables. In other words, the total variance of all PCs (the sum of the eigenvalues) should equal the total variance of the original variables.

d) If PCA results based on the correlation matrix of 9 variables find that the first 3 PCs explain 82%, 7% and 2.5% of the variance respectively; only the first PC has an eigenvalue > 1; and the scree plot shows a distinct elbow at PC2 and another smaller elbow at PC3, how many components would you choose?

Solution:
There is no absolutely correct answer; this is a judgement call.
The first 3 PCs explain 91.5% of the total variation, leaving 8.5% to be explained by the remaining 6 PCs, or on average 1.4% per remaining PC.
Although the second PC does not have an eigenvalue > 1, its contribution of 7% to the cumulative variance explained is quite a bit larger than the 2.5% of the 3rd PC and the contributions of the remaining PCs.
This would be the smaller elbow at PC3 on the scree plot.
I would use only two PCs. Although the 3rd PC takes the total variance explained over 90%, its contribution of only 2.5% would make any interpretation a bit vague. Using only 2 PCs makes overall interpretation much easier (especially graphically), while losing very little explanatory power.
My decision might change if the variable loadings showed that a variable not well represented on PC1 or PC2 was very strongly loaded on PC3.

Question 2:

Using the dataset ‘europeemploy.txt’ (from Manly Table 1.5) perform PCA on the correlation matrix using the prcomp function. This process should include:

a) Check the original correlation matrix to get an understanding of the data and the linear relationships between variables.

Solution:
> (cor (ee.prcomp
#eigen values
> ee.prcomp$sdev^2
[1] 3.112258e+00 1.809237e+00 1.496220e+00 1.063444e+00 7.102575e-01
[6] 3.113388e-01 2.934209e-01 2.038164e-01 7.097692e-06
> ## %variance
> (pervar(pervar

The first 4 PCs have eigenvalues > 1 and they cumulatively explain 83.1% of the variation in the original 9 variables. The first PC explains only 34.6%. This means that after fitting the first linear combination of variables (PC1) there is still a lot of variation (65.4%) unexplained. This suggests that in 9-dimensional space the 30 countries are fairly scattered (also reflected in the low correlations).

c) Construct a scree plot. Explain your choice of the number of relevant PCs.

Solution:
> screeplot(ee.prcomp, type="lines")
The elbow at PC2 would suggest using only PC1, which does not explain enough of the overall variance to be useful. Another elbow at PC6 would suggest using the first 5 PCs, but only the first 4 have eigenvalues greater than 1. Adding the 5th PC would raise the variance explained from 83.1% on 4 PCs to 91%, which could be a useful improvement, although more PCs are much harder to interpret. I will go with 4 PCs.

d) Construct the Z equation for PC3. 
Interpret.

Solution:
$$Z_3 = 0.28(\mathrm{AGR}) - 0.52(\mathrm{MIN}) + 0.50(\mathrm{MAN}) + 0.29(\mathrm{PS}) - 0.07(\mathrm{CON}) - 0.07(\mathrm{SER}) + 0.10(\mathrm{FIN}) - 0.36(\mathrm{SPS}) - 0.41(\mathrm{TC})$$
On component 3, MIN and MAN have the largest coefficients in absolute value, although these are only moderate, and they have opposite signs. This component most strongly reflects the contrast between the manufacturing and mining industries. AGR, MAN, PS and FIN all have positive coefficients of varying size, while all other variables have negative coefficients (again of varying size).

e) Produce a biplot of the first two PCs. Interpret. Explain any differences between your ordination plot and Manly Figure 6.2.

Solution:
> biplot(ee.prcomp, cex=c(0.7,0.7))
On PC1, countries with high employment in Agriculture (AGR) and Mining (MIN), such as Albania, Turkey and Czech, are in contrast to those higher in Social and Personal Services (SPS) and Power and water supplies (PS). This PC accounts for 34.6% of the overall variance in the data.
On PC2, countries with high employment in services (SER) and finance (FIN), such as Gibraltar, contrast with those with high employment in manufacturing (MAN) and transport and communication (TC), such as Yugoslavia (former) and Malta. This PC accounts for 20.1% of the overall variance in the data.
The sign of the country scores on PC2 is opposite to that in Manly Figure 6.2. This does not matter, as their relative positions are maintained (Albania and Romania at opposite extremes of PC2).

f) Produce a biplot of the first and third PC. How does your R code need to change to display non-consecutive PCs?

Solution:
> biplot(ee.prcomp, choices=c(1,3), cex=c(0.7,0.7))
When the PCs are not consecutive, the choices argument must list them explicitly, choices=c(1,3), rather than use a range such as choices=1:2 (the default) or choices=3:4.

Question 3:

Complete Exercise 2 at the end of Chapter 6 of Manly using the data file ‘protein.txt’. 
The data are the protein consumption (grams/person/day) from 9 different sources for 25 European countries.

a) Base your PCA on the correlation analysis using the prcomp function. You will first need to isolate the variables to be included in PCA.

Solution:
> pro
> str(pro)
'data.frame':   25 obs. of  11 variables:
 $ Country: Factor w/ 25 levels "Albania","Austria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Rmeat  : int 10 9 14 8 10 11 8 10 18 10 ...
 $ Wmeat  : int 1 14 9 6 11 11 12 5 10 3 ...
 $ Eggs   : int 1 4 4 2 3 4 4 3 3 3 ...
 $ Milk   : int 9 20 18 8 13 25 11 34 20 18 ...
 $ Fish   : int 0 2 5 1 2 10 5 6 6 6 ...
 $ Cereals: int 42 28 27 57 34 22 25 26 28 42 ...
 $ starch : int 1 4 6 1 5 5 7 5 5 2 ...
 $ nuts   : int 6 1 2 4 1 1 1 1 2 8 ...
 $ FV     : int 2 4 4 4 4 2 4 1 7 7 ...
 $ Total  : int 72 86 89 91 83 91 77 91 99 99 ...
> pro1
> cor(pro1)
              Rmeat       Wmeat        Eggs       Milk        Fish
Rmeat    1.00000000  0.18850977  0.57532001  0.5440251  0.06491072
Wmeat    0.18850977  1.00000000  0.60095535  0.2974816 -0.19719960
Eggs     0.57532001  0.60095535  1.00000000  0.6130310  0.04780844
Milk     0.54402512  0.29748163  0.61303102  1.0000000  0.16246239
Fish     0.06491072 -0.19719960  0.04780844  0.1624624  1.00000000
Cereals -0.50970337 -0.43941908 -0.70131040 -0.5924925 -0.51714759
starch   0.15383673  0.33456770  0.41266333  0.2144917  0.43868411
nuts    -0.40988882 -0.67214885 -0.59519381 -0.6238357 -0.12226043
FV      -0.06393465 -0.07329308 -0.16392249 -0.3997753  0.22948842
            Cereals     starch       nuts          FV
Rmeat   -0.50970337  0.1538367 -0.4098888 -0.06393465
Wmeat   -0.43941908  0.3345677 -0.6721488 -0.07329308
Eggs    -0.70131040  0.4126633 -0.5951938 -0.16392249
Milk    -0.59249246  0.2144917 -0.6238357 -0.39977527
Fish    -0.51714759  0.4386841 -0.1222604  0.22948842
Cereals  1.00000000 -0.5781345  0.6360595  0.04229293
starch  -0.57813449  1.0000000 -0.4951880  0.06835670
nuts     0.63605948 -0.4951880  1.0000000  0.35133227
FV       0.04229293  0.0683567  0.3513323  1.00000000

There is a large range of initial correlations, none very high but a few above 0.6 in absolute value, 
so the analysis may be worthwhile.
> pro.prcomp loadings(loadings
> pro.prcomp$sdev^2
[1] 4.0955365 1.6249031 1.0853237 0.9050170 0.4267377 0.3469402
[7] 0.2695240 0.1345226 0.1114953
> (pervar (pervar
> screeplot(pro.prcomp, type="lines")

b) How many PCs should be considered based on the scree plot, eigenvalues and total variance methods?

Solution:
The first 3 PCs have eigenvalues > 1 and together explain 75.6% of the variance. Four PCs would explain 85.7%, and 5 PCs would be needed to explain just over 90% (90.4%). Five PCs out of 9 isn’t a bad reduction in dimensionality, but it is still difficult to interpret. The scree plot shows elbows at 2, 3 and 5, which suggest using 1, 2 or 4 PCs respectively. The first PC alone does not explain enough variance (45.5%).
I would choose 4 PCs.

c) Explain the relationships between Albania and Ireland and between Portugal and Bulgaria from the first 2 PCs. Try using biplots. What is a limitation inherent in your interpretation?

Solution:
> biplot(pro.prcomp, cex=c(0.7,0.7)) #ordination plot labels = row numbers
> biplot(pro.prcomp, cex=c(0.7,0.7), xlabs=pro$Country) #ordination plot labels = country names
Look back at the original data to see that Albania=1, Ireland=12, Portugal=17 and Bulgaria=4. Also look at the values for each original variable.
>(pro2
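
The eigenvalue checks used in Questions 1 c), 2 and 3 (sum of eigenvalues, percentage of variance, cumulative percentage, eigenvalue > 1) can be sketched in a few lines of R. This is a minimal sketch rather than the tutorial's own code: the eigenvalues are copied from the ee.prcomp$sdev^2 output shown for Question 2, and summarise_eigen is a hypothetical helper name introduced here for illustration.

```r
# Eigenvalues copied from the printed ee.prcomp$sdev^2 output (Question 2)
ee_eig <- c(3.112258e+00, 1.809237e+00, 1.496220e+00, 1.063444e+00,
            7.102575e-01, 3.113388e-01, 2.934209e-01, 2.038164e-01,
            7.097692e-06)

# summarise_eigen() is a hypothetical helper, not part of the tutorial code.
# It tabulates each eigenvalue with its share of the total variance.
summarise_eigen <- function(eig) {
  pct <- 100 * eig / sum(eig)              # % of total variance per PC
  data.frame(
    PC         = seq_along(eig),
    eigenvalue = eig,
    pct_var    = round(pct, 1),
    cum_pct    = round(cumsum(pct), 1),    # cumulative % of variance
    eig_gt_1   = eig > 1                   # "eigenvalue greater than 1" rule
  )
}

ee_tab <- summarise_eigen(ee_eig)

# For a correlation-matrix PCA the eigenvalues sum to p, the number of variables
total_variance <- sum(ee_eig)

print(ee_tab)
print(total_variance)
```

The same helper could be applied to pro.prcomp$sdev^2 in Question 3; the table it returns is only a convenience for reading off the eigenvalue and cumulative-variance criteria side by side, and the choice of how many PCs to keep remains a judgement call.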

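
Question 2 d) and e) rely on two facts: a PC score is a linear combination of the standardised variables weighted by the loadings, and the sign of a component is arbitrary. The sketch below illustrates both. It assumes USArrests, a data set shipped with R, as a stand-in because the tutorial data files are not reproduced here; the object names (pca, z, pc3_manual and so on) are illustrative only.

```r
# USArrests is a stand-in data set shipped with R; it is NOT the tutorial data
pca <- prcomp(USArrests, scale. = TRUE)   # PCA on the correlation matrix

# Standardise the variables the same way prcomp(scale. = TRUE) does
z <- scale(USArrests)

# Scores on PC3 written out as a linear combination of standardised variables:
# Z3 = a_31 * x_1 + a_32 * x_2 + ... with a_3j taken from pca$rotation[, 3]
pc3_manual <- as.vector(z %*% pca$rotation[, 3])

# These should match the scores stored by prcomp (up to rounding error)
max_diff <- max(abs(pc3_manual - pca$x[, 3]))

# Flipping the sign of every loading flips the scores ...
pc3_flipped <- as.vector(z %*% (-pca$rotation[, 3]))

# ... but leaves the variance of the component unchanged
same_variance <- isTRUE(all.equal(var(pc3_manual), var(pc3_flipped)))

# ... and leaves the distances between observations unchanged
same_distances <- isTRUE(all.equal(as.vector(dist(pc3_manual)),
                                   as.vector(dist(pc3_flipped))))

print(max_diff)
print(c(same_variance = same_variance, same_distances = same_distances))
```

This is why a biplot whose axis is mirrored relative to a published figure still carries the same information. With the same object, biplot(pca, choices = c(1, 3)) would plot PC1 against PC3.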