STA3200Multivariate Statistical MethodsWeek 5: Principal ComponentsAnalysis (PCA)Reading• Manly, Bryan F.J. Multivariate Statistical Methods: A Primer,Third Edition, CRC Press, 07/2004.– Chapter Six: Principal components analysisManly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.Data Datasets:– sparrows.txt R code:– PCA by calculation.R– sparrow PCA.RObjectives understand the principles underlying principal componentsanalysis give a geometric interpretation of the principal componentsmethod compute principal components from given data using R; select an appropriate number of principal components usingsuitable techniques make sensible interpretations of the principal componentswhere possible compute the principal components scores for each subject conduct a spatial PCAManly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.Principal Components Analysis PCA is one of the simplest multivariate techniques. PCA is a data reduction technique, one that reduces thedimensions of the data This is possible if the variables are correlated; PCA attempts to find a new coordinate system for thedata; PCA can be thought of as a technique to identify patternsof simultaneous variation. The graphical or visual representation of reduced dimensionsis Ordination The graphical representation of PCA using bi-plots is anordination method. In climatology and related sciences, numerous variables arecorrelated, so PCA is a commonly used technique. PCA is also called empirical orthogonal function analysis(EOF) or sometimes empirical eigenvector analysis (EEA).Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.Principal Components Analysis PCA finds linear combinations of the original X variables tocreate a new set of Z variables or components. The first principle component is chosen so that it accounts foras much of the variation in the data as possible. Each subsequent component explains some additional (butincreasingly less) variation not include in Z1. Although the X variables are correlated on each newcomponent Z, the new components themselves areorthogonal (uncorrelated). The lack of correlation between derived components meansthat the are describing different aspects or “dimensions” ofthe original data. If there are p variables then there are p possiblecomponents.Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.Z a X a X a X 1 11 1 12 2 1 p pPrincipal Components Analysis When using PCA we hope that we can describe the majority ofthe original variation using fewer components than originalvariables (dimension reduction). If the original variables are uncorrelated, PCA should not beused. PCA relies on the eigenvalues from the covariance matrix, C. Variances of X variables = diagonal of C matrix. Variances of PCA components = eigenvalues of C (rememberthere are as many potential components as original variables). The sum of the eigenvalues = sum of the C variances, so allPCA components account for all possible variance in the data. Before looking at the steps to PCA consider a geometricapproach to help visualise the underlying maths.Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.A geometric interpretation in 2D(a) original data – a strongtrend in the SW-NEdirection;(b) two orthogonalprincipal components –first in the SW-NEdirection to explainmost variance;(c) one particular pointmapped to the newcoordinates;(d) most of the originalvariation in the datacan be explained bythe first principalcomponent only.A geometric interpretation in 2D Using just the first principal component reduces thedimension of the problem from two to one, with only a smallloss of information. We wouldn’t really need to reduce dimensions if we only hadtwo variables; panel (a) is easy enough to interpret; but thissimple example can be extrapolated to a scenario withseveral variables that we would benefit from reducing tofewer dimensions – hopefully two for plotting ease.Steps to PCA Manly, Chapter 6.2 outlines the 4 steps to PCA:1. Start by coding the variables X1, X2, …, Xp to have zeromeans (centre) and variances =1 (standardise). Sometimesonly centring is done where it is thought that the importanceof variables is reflected in their variances.2. Calculate the covariance matrix, C, or the correlation matrix,if the centring and standardising have been done in step 1.3. Find the eigenvalues λ1, λ2, …, λp and the correspondingeigenvectors a1, a2, …, ap. The coefficients of the ith principalcomponent are then the elements of ai, while λi is itsvariance.4. Discard components that account for only a small proportionof the variation in the data, e.g. if the first three componentsaccount for 90% of the total variance from 20 originalvariables it is reasonable to ignore the other components.Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.PCA by calculation The correlation matrix has variances =1 because thecorrelation calculations centre and standardise the data. Calculations using the covariance matrix:> testd (covmat (cormat # Centre the data> (means (means2 (ctestd #calculate the covariance matrix from centred values> XtX (C #OR simply> (C1 #the cov matrix of the centred data is = covariance matrix of the original data because> #centring is part of the calculation of C matrix, Manly page 23.> # calculate eigen values and scores> (covmat (es (scores (es (scores # Centre and scale and calculate correlation matrix> (st.devs cstestd cstestd[,1] cstestd[,2] cstestd[,1] [,2][1,] 0.8660254 -0.7071068[2,] 0.8660254 0.0000000[3,] -0.8660254 -0.7071068[4,] -0.8660254 1.4142136> (cor1 #the cor matrix of the centred and standardised> #data is = cor matrix of the original> #data because centring and standardising the> #variance is part of calculation of> #the oorrelation matrix. Variances are=1> #i.e unit variances> # calculate eigen values and scores> (es1 (scores (es1 (es sp cor(sp)Length Extent Head Humerus SternumLength 1.0000000 0.7349642 0.6618119 0.6269482 0.6051247Extent 0.7349642 1.0000000 0.6737411 0.7621451 0.5290138Head 0.6618119 0.6737411 1.0000000 0.7184943 0.5262701Humerus 0.6269482 0.7621451 0.7184943 1.0000000 0.5787743Sternum 0.6051247 0.5290138 0.5262701 0.5787743 1.0000000> sp.prcomp sp.prcompStandard deviations:[1] 1.8911092 0.7317800 0.6155176 0.5721479 0.4266015Rotation:PC1 PC2 PC3 PC4 PC5Length 0.4548793 -0.06760175 0.7340681 0.23424318 0.4413490Extent 0.4662631 0.30512343 0.2671031 -0.47737764 -0.6247119Head 0.4494628 0.29277283 -0.3470235 0.73389847 -0.2307272Humerus 0.4635108 0.22746613 -0.4772988 -0.41989524 0.5738386Sternum 0.3985280 -0.87457014 -0.2038638 -0.04818454 -0.1800565> names(sp.prcomp)[1] “sdev” “rotation” “center” “scale” “x”PCA in R prcomp function The eigenvectors or variable loadings are called Rotation. The loadings tell us how highly each variable is correlatedwith the PC. These loadings are equivalent (although slightlydifferent) as those given in Manly Table 6.3 (table istransposed). The first PC is almost equally loaded for eachvariable; it therefore measures the general size of the bird. The second PC is highly loaded with the sternum length, notvery loaded with length, and equally loaded for the rest. It isnot easy to interpret, but perhaps is simply a measure ofsternum length. Negative loadings mean that the particularvariable is negatively correlated with the PC. The third PC has a high loading for length; perhaps it is alength PC. The fourth is a measure of head size; the fifth the contrastbetween extent and humerus.Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.PCA in R prcomp function The sum of the standard deviations does not equal the sumof the diagonal elements of the correlation matrix (=5) sowe know these values are not the correct eigenvalues. The eigenvalues = total variance so square the standarddeviations: So PC1 accounts for 3.58/5=77% of the variation in sparrowmeasurements and the first 2 PCs account for 82.5%. Only the first PC has an eigenvalue above 1 (the average ofall 5) When constructing a scree plot to visualise the eigenvalues,the standard deviations are automatically converted.Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.> 1.8911092 +0.7317800 +0.6155176 +0.5721479 +0.4266015[1] 4.237156> sp.prcomp$sdev^2[1] 3.5762941 0.5355019 0.3788619 0.3273533 0.1819888> 3.5762941 +0.5355019 +0.3788619 +0.3273533 +0.1819888[1] 5> screeplot(sp.prcomp)> screeplot(sp.prcomp, type=”lines”)PCA in R prcomp function Scree plots: look for a sharp elbow indicating a distinctchange in the additional contribution of further PCs In this case from PC2 on, each additional PC contributesincreasingly small amounts and much less than PC1explained variance. Using eigenvalues>1 rule and the scree plot, we shouldconsider only the first PC. Using total variance rule (90%variance explained), we may need to consider the first 3 or 4PCs.Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.PCA in R prcomp function It can be helpful to visualise the loadings (rotation matrix)on each PCManly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.> load = sp.prcomp$rotation> sorted.loadings = load[order(load[,1]),1]> Main=”Loadings Plot for PC1″> xlabs=”Variable Loadings”> dotchart(sorted.loadings,main=Main,xlab=xlabs,cex=1.5,col=”red”)> # DotPlot PC2> sorted.loadings = load[order(load[,2]),2]> Main=”Loadings Plot for PC2″> xlabs=”Variable Loadings”> dotchart(sorted.loadings,main=Main,xlab=xlabs,cex=1.5,col=”red”) All variables are fairly equally loaded onthe first PC, (sternum a bit less than theothers) it therefore measures thegeneral size of the bird. A PC is a linear combination ofvariables; so on PC2 Sternum isstrongly negatively correlated with thePC while Extent, Head and Humerus areweakly positively correlated with the PC.PCA in R prcomp function We can also show graphically the relationship amongindividuals based on their component scores (ordination)and the original variables using a biplot. The ordination plot follows Manly Figure 6.1 withoutdistinction between survivors and non-survivors. The vector arrows indicate direction of increasing values inoriginal variables.Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.> biplot(sp.prcomp,cex=c(1,0.7))> biplot(sp.prcomp,choices=3:4, cex=c(1,0.7))> #### Note: cex=number indicating the amount by which plotting text and symbols> # should be scaled relative to the default. 1=default, 1.5 is 50% larger,> # 0.5 is 50% smaller, etc. Use top andright axes forvectorinterpretationand bottomand left axesfor individualscores on PCs.PCA in R prcomp function On PC1 all variables (vectors) are positively correlated (topaxis); and on PC2 sternum and length are negativelycorrelated (right axis). Bird 31 stands out from the group on PC2 (left axis) with alarge negative score. Sternum is negatively correlated withPC2 and bird 31 has a high loading. Looking back at thedata this bird had the largest sternum of all birds. On PC1 (bottom axis) bird 31 is about average (all othervariables).Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.PCA in R prcomp function We can look at a table of individual scores for all 49sparrows: You will notice that the scores on PC1 and PC2 for bird 31are (0.17,-2.83) but these coordinates do not match theordination plot (bottom left) which are approximately(0.01, -0.03); and the vector coordinates do not match theloadings in the rotation output matrix. the values in $x and $rotation will not match the biplotbecause raw scores are not plotted.Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.> (scores scores n step1 step2 yy xx biplot(xx,yy)Rotation of PCs One constraint on the PCs is that they must be orthogonal.We can test the correlation: Effectively zero correlation between PCs. Some authors argue that this constraint limits how well thePCs can be interpreted. If the physical interpretation of the PCs is more importantthan data reduction, some authors argue that theorthogonality constraint should be relaxed to allow betterinterpretation. This is called rotation of the PCs and many methods existfor rotation. Rotation changes the coordinates of variables on the PCs tomaximise the sum of variance of the squared loadings. Itsgoal is to ‘clean up’ or emphasise differences in the originalrotations/loadingsManly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.> cor(sp.prcomp$x[,1],sp.prcomp$x[,2])[1] 1.727662e-16Rotation of PCs cont. There are many arguments against rotation of PCs andaccordingly, R does not explicitly allow for PCs to berotated, but it can be accomplished using functionsdesigned to be used in factor analysis (where rotations areprobably the norm rather than the exception). The purpose of rotation of the PCs is generally to `cluster’the PCs together. We will look at his topic again when we cover factoranalysis.Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.PCA in R using princomp function How do the results from princomp function differ toprcomp? We get the same standard deviations that we would squareto get the eigenvalues. The variable loadings are the same values but differentsigns. This not a cause for concern. Within each componentall variables that were +ve are now –ve so the relativerelationships are maintained.Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.> (sp.princomp names(sp.princomp)[1] “sdev” “loadings” “center” “scale” “n.obs” “scores” “call”> sp.princomp$loadingsLoadings:Comp.1 Comp.2 Comp.3 Comp.4 Comp.5Length -0.455 0.734 0.234 -0.441Extent -0.466 0.305 0.267 -0.477 0.625Head -0.449 0.293 -0.347 0.734 0.231Humerus -0.464 0.227 -0.477 -0.420 -0.574Sternum -0.399 -0.875 -0.204 0.180
