Multidimensional Scaling

STA3200 Multivariate Statistical Methods
Week 11: Multidimensional Scaling (MDS)

Reading
• Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.
  – Chapter Eleven: Multidimensional Scaling

Data
• Datasets:
  – roads.txt
  – congressman.txt
• R code:
  – roads MDS.R
  – congressman.R
• R package to install: vegan

Objectives
• Understand the principles underlying MDS and how it relates to other methods covered in this course;
• compute MDS plots using R;
• understand the need for measures of Goodness of Fit and/or STRESS.

Introduction
• Multi-Dimensional Scaling (MDS) is a distance-based ordination technique used to reduce dimensions.
• It tries to represent the dissimilarity (distance) between multivariate observations (the objects/cases) in a lower-dimensional space by an iterative optimisation technique.
• The distances are mapped to ordination space, rather than using a dendrogram as in cluster analysis.
• Table 11.1 and Figure 11.2 in Manly give an example of how distances can be mapped.
• As more objects are included it becomes increasingly difficult to map the objects accurately in 2D space.
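A small sketch of why more objects are harder to map, using only base R: n objects that are all the same distance apart need n - 1 dimensions to be drawn without distortion, so three of them fit on a plane but four do not. The helper names `equidistant` and `dims_needed` are made up for this illustration, and classical scaling (`cmdscale`) is used here only as a convenient way to count dimensions, not as the iterative method described in these slides.

```r
# n objects that are all exactly 1 unit apart from each other
equidistant <- function(n) as.dist(1 - diag(n))

# Dimensions needed for an exact map: the number of clearly positive
# eigenvalues returned by classical scaling.
dims_needed <- function(d, tol = 1e-8) {
  ev <- cmdscale(d, k = 1, eig = TRUE)$eig
  sum(ev > tol)
}

dims_needed(equidistant(3))  # 2: a triangle fits on a plane
dims_needed(equidistant(4))  # 3: a tetrahedron does not
```

With real data the distances are rarely this awkward, but the same effect means a 2D map usually has to distort some of them.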
Conceptual understanding of MDS
• MDS reduces n-dimensional space to a lower number of dimensions (2 or 3) while keeping the distances between cases/objects (often sites) as close as possible to the "true" distances. Here n is the number of original variables; the Procedure slides call this t.
• Ideally the relationships described by the "true" distance matrix in n dimensions should be the same as those in the "new" distance matrix in the lower number of dimensions (2 or 3).
• The difference between the "new" distance matrix (low-dimensional space) and the "true" distance matrix (n-dimensional space) is a measure of the effectiveness of the MDS process.
• This measure, the lack of fit or loss of information when converting an n-dimensional data set to a lower (2 or 3) dimensional set, is called Stress.
• The closer the stress is to zero, the closer the fit of the MDS model is to the original distance matrix (i.e. the less information lost, the lower the stress value).

Procedure for MDS
• The process is described by Manly in Chapter 11.2.
• We start with a data set of n objects (cases/rows) and t dimensions, where t is the number of variables (x1, x2, ..., xt).
• We can view the original variables as a coordinate system for each object in t-dimensional space.
• We calculate the distance matrix between all objects based on the t dimensions, δ_ij.

[Figure: data matrix with objects A, B, C, D, ..., n as rows and variables X1, X2, ..., Xt as columns, and the resulting distance matrix with objects A, B, C, D, ..., n as both rows and columns.]

Procedure for MDS
• We nominate the number of dimensions we would like to map the objects to (2 or 3). We then need to find a set of coordinates in these reduced dimensions that matches the distance matrix between objects as closely as possible.
• It is unlikely that we will be able to reduce the dimensions without some changes to the distance matrix, so in effect we find a new coordinate system and a new distance matrix.
• Ideally the "relationships" described by the "true" distance matrix in t dimensions should be the same as the relationships in the "new" distance matrix in the lower number of dimensions (2 or 3), even if the distances themselves are not exactly the same.

Procedure for MDS
• It can help to imagine the objects, when described by the original variables and distance matrix, as sitting on a sphere. If we flatten that sphere so that the objects rest on a 2D plane, the objects will need to shift around a bit and therefore the distances will change. If we increase the distance slightly between two objects, how does that affect the distances of these objects from the others?
• We could come up with multiple solutions, some leading to a large change in the distance matrix and others less so.
• So the process is iterative: at each iteration the difference between the original and new distance matrices is reduced, until the difference converges to a stable value. This difference is called the STRESS.

Stress
• If stress doesn't vary much across all iterations then you have a strong global solution.
• Stress ranges from 0 to 1.

Stress           Interpretation
> 0.30           No better than random
0.25 to 0.30     Poor: better do something else
0.20 to 0.25     Wish it was smaller: global may be OK
0.15 to 0.20     OK: global and most local distances are reasonable
0.10 to 0.15     Good
0.05 to 0.10     Great
< 0.05           Excellent: you probably only have 2 or 3 dependent variables (not much need for dimension reduction)
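The "flattened sphere" picture can be tried directly. The sketch below simulates points on a sphere (the data are simulated, not from Manly), computes the "true" 3D distance matrix, finds 2D coordinates and compares the "new" distances with the old ones. As an assumption made for brevity, the 2D coordinates come from classical scaling (`cmdscale`), a one-step method, rather than from the iterative procedure described in these slides.

```r
set.seed(1)                                  # reproducible simulated data
n   <- 30
xyz <- matrix(rnorm(n * 3), ncol = 3)
xyz <- xyz / sqrt(rowSums(xyz^2))            # push every point onto the unit sphere

delta <- dist(xyz)                           # "true" distances in 3 dimensions
conf  <- cmdscale(delta, k = 2)              # 2D coordinates (classical scaling)
d     <- dist(conf)                          # "new" distances in 2 dimensions

shrink <- as.vector(delta) - as.vector(d)    # change in each pairwise distance
summary(shrink)
cor(as.vector(delta), as.vector(d))          # how well the relationships survive
```

With this particular method the flattening is a projection, so no distance gets longer; some pairs barely move while others shrink a lot, which is the information loss that stress summarises.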
Procedure for MDS
• In most cases Euclidean distances are used to define the true and new distance matrices.
• There are different algorithms for calculating the new distance matrix.
• Manly Chapter 11.2 describes one approach based on a regression procedure:
  – Define the original positions of the objects in multidimensional space.
  – Specify the number m of reduced dimensions (typically 2).
  – Construct an initial configuration of the samples in m dimensions.
  – Regress the distances in this initial configuration, d_ij, against the observed (measured) distances, δ_ij.
  – Determine the stress (the disagreement between the configuration distances d_ij and the values \hat{d}_ij predicted by the regression).
  – If stress is high, reposition the points in m dimensions in the direction of decreasing stress, and repeat until stress is below some threshold.

Procedure for MDS
• The new distance matrix can then be seen as a linear (or non-linear, depending on the regression model used) transformation of the original distance matrix plus a matrix of errors.
• The iterative process then tries to minimise the sum of squared errors to find the best solution.
• Kruskal's Stress 1 equation is:

$$\text{Stress 1} = \left[ \frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2} \right]^{1/2}$$

  where d_ij are the distances in the reduced configuration and \hat{d}_ij are the values predicted by the regression on the original distances δ_ij.
• There are both metric and nonmetric forms of MDS.
  – Metric: the configuration distances and the original data distances are related by a linear or polynomial regression.
  – Nonmetric: a monotonic regression is used, which considers only the ordering of the distances. Generally, the greater flexibility of nonmetric scaling should make it possible to obtain a better low-dimensional representation of the data.

Procedure for MDS
• To find a lower (better) stress value:
  – Increase the number of dimensions in the new solution.
  – High stress may be caused by a single object (case) which is completely different from all other cases: remove it.
  – If the original variables are based on presence/absence data or count data, such as species counts, there may be a large number of rare species causing a low "signal to noise ratio": remove them.
  – Use non-metric MDS, which uses ranked distances (ordinal scale) instead of the original metric distances. In ecological data this is useful when sites do not have a lot of species in common.
• MDS does not rely on most common assumptions (multivariate normality, etc.). The only assumptions are that the number of dimensions cannot exceed the number of objects minus one (which also means at least three variables must be entered in the model) and that at least two dimensions must be specified.

MDS and CA
• MDS and CA are both based on distances and neither requires a priori groups.
• They differ in how they use the distance matrix to reduce dimensions.
  – CA sequentially pairs objects and recalculates distances based on the given clustering algorithm (nearest neighbour etc.). The objects/cases/samples are then displayed using the 2D dendrogram.
  – MDS tries to represent the distance matrix on a 2D ordination plot, i.e. spatially. All distances are altered (subtly, we hope) to allow the closest approximation of the true distances to be plotted.

MDS and PCA
• A metric MDS based on Euclidean distances and a linear regression process (a linear MDS) should produce results very similar to a PCA, which also reduces dimensions by finding a linear approximation of the original variables in lower dimensions.
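The link between metric MDS and PCA can be checked numerically. The sketch below is an illustration on R's built-in `iris` measurements (not one of the course data sets) and uses classical scaling via `cmdscale` as the metric MDS. For Euclidean distances this particular form of metric MDS reproduces the principal component scores exactly, apart from the arbitrary sign of each axis.

```r
X <- as.matrix(iris[, 1:4])        # built-in data, four numeric variables

mds <- cmdscale(dist(X), k = 2)    # metric (classical) MDS on Euclidean distances
pca <- prcomp(X)$x[, 1:2]          # scores on the first two principal components

# Compare absolute values, because each axis may be flipped in sign.
max_diff <- max(abs(abs(mds) - abs(pca)))
max_diff                           # effectively zero
```

An iterative metric MDS, or one based on a non-Euclidean distance, would be expected to give similar but not identical coordinates.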
MDS in R
• Manly Example 11.1: road distances in New Zealand.
• Road distances for 13 towns will be used to map the towns. In this example we are not trying to reduce the dimensionality of the data but instead to see how the distance matrix can be used to reproduce a 2D coordinate system to map the objects.
• Start with the distance matrix:

> rd
             Alexandra Balcluha Blenheim Christchurch Dunedin FranzJosef Greymouth
Alexandra            0      100      485          284     126        233       347
Balcluha           100        0      478          276      50        493       402
Blenheim           485      478        0          201     427        327       214
Christchurch       284      276      201            0     226        247       158
Dunedin            126       50      427          226       0        354       352
FranzJosef         233      493      327          247     354          0       114
Greymouth          347      402      214          158     352        114         0
Invercargill       138       89      567          365     139        380       493
Milford            248      213      691          489     263        416       555
Nelson             563      537       73          267     493        300       187
Queenstown          56      156      494          305     192        228       341
TeAnau             173      138      615          414     188        366       480
Timaru             197      177      300           99     127        313       225
             Invercargill Milford Nelson Queenstown TeAnau Timaru
Alexandra             138     248    563         56    173    197
Balcluha               89     213    537        156    138    177
Blenheim              567     691     73        494    615    300
Christchurch          365     489    267        305    414     99
Dunedin               139     263    493        192    188    127
FranzJosef            380     416    300        228    366    313
Greymouth             493     555    187        341    480    225
Invercargill            0     174    632        118     99    266
Milford               174       0    756        178     75    377
Nelson                632     756      0        572    681    366
Queenstown            118     178    572          0    117    230
TeAnau                 99      75    681        117      0    315
Timaru                266     377    366        230    315      0

MDS in R
• We can't simply plot this distance matrix in 2D space: what would be the x and y axes?
• We need the x and y axes to represent a coordinate system (like map coordinates).
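One way to see a distance matrix turned into x and y coordinates is sketched below. The road distances are typed in from the printed matrix so that the block runs without roads.txt, and classical scaling (`cmdscale`) is used as a quick stand-in: it is an assumption of this sketch, not the non-metric method used for this example in the lecture. The object name `xy` is made up for the illustration.

```r
towns <- c("Alexandra", "Balcluha", "Blenheim", "Christchurch", "Dunedin",
           "FranzJosef", "Greymouth", "Invercargill", "Milford", "Nelson",
           "Queenstown", "TeAnau", "Timaru")

# Road distances copied row by row from the printed matrix
rd <- matrix(c(
    0, 100, 485, 284, 126, 233, 347, 138, 248, 563,  56, 173, 197,
  100,   0, 478, 276,  50, 493, 402,  89, 213, 537, 156, 138, 177,
  485, 478,   0, 201, 427, 327, 214, 567, 691,  73, 494, 615, 300,
  284, 276, 201,   0, 226, 247, 158, 365, 489, 267, 305, 414,  99,
  126,  50, 427, 226,   0, 354, 352, 139, 263, 493, 192, 188, 127,
  233, 493, 327, 247, 354,   0, 114, 380, 416, 300, 228, 366, 313,
  347, 402, 214, 158, 352, 114,   0, 493, 555, 187, 341, 480, 225,
  138,  89, 567, 365, 139, 380, 493,   0, 174, 632, 118,  99, 266,
  248, 213, 691, 489, 263, 416, 555, 174,   0, 756, 178,  75, 377,
  563, 537,  73, 267, 493, 300, 187, 632, 756,   0, 572, 681, 366,
   56, 156, 494, 305, 192, 228, 341, 118, 178, 572,   0, 117, 230,
  173, 138, 615, 414, 188, 366, 480,  99,  75, 681, 117,   0, 315,
  197, 177, 300,  99, 127, 313, 225, 266, 377, 366, 230, 315,   0
), nrow = 13, byrow = TRUE, dimnames = list(towns, towns))

stopifnot(isSymmetric(rd))           # guard against typing errors

xy <- cmdscale(as.dist(rd), k = 2)   # 13 x 2 matrix of map-like coordinates
colnames(xy) <- c("x", "y")
round(xy)

if (interactive()) {                 # draw the map when run at the console
  plot(xy, type = "n", asp = 1)
  text(xy, labels = towns, cex = 0.7)
}
```

The axes have no compass meaning: the map may come out rotated or mirrored relative to a real map of New Zealand, because distances alone do not fix orientation.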
MDS in R
• Manly Example 11.1: road distances in New Zealand.
• We use the vegan package and its monoMDS function, which requires a distance matrix and k, the number of dimensions.
• Manly used a non-metric monotone regression process (this assumes that as the observed distances increase, the estimated distances stay the same or increase).
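To make the monotone regression and Kruskal's Stress 1 concrete, the sketch below fits a non-metric MDS and recomputes the stress by hand. Two substitutions are assumptions of this sketch: `isoMDS` and `Shepard` from the MASS package stand in for vegan's monoMDS, and R's built-in `eurodist` road distances stand in for roads.txt, so that the block is self-contained.

```r
library(MASS)                          # isoMDS() and Shepard()

d   <- eurodist                        # built-in road distances, 21 European cities
fit <- isoMDS(d, k = 2, trace = FALSE) # non-metric MDS in 2 dimensions

# Shepard() returns the original distances (x), the configuration
# distances (y) and the monotone-regression fitted values (yf).
sh <- Shepard(d, fit$points)

# Kruskal's Stress 1: sqrt( sum (d_ij - dhat_ij)^2 / sum d_ij^2 )
stress1 <- sqrt(sum((sh$y - sh$yf)^2) / sum(sh$y^2))

c(by_hand = stress1, isoMDS = fit$stress / 100)  # isoMDS reports stress in percent
```

Plotting `sh$x` against `sh$y` and overlaying `sh$yf` as a step line gives the Shepard diagram, which shows how far the fitted distances depart from a monotone function of the original ones.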
> library(vegan)
Loading required package: permute
Loading required package: lattice
This is vegan 2.3-5
> m <- monoMDS(dist = rd, k = 2)
> m

Call:
monoMDS(dist = rd, k = 2)

Non-metric Multidimensional Scaling

13 points, dissimilarity ‘unknown’

Dimensions: 2
Stress:     0.03644199
Stress type 1, weak ties
Scores scaled to unit root mean square, rotated to principal components
Stopped after 55 iterations: Stress nearly unchanged (ratio > sratmax)
> plot(m)

MDS in R
• The resulting plot (map) of towns is not exactly the same as Manly Figure 11.4, but the ordination plot shows a similar relationship between towns in 2D ordination space.
• The stress is very low (0.036) because the original distance matrix was based on distances measured in 2D (the distance matrix was not calculated from several variables measured on different scales).

> library(vegan)
> m

Call:
monoMDS(dist = rd, k = 2)

Non-metric Multidimensional Scaling

13 points, dissimilarity ‘unknown’

Dimensions: 2
Stress:     0.036442
Stress type 1, weak ties
Scores scaled to unit root mean square, rotated to principal components
Stopped after 69 iterations: Stress nearly unchanged (ratio > sratmax)

MDS in R
• Manly Example 11.2: voting behaviour of congressmen.
• The data matrix holds the count of voting disagreements between congressmen, a form of distance between congressmen in their opinions on environmental matters.
• Use the function cmdscale for metric MDS. Different distance measures will give different results. Changing the number of dimensions does not change the coordinates; it simply truncates the output.
• Instead of STRESS, a goodness-of-fit (GoF) measure on [0, 1] is reported; higher values indicate better fits. cmdscale obtains both the coordinates and the GoF from an eigenvalue analysis of the distance matrix, not from the iterative procedure described previously.
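Both points about cmdscale can be checked in a few lines. The sketch below uses R's built-in `eurodist` distances as a stand-in for the congressmen data (an assumption, so that the block runs on its own). It confirms that asking for more dimensions leaves the earlier coordinates untouched, and it recomputes the two GoF values from the eigenvalues following the definition in the `cmdscale` help page. The helper name `gof_by_hand` is made up for this illustration.

```r
d    <- eurodist                          # built-in distance matrix, 21 cities
fit2 <- cmdscale(d, k = 2, eig = TRUE)    # 2 dimensions
fit3 <- cmdscale(d, k = 3, eig = TRUE)    # 3 dimensions

# Asking for a third dimension only appends a column.
same_coords <- all.equal(fit2$points, fit3$points[, 1:2],
                         check.attributes = FALSE)

# GoF = sum of the first k eigenvalues divided by
#       (1) the sum of absolute eigenvalues, (2) the sum of positive eigenvalues.
gof_by_hand <- function(ev, k) {
  c(sum(ev[1:k]) / sum(abs(ev)),
    sum(ev[1:k]) / sum(pmax(ev, 0)))
}

same_coords
rbind(k2 = fit2$GOF, k3 = fit3$GOF)       # GoF rises as dimensions are added
gof_by_hand(fit2$eig, 2)                  # matches fit2$GOF
```

The two GoF values differ only when some eigenvalues are negative, which happens when the distances are not exactly Euclidean.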
MDS in R
• Manly Example 11.2: voting behaviour of congressmen.

> head(cm)
                 Hunt.R. Sandman.R. Howard.D. Thompson.D. Frelinghuysen.R. Forsythe.R. Widnall.R.
Hunt(R)                0          8        15          15               10           9          7
Sandman(R)             8          0        17          12               13          13         12
Howard(D)             15         17         0           9               16          12         15
Thompson(D)           15         12         9           0               14          12         13
Frelinghuysen(R)      10         13        16          14                0           8          9
Forsythe(R)            9         13        12          12                8           0          7
                 Roe.D. Helstoski.D. Rodino.D. Minish.D. Rinaldo.R. Maraziti.R. Daniels.D.
Hunt(R)              15           16        14        15         16           7         11
Sandman(R)           16           17        15        16         17          13         12
Howard(D)             5            5         6         5          4          11         10
Thompson(D)          10            8         8         8          6          15         10
Frelinghuysen(R)     13           14        12        12         12          10         11
Forsythe(R)          12           11        10         9         10           6          6
                 Pattern.D.
Hunt(R)                  13
Sandman(R)               16
Howard(D)                 7
Thompson(D)               7
Frelinghuysen(R)         11
Forsythe(R)              10
> str(cm)
'data.frame':   15 obs. of  15 variables:
 $ Hunt.R.         : int  0 8 15 15 10 9 7 15 16 14 ...
 $ Sandman.R.      : int  8 0 17 12 13 13 12 16 17 15 ...
 $ Howard.D.       : int  15 17 0 9 16 12 15 5 5 6 ...
 $ Thompson.D.     : int  15 12 9 0 14 12 13 10 8 8 ...
 $ Frelinghuysen.R.: int  10 13 16 14 0 8 9 13 14 12 ...
 $ Forsythe.R.     : int  9 13 12 12 8 0 7 12 11 10 ...
 $ Widnall.R.      : int  7 12 15 13 9 7 0 17 16 15 ...
 $ Roe.D.          : int  15 16 5 10 13 12 17 0 4 5 ...
 $ Helstoski.D.    : int  16 17 5 8 14 11 16 4 0 3 ...
 $ Rodino.D.       : int  14 15 6 8 12 10 15 5 3 0 ...
 $ Minish.D.       : int  15 16 5 8 12 9 14 5 2 1 ...
 $ Rinaldo.R.      : int  16 17 4 6 12 10 15 3 1 2 ...
 $ Maraziti.R.     : int  7 13 11 15 10 6 10 12 13 11 ...
 $ Daniels.D.      : int  11 12 10 10 11 6 11 7 7 4 ...
 $ Pattern.D.      : int  13 16 7 7 11 10 13 6 5 6 ...

MDS in R
• Manly Example 11.2: voting behaviour of congressmen.
• GoF improves as dimensions are added, but the coordinate system does not change.
• The 2D fit is only average.
> # metric MDS
> (fit1 ...
> (fit2 ...
> # plot solution
> x ...
> y ...
> plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric MDS", type="n")
> text(x, y, labels = row.names(cm), cex=.7)

MDS in R
• Manly Example 11.2: voting behaviour of congressmen.
• Stress is very low.
• Remember the original distance matrix was just a difference in counts.
• The same 8 congressmen stand apart from the clustered group.

> library(vegan)
> (cmmds ...
... Stress nearly unchanged (ratio > sratmax)
> plot(cmmds, choices = c(1,2))

MDS
• In both of these examples we have started with a distance matrix.
• Please make sure you work through the tutorial examples, which start with data sets that require construction of the distance matrix prior to MDS analysis.
• Also in the tutorial is a comparison of metric MDS and PCA.
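When the starting point is a raw data set rather than a distance matrix, the workflow could look like the sketch below. It is not the tutorial code: it uses R's built-in `USArrests` data as a stand-in, and the choices to standardise the variables with `scale()` and to use Euclidean distance are assumptions of this sketch. The plotting calls mirror the metric MDS example above.

```r
X   <- scale(USArrests)                 # standardise: the variables are on different scales
d   <- dist(X)                          # Euclidean distance matrix between the 50 states
fit <- cmdscale(d, k = 2, eig = TRUE)   # metric MDS in 2 dimensions

x <- fit$points[, 1]                    # coordinate 1
y <- fit$points[, 2]                    # coordinate 2
fit$GOF                                 # goodness of fit of the 2D solution

if (interactive()) {                    # draw the ordination when run at the console
  plot(x, y, xlab = "Coordinate 1", ylab = "Coordinate 2",
       main = "Metric MDS", type = "n")
  text(x, y, labels = rownames(USArrests), cex = 0.7)
}
```

A different distance measure at the `dist()` step would give a different ordination, so the choice of distance is part of the analysis rather than a technical detail.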