# Multivariate

Unsupervised clustering is performed using principal component analysis (PCA), principal coordinates analysis (PCoA), non-metric multidimensional scaling (NMDS) [1][2], networks and hierarchical clustering. These methods reveal sample clusters and hidden data structures. They also help identify outliers, which may be problematic and may need to be removed from the analysis.

Supervised multivariate methods allow identification of complex microbiota-environment interactions. P-values indicate significant associations, i.e. whether a factor explains a significant amount of the variation in community composition. The supervised methods provided are described below.

## Multivariate Methods

All multivariate methods provided by Calypso are related.

### CCA and CCA+

Canonical correspondence analysis (CCA) [3] uses dissimilarity matrices to test whether sample groups are significantly different (i.e. whether they have different community profiles as measured by the selected distance metric) [4]. CCA provides a p-value for each explanatory variable, indicating whether this variable significantly explains variation in community profiles. CCA includes primary group, secondary group and pair to explain variation in community distances. CCA+ includes all environmental variables that the user specifies in the uploaded meta annotation file.

Canonical correspondence analysis (CCA) is a supervised, multivariate technique similar to RDA. CCA can analyze complex bacteria-environment interactions and identifies variance in the relative abundance of bacteria that can be attributed to each included environmental variable. For each environmental variable a p-value is given, indicating whether this variable significantly impacts bacterial community composition (CCA+).

CCA identifies whether sample groups significantly affect community composition. A PCA-like figure is shown and a p-value is calculated, indicating whether the sample groups significantly impact community composition. The following command is executed in R: cca(relative OTU/taxa abundance ~ sample groups).

CCA-F assesses whether primary group, secondary group and pair affect bacterial community composition in a multivariate approach. The following command is executed in R: cca(relative OTU/taxa abundance ~ primaryGroup + secondaryGroup + pair).

CCA+ identifies environmental variables impacting bacterial community composition in a multivariate analysis. For each environmental variable a p-value is given, which indicates the significance of the association. The following command is executed in R: cca(relative OTU/taxa abundance ~ env1 + env2 + env3 + ...), where env1, env2, ... are environmental variables.

### RDA and RDA+

Redundancy analysis (RDA) is a popular supervised multivariate ordination method in ecology (see the vegan R package for more details). RDA allows identification of complex microbiome-environment interactions.

RDA extracts the variation in the microbial community profiles that can be explained by a set of explanatory variables: sample groups for RDA, or environmental variables for RDA+ [5]. P-values indicate whether explanatory variables are significantly associated with variation in the data matrix (abundance of individual taxa). A p-value < 0.05 indicates that a variable is significantly associated with variation in the abundance of specific taxa.

Age, gender and location are all significantly associated with variation in the microbial communities of the example data set.

RDA is implemented using the rda() function from the vegan R package. rda() is run with parameters scale=T and na.action="na.omit". All other parameters are set to their default values. The following command is run in R: rda(OTU-table ~ sample groups).

RDA+ identifies if variance in the community composition can be attributed to environmental variables in a multivariate approach. The following command is executed in R: rda(OTU-table ~ env1 + env2 + env3 + ...), where env1, env2, ... are environmental variables.

### PCoA

Principal coordinates analysis (PCoA) is a popular ordination method in ecology. Given a matrix of pairwise distances between samples, PCoA visualizes them in a two-dimensional plot such that the Euclidean distance between two samples in the plot approximates their pairwise distance in the original matrix as closely as possible.

#### Interpretation

The distance between samples in the two-dimensional plot reflects how similar their community composition is: samples that lie close together have similar communities.

### PCA

PCA ordinates samples in two dimensions according to the main variance in the dataset. PCA+ overlays environmental variables.

### Anosim

Anosim also uses dissimilarity matrices to test whether sample groups are significantly different (i.e. whether they have different community profiles as measured by the selected distance metric) [6]. Anosim provides a single p-value indicating whether community profiles are significantly different between sample groups. The p-value is calculated by comparing intra-group distances with between-group distances.
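The comparison of intra-group and between-group distances can be illustrated with a small permutation test. The following is a minimal sketch in plain Python, not the code Calypso runs. It assumes the standard rank-based ANOSIM statistic R = (mean between-group rank - mean within-group rank) / (M/2), where M is the number of sample pairs. The function names `rank_values`, `anosim_r` and `anosim` are hypothetical.

```python
import random
from itertools import combinations


def rank_values(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def anosim_r(dist, groups):
    """ANOSIM R statistic for a square distance matrix and one label per sample."""
    pairs = list(combinations(range(len(groups)), 2))
    ranks = rank_values([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)


def anosim(dist, groups, permutations=999, seed=0):
    """Return (R, p-value); the p-value comes from shuffling the group labels."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Illustrative data: six samples placed on a line, two well-separated groups.
positions = [0, 1, 2, 10, 11, 12]
example_dist = [[abs(a - b) for b in positions] for a in positions]
example_groups = ["A", "A", "A", "B", "B", "B"]
r_statistic, p_value = anosim(example_dist, example_groups)
print(r_statistic, p_value)
```

An R close to 1 means that between-group distances are consistently larger than intra-group distances, while an R near 0 means the grouping explains little.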

### Adonis

Adonis is a multivariate technique analogous to MANOVA and describes if variation in community composition can be attributed to different experimental treatments or control variables (see vegan R package for more details). Adonis describes if the community composition is different between groups. Adonis-F describes if variance in community composition can be attributed to primary groups, secondary groups and pair. Adonis+ describes if variance in community composition can be attributed to the different environmental variables.

### DCA

Detrended correspondence analysis (DCA) is a method widely used in ecology to find the main factors or gradients in large, species-rich but usually sparse data matrices [7][8].

### DAPC

Discriminant Analysis of Principal Components as described by Jombart et al. BMC Genetics 2010.

### Heatmap+

Visualizes Pearson's correlation between the abundance of taxa and environmental parameters as a color code. Recommended color palettes: "yellowblue", "redgreen", "heat" and "heat2".
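The values behind such a heatmap can be sketched as a table of Pearson correlation coefficients, one per taxon and environmental parameter. This is an illustration in plain Python, not Calypso's implementation. The function names `pearson` and `correlation_table`, the taxa and the parameter values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (NaN if one is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def correlation_table(taxa, env):
    """Map each taxon to its correlation with every environmental parameter."""
    return {
        taxon: {name: pearson(abundance, values) for name, values in env.items()}
        for taxon, abundance in taxa.items()
    }


# Illustrative relative abundances for four samples (made-up numbers).
example_taxa = {
    "Bacteroides": [0.10, 0.20, 0.30, 0.40],
    "Prevotella": [0.40, 0.30, 0.20, 0.10],
}
example_env = {"age": [20, 30, 40, 50], "pH": [6.5, 7.1, 6.8, 7.0]}
for taxon, row in correlation_table(example_taxa, example_env).items():
    print(taxon, row)
```

Each cell of the resulting table would map to one colored tile in the heatmap.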

### Permdisp2

Analyses if global community composition is significantly different between groups. PERMDISP2 visualizes the distances of each sample to the group centroid in a PCoA and gives a p-value for the significance of the grouping. See vegan R package for more details.

## Distance-based methods

Distance plots visualize the distance/similarity of the community profiles.

Available plots are PCoA, hierarchical clustering, heatmaps, CCA, PERMDISP2, anosim and adonis.

First, the pairwise distance of all sample pairs is computed.

Different distance measures are available, including Jaccard, Yue & Clayton, Euclidean distance and Chao distance. Subsequently, the pairwise distance matrix is visualized by PCoA, dendrogram, heatmap, CCA, PERMDISP2, or anosim. For example, you can first run a PCoA to visually identify whether samples from each group cluster together. The significance of the clustering can then be assessed with PERMDISP2 or anosim.

To test if the intra-group distances are smaller than the inter-group distances, run PERMDISP2, anosim, or "GlobCommunityComp" in the GroupPlots view.
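The first step described in this section, computing the distance of all sample pairs, can be sketched as follows. This is a minimal illustration in plain Python rather than Calypso's code. It covers only two of the listed measures and assumes the presence/absence form of the Jaccard distance. The function names are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two abundance profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard(a, b):
    """Presence/absence Jaccard distance: 1 - shared taxa / taxa seen in either."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1 - len(present_a & present_b) / len(union)


def distance_matrix(samples, metric):
    """Symmetric matrix of pairwise distances with zeros on the diagonal."""
    n = len(samples)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = metric(samples[i], samples[j])
    return dist


# Illustrative counts: three samples (rows) by three taxa (columns).
example_samples = [[1, 0, 2], [0, 0, 2], [3, 4, 0]]
for row in distance_matrix(example_samples, jaccard):
    print(row)
```

The resulting matrix is the input that PCoA, hierarchical clustering or anosim would then operate on.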

## Other methods

### Network

Samples are represented as nodes. Samples with a similar community composition (similarity > Edge Min Similarity) are connected by edges. Adjust the Edge Min Similarity parameter to increase or decrease the number of connected samples. The distance index used can be set with "Distance Method". Recommendation: Bray-Curtis.
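The edge rule can be illustrated with a short sketch. It assumes that similarity is defined as 1 minus the Bray-Curtis dissimilarity, which is a guess, since the source does not state how Calypso converts distances to similarities. The function names `bray_curtis` and `similarity_edges` are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (0 = identical)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def similarity_edges(samples, edge_min_similarity):
    """Return (sample, sample, similarity) for every pair above the threshold."""
    names = sorted(samples)
    edges = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            # Assumption: similarity = 1 - Bray-Curtis dissimilarity.
            similarity = 1 - bray_curtis(samples[first], samples[second])
            if similarity > edge_min_similarity:
                edges.append((first, second, similarity))
    return edges


# Illustrative counts for three samples (made-up numbers).
example_samples = {"s1": [10, 0, 5], "s2": [9, 1, 5], "s3": [0, 12, 3]}
for edge in similarity_edges(example_samples, edge_min_similarity=0.5):
    print(edge)
```

Lowering the threshold adds edges between less similar samples, while raising it leaves only the closest pairs connected.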

### Heatmap

Visualises similarities of community profiles as a heatmap.

### SVM LOOC

SVM leave one out cross-validation (SVM LOOC) can be used to assess if two sample groups show a different bacterial community composition, or if the community profiles can be used to predict the group (class) of samples. For example, the method could be used to analyze if the gut microbiota shows a different composition between subjects with and without diabetes or to assess if the gut microbiota composition is a predictive biomarker for diabetes.

SVM LOOC works iteratively. In each step, one sample is excluded and an SVM is trained to distinguish between two sample groups (e.g. diabetes and non-diabetes). The trained SVM is then applied to predict the group affiliation of the excluded sample. This is repeated for each sample. Finally, the predicted group affiliations are compared with the original group labels to calculate the classification accuracy (percentage of correctly predicted group labels), sensitivity (TP/(TP+FN)) and specificity (TN/(TN+FP)).
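The leave-one-out loop and the three metrics can be sketched as follows. To keep the example free of dependencies, a simple nearest-centroid classifier stands in for the SVM, so this illustrates the cross-validation scheme only, not Calypso's classifier. The function names and the data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose training centroid is closest to `sample`.

    Stand-in for the SVM: any classifier with a train/predict step fits here.
    """
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], sample))


def leave_one_out(features, labels, positive):
    """Hold out each sample once, predict it, and summarise the predictions."""
    tp = tn = fp = fn = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted = nearest_centroid_predict(train_x, train_y, features[i])
        if labels[i] == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return {
        "accuracy": (tp + tn) / len(features),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Illustrative relative abundances of two taxa per sample (made-up numbers).
example_features = [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1],
                    [0.2, 0.8], [0.3, 0.7], [0.1, 0.9], [0.85, 0.15]]
example_labels = ["diabetes"] * 3 + ["control"] * 4
print(leave_one_out(example_features, example_labels, positive="diabetes"))
```

In the example data, the last control sample resembles the diabetes group, so it is counted as a false positive and lowers the specificity without affecting the sensitivity.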