# Calypso Multivariate Details

## Contents

### Multivariate Analysis

The following unsupervised multivariate methods are available: Network, PCA, PCoA, Dendrogram, Heatmap, DCA, and NMDS.

Additionally, the following supervised multivariate techniques are provided: RDA, CCA, and DAPC.

The impact of environmental variables on the community composition can be assessed by PCA+, DCA+, NMDS+, RDA+, CCA+, Heatmap+, and Hierarchical clustering.

## PCoA

Given a matrix of pairwise distances between samples, principal coordinates analysis (PCoA) represents these distances as faithfully as possible in a two-dimensional plot: the Euclidean distance between two samples in the plot approximates their pairwise distance in the original matrix. Interpretation: samples that lie close together in the two-dimensional plot have a similar community composition.
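The projection described above can be illustrated with a minimal sketch of classical PCoA in Python with NumPy. This is not Calypso's own code; the function name `pcoa` and the toy distance matrix are assumptions made for illustration.

```python
import numpy as np

def pcoa(dist, n_axes=2):
    """Classical PCoA: place samples so Euclidean distances approximate `dist`."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances (Gower centring).
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # Eigen-decompose and keep the axes with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_axes]
    # Negative eigenvalues can occur for non-Euclidean distances; clip them to zero.
    lam = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical distances between four samples located at 0, 1, 2 and 4 on a line.
positions = np.array([0.0, 1.0, 2.0, 4.0])
dist = np.abs(positions[:, None] - positions[None, :])
coords = pcoa(dist)  # one row of 2-D plot coordinates per sample
```

Because the toy distances are exactly Euclidean, the plot distances reproduce them without loss. With ecological indices such as Bray-Curtis the two-dimensional plot is usually only an approximation.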

## PERMDISP2

Method for testing whether the dispersion of community composition (the spread of samples around their group centroid) differs significantly between groups. PERMDISP2 visualizes the distance of each sample to its group centroid in a PCoA and gives a p-value for the difference in these distances between groups.

## ANOSIM

Method for testing whether global community composition differs significantly between groups. ANOSIM compares the distances between community profiles from different groups (between-group distances) with the distances between profiles from the same group (within-group distances).
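As an illustration of this comparison, the sketch below computes the usual ANOSIM R statistic from ranked distances, plus a permutation p-value. It is a simplified stand-in written in plain Python, not the routine Calypso runs; the function names and the toy data are hypothetical.

```python
import random
from itertools import combinations

def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def anosim_r(dist, groups):
    """R = (mean between-group rank - mean within-group rank) / (n(n-1)/4)."""
    n = len(groups)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (n * (n - 1) / 4)

def anosim_p(dist, groups, n_perm=999, seed=0):
    """Share of label permutations whose R is at least the observed R."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical distance matrix: samples 0-1 form group A, samples 2-3 group B.
dist = [
    [0.0, 0.1, 0.7, 0.8],
    [0.1, 0.0, 0.9, 1.0],
    [0.7, 0.9, 0.0, 0.2],
    [0.8, 1.0, 0.2, 0.0],
]
groups = ["A", "A", "B", "B"]
r_stat = anosim_r(dist, groups)
p_value = anosim_p(dist, groups)
```

An R close to 1 means that between-group distances are consistently larger than within-group distances; values near 0 indicate no separation between groups.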

## Hierarchical Clustering & Hierarchical Clustering+

## Network

Samples are represented as nodes. Samples with a similar community composition (similarity > Edge Min Similarity) are connected by an edge. Raise the Edge Min Similarity parameter to decrease the number of connected samples, or lower it to increase that number. The distance index can be set with the "Distance Method" parameter.
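The effect of the Edge Min Similarity threshold can be sketched as follows. The example assumes a distance index scaled to the range 0 to 1 and defines similarity as 1 minus distance; both are assumptions for illustration, as are the function name and the sample data.

```python
def build_edges(names, dist, edge_min_similarity):
    """Return (sample, sample, similarity) for every pair above the threshold."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            # Assumption: similarity is 1 - distance for a distance in [0, 1].
            similarity = 1.0 - dist[i][j]
            if similarity > edge_min_similarity:
                edges.append((names[i], names[j], similarity))
    return edges

# Hypothetical distances between four samples.
names = ["S1", "S2", "S3", "S4"]
dist = [
    [0.00, 0.10, 0.40, 0.90],
    [0.10, 0.00, 0.30, 0.85],
    [0.40, 0.30, 0.00, 0.70],
    [0.90, 0.85, 0.70, 0.00],
]
loose = build_edges(names, dist, edge_min_similarity=0.5)   # more edges
strict = build_edges(names, dist, edge_min_similarity=0.8)  # fewer edges
```

In this toy example, sample S4 stays unconnected at both thresholds because it is dissimilar to every other sample.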

## Heatmap & Heatmap+

The heatmap visualizes the similarity/distance of the community composition of samples by a color code. The distance index can be set by the parameter "Distance Method".

## PCA & PCA+

Principal component analysis (PCA) visualizes the similarity/difference of samples along the main axes of variance in the dataset. PCA+ overlays environmental variables on the regular PCA. PC1 is the axis that captures the most variance in the dataset and PC2 is the axis that captures the second most (variance here refers to the relative abundance of the different taxa).
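To make the meaning of PC1 and PC2 concrete, here is a minimal PCA sketch based on a singular value decomposition with NumPy. It is an illustration under assumed data, not Calypso's implementation; the function name `pca` and the abundance values are hypothetical.

```python
import numpy as np

def pca(abundance, n_components=2):
    """Return sample scores and the fraction of variance captured per axis."""
    x = np.asarray(abundance, dtype=float)
    # Centre each taxon (column) on its mean across samples.
    centered = x - x.mean(axis=0)
    # Singular values come back sorted, so the first axis captures most variance.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]

# Hypothetical relative abundances: four samples (rows) by three taxa (columns).
abundance = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.2, 0.7],
])
scores, explained = pca(abundance)
```

`scores[:, 0]` and `scores[:, 1]` are the PC1 and PC2 coordinates used for plotting, and `explained` gives the share of total variance each axis represents.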

## DAPC

Discriminant Analysis of Principal Components as described by Jombart et al., BMC Genetics 2010.

## RDA & RDA+

Redundancy analysis (RDA) is a popular supervised ordination method in ecology. RDA assesses whether variance in the community composition profiles can be attributed to the sample groups or to environmental variables. For example, you can first plot a PCA to visually assess whether samples from each group cluster together, and then use RDA to calculate the significance of these clusters.

"RDA" tests whether the community composition differs significantly between the sample groups. The significance of the clustering is given as a p-value. The following command from the vegan package is executed in R: rda(OTU-table ~ sample groups), where OTU-table is a matrix of relative OTU abundances.

"RDA+" identifies if variance in the community composition can be attributed to environmental variables. The following command is executed in R: rda(OTU-table ~ env1 + env2 + env3 + ...), where env1, env2, ... are environmental variables and OTU-table is a matrix of relative OTU abundance.

## DCA & DCA+

Detrended correspondence analysis. For more information, see GUSTA ME.

## NMDS & NMDS+

Non-metric multidimensional scaling (NMDS) uses the rank order of the pairwise distances between samples, rather than the distances themselves, to represent the data in a low-dimensional space.

## CCA & CCA+

Canonical correspondence analysis (CCA) is a supervised multivariate technique related to PCA and PCoA. CCA can identify complex associations between two data matrices. In more detail, it assesses whether some of the variance observed in one data matrix can be explained by variance observed in the second matrix. In Calypso, the first data matrix is the OTU/taxa abundance matrix and the second is either 1) the sample grouping ("CCA"); 2) primary group, secondary group and pair ("CCA-F"); or 3) a matrix of all environmental variables ("CCA+").

"CCA" assesses whether the community composition differs significantly between the sample groups. A PCA-like figure is shown and a p-value is calculated, indicating whether the sample groups significantly impact the community composition. The following command is executed in R: cca(relative OTU/taxa abundance ~ sample groups)

"CCA+" identifies environmental variables affecting community composition. The significance of the association is given as a p-value. The following command is run in R: cca(relative OTU/taxa abundance ~ env1 + env2 + env3 + ...), where env1, env2, ... are environmental variables.

## SVM LOOC

Support Vector Machine (SVM) combined with leave-one-out calculation (LOOC) works iteratively. In each step, one sample is excluded and an SVM is trained on the remaining samples to discriminate between two classes. The trained SVM is then applied to predict the class of the excluded sample. This is repeated for each sample. Finally, the predicted class is compared with the known class of each sample to calculate the classification accuracy (percentage of correctly predicted class labels), sensitivity (TP/(TP+FN)) and specificity (TN/(TN+FP)).
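The leave-one-out loop can be sketched in plain Python as shown below. To keep the example dependency-free, a simple nearest-centroid classifier is swapped in for the SVM, so this illustrates the evaluation procedure only, not Calypso's classifier. The metrics use the standard definitions sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP). All names and data are hypothetical.

```python
def fit_centroids(samples, labels):
    """Stand-in for SVM training: compute the mean profile of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Assign x to the class with the closest centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

def leave_one_out(samples, labels, positive):
    tp = tn = fp = fn = 0
    for i in range(len(samples)):
        # Train on every sample except sample i, then predict sample i.
        model = fit_centroids(samples[:i] + samples[i + 1:],
                              labels[:i] + labels[i + 1:])
        predicted = predict(model, samples[i])
        if predicted == positive and labels[i] == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif labels[i] == positive:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(samples),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical two-taxon profiles; the last control resembles the cases.
samples = [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1],
           [0.2, 0.8], [0.3, 0.7], [0.1, 0.9], [0.6, 0.4]]
labels = ["case", "case", "case",
          "control", "control", "control", "control"]
result = leave_one_out(samples, labels, positive="case")
```

Because each prediction is made for a sample the model has not seen, the resulting accuracy is a less optimistic estimate than one measured on the training data itself.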

## Adonis & Adonis+

Adonis is analogous to MANOVA and describes how variation in community composition can be attributed to different experimental treatments or control variables.

Adonis describes if the community composition is different between groups.

Adonis+ describes if variance in community composition can be attributed to the different environmental variables.
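For a single grouping factor, the variance partitioning behind Adonis can be sketched as a pseudo-F statistic computed directly from a distance matrix. This is a simplified illustration of the general approach in plain Python, not the routine Calypso runs; the function name and the toy data are hypothetical, and the permutation step that yields the p-value is omitted.

```python
from itertools import combinations

def pseudo_f(dist, groups):
    """Partition squared distances into among-group and within-group parts."""
    n = len(groups)
    levels = sorted(set(groups))
    # Total sum of squares from all pairwise distances.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares, one term per group.
    ss_within = 0.0
    for level in levels:
        members = [i for i in range(n) if groups[i] == level]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    f_stat = (ss_among / (len(levels) - 1)) / (ss_within / (n - len(levels)))
    r_squared = ss_among / ss_total
    return f_stat, r_squared

# Hypothetical distance matrix: samples 0-1 form group A, samples 2-3 group B.
dist = [
    [0.0, 0.1, 0.7, 0.8],
    [0.1, 0.0, 0.9, 1.0],
    [0.7, 0.9, 0.0, 0.2],
    [0.8, 1.0, 0.2, 0.0],
]
groups = ["A", "A", "B", "B"]
f_stat, r_squared = pseudo_f(dist, groups)
```

`r_squared` is the fraction of the total variation attributed to the grouping. A p-value would be obtained by recomputing `f_stat` after repeatedly shuffling the group labels.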