# Calypso

URL Calypso server: http://cgenome.net/calypso/

## Contents

- 1 Disclaimer
- 2 Screenshots
- 3 Published articles that have used Calypso
- 4 Introduction
- 5 Tutorial
- 5.1 Input data
- 5.2 Start demo project
- 5.3 Upload own data files
- 5.4 Output
- 5.5 Main navigation menu
- 5.6 Summary
- 5.7 Rarefaction analysis
- 5.8 Sample: Visualise microbial community composition
- 5.9 Compare taxa abundance across sample groups
- 5.10 Stats: Statistical comparison of sample groups
- 5.11 Multiv: Multivariate analysis
- 5.11.1 Correlation heatmap
- 5.11.2 Principal component analysis (PCA)
- 5.11.3 Canonical correspondence analysis (CCA)
- 5.11.4 Principal coordinates analysis (PCoA)
- 5.11.5 Redundancy analysis (RDA+)
- 5.11.6 Support vector machine classification (SVM)
- 5.11.7 Non-metric multidimensional scaling (NMDS)
- 5.11.8 Anosim
- 5.11.9 Adonis (permutational manova (PERMANOVA))
- 5.11.10 DCA
- 5.11.11 DAPC
- 5.11.12 Permdisp2
- 5.11.13 Network

- 5.12 Div: Analysis of microbial diversity
- 5.13 Regression: Multivariate linear regression
- 5.14 Network: Network analysis
- 5.15 BM: Biomarker discovery
- 5.16 Hierarchy: Use taxonomy reference for visualizations
- 5.17 Paired: Paired analysis
- 5.18 Norm: Examine effects of data normalisation/transformation
- 5.19 FA: Factor Analysis
- 5.20 General options
- 5.21 Implementation
- 5.22 FAQ
- 5.22.1 Error message: An internal error occurred, likely you either didn't upload a data and meta data file or your session has expired
- 5.22.2 Figure labels are only partially displayed
- 5.22.3 CCA changes if "Color by" is changed, but CCA+ does not
- 5.22.4 Figure labels overlap
- 5.22.5 Figure legend overlaps with chart
- 5.22.6 Figures are completely white
- 5.22.7 Error message: Internal ERROR: null dataMatrix
- 5.22.8 Error message: j_id_id15:resolution: Validation Error: Value is not of the correct type
- 5.22.9 The figure in Network+ does not contain my environmental variables
- 5.22.10 Warning: no environmental variables provided

## Disclaimer

This program is provided in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. The software may be used at your own risk. If you decide to use Calypso in published work, it is YOUR responsibility to ensure the correctness and consistency of the data.

## Screenshots

## Published articles that have used Calypso

- Umu ÖC, Frank JA, Fangel JU, Oostindjer M, da Silva CS, Bolhuis EJ, Bosch G, Willats WG, Pope PB, Diep DB. Resistant starch diet induces change in the swine microbiome and a predominance of beneficial bacterial populations (2015). Microbiome.
- Simeoni U, Berger B, Junick J, Blaut M, Pecquet S, Rezzonico E, Grathwohl D, Sprenger N, Brüssow H; Study team, Szajewska H. Gut microbiota analysis reveals a marked shift to bifidobacteria by a starter infant formula containing a synbiotic of bovine milk-derived oligosaccharides and Bifidobacterium animalis subsp. lactis CNCM I-3446. Environ Microbiol. 2015
- Ainsworth T, Krause L, …[12 authors]…, Leggat W. The coral core microbiome identifies rare bacterial taxa as ubiquitous endosymbionts (2015). ISME.
- Giacomin P, Zakrzewski M, Croese J, Su X, Sotillo J, McCann L, Navarro S, Mitreva M, Krause L, Loukas A, Cantacessi C. Experimental hookworm infection and escalating gluten challenges are associated with increased microbial richness in celiac subjects (2015). Scientific Reports.
- Smith DJ, Badrick AC, Zakrzewski M, Krause L, Bell SC, Anderson GJ, Reid DW. Pyrosequencing reveals transient cystic fibrosis lung microbiome changes with intravenous antibiotics (2014). The European respiratory journal.
- Dewar ML, Arnould JP, Krause L, Trathan P, Dann P, Smith SC. Influence of fasting during moult on the faecal microbiota of penguins (2014). PLoS One.
- Swe PM, Zakrzewski M, Kelly A, Krause L, Fischer K. Scabies mites alter the skin microbiome and promote growth of opportunistic pathogens in a porcine model (2014). PLoS Negl Trop Dis, 8 (5).
- Cantacessi C, Giacomin P, Croese J, Zakrzewski M, Sotillo J, McCann L, Nolan MJ, Mitreva M, Krause L, Loukas A. Impact of experimental hookworm infection on the human gut microbiota (2014). J Infect Dis.
- Dewar ML, Arnould JP, Krause L, Dann P, Smith SC. Interspecific variations in the faecal microbiota of Procellariiform seabirds (2014). FEMS Microbiol Ecol.
- Zhang L, Gowardman J, Morrison M, Krause L, Playford EG, Rickard CM. Molecular investigation of bacterial communities on intravascular catheters: no longer just Staphylococcus (2014). Eur J Clin Microbiol Infect Dis, 33 (7)
- Plieskatt JL, Deenonpoe R, Mulvenna JP, Krause L, Sripa B, Bethony JM, Brindley PJ. Infection with the carcinogenic liver fluke Opisthorchis viverrini modifies intestinal and biliary microbiome (2013). FASEB J.
- Reis, M., Roy, N., Bermingham, E., Ryan, L., Bibiloni, R., Young, W., Krause, L., Berger, B., North, M., Stelwagen, K., Reis, M. Impact of dietary dairy polar lipids on lipid metabolism of young mice fed high fat diet (2013). Journal of Agricultural and Food Chemistry.

# Introduction

Calypso is a powerful yet easy-to-use tool for the higher-level analysis of taxonomic information from metagenomic datasets. Its visualization techniques generate high-quality figures that can readily be used in scientific publications. Taxonomic information can be generated using high-throughput sequencing of bacterial 16S rDNA or using metagenomics. The software is accessible via a user-friendly web interface at http://cgenome.net/calypso/. It provides a broad range of data-mining techniques that can be applied at all taxonomic levels: quantitative visualizations and comparisons of composition (e.g. boxplots, bubble plots, interactive hierarchical trees and heatmaps), parametric and non-parametric statistical tests, univariate and multivariate analysis, supervised learning, correlation heatmaps, network analysis, multivariate regression, and diversity estimates. Multivariate statistics are powerful techniques that can identify complex environment-microbiota associations, in which differences in microbial composition can be attributed to multiple variables. Calypso enables lab-based researchers, who may be unfamiliar with advanced statistical software packages, to use these complex types of analysis routinely in their work.

# Tutorial

## Input data

As input Calypso requires a data matrix representing the number of sequences assigned to each taxon (or OTU) and an annotation file providing meta information for each sample. Optionally, a distance matrix and a reference taxonomy can be uploaded. Detailed information about the format of the input files can be found on the data upload page.

### Data matrix file

The data matrix represents the frequency of each taxon in each sample. Usually the matrix represents the number of 16S rDNA sequences or the number of metagenomic reads assigned to each taxon (or OTU). Data can be pre-processed, e.g. normalised by rarefaction analysis or transformed. However, rarefaction analysis in Calypso will only yield meaningful results if raw counts are uploaded. Various file formats are supported, including the common biom-format, which allows direct upload of pre-processed files generated by other analysis pipelines, such as QIIME, mothur, MG-RAST or MetaPhlAn.

### Meta annotation file

The annotation file provides meta information for each sample, including sample group (e.g. case/control). Additionally, multiple optional factors (also called environmental parameters) can be provided, which are explanatory variables. Factors can be independent variables (frequently manipulated by the experimenter, e.g. case/control) or confounding factors (e.g. age, gender, BMI). In Calypso, factors can be discrete (e.g. case/control, female/male, geography) or numeric (e.g. age, days after treatment, BMI, blood glucose level). Discrete variables must be defined using non-numeric values (e.g. male/female instead of 0/1). Numeric factors must only contain numbers and must not contain any non-numeric value (e.g. "unknown"). Missing values must be encoded as "NA". Complex associations between multiple factors and community composition can be inferred using multivariate statistical methods.
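The typing rules above can be checked before upload. The following Python sketch is an illustration only and is not part of Calypso; the function name `classify_factor` and the example values are hypothetical. It treats "NA" as missing, reports a column as numeric only if every remaining value parses as a number, and rejects columns that mix numbers with text.

```python
def classify_factor(values):
    """Return 'numeric' or 'discrete' for one factor column of the annotation file."""
    present = [v for v in values if v != "NA"]  # "NA" encodes a missing value
    if not present:
        raise ValueError("factor contains only missing values")

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    flags = [is_number(v) for v in present]
    if all(flags):
        return "numeric"
    if any(flags):
        # e.g. an age column that contains "unknown" instead of "NA"
        raise ValueError("factor mixes numeric and non-numeric values")
    return "discrete"


print(classify_factor(["23", "NA", "41.5"]))      # a numeric factor such as age
print(classify_factor(["male", "female", "NA"]))  # a discrete factor such as gender
```

Note that a discrete factor coded as 0/1 would be reported as numeric by this check, which is why such factors should be written out as text.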

### Distance matrix

An optional matrix of pair-wise community distances can be uploaded, which facilitates data analysis using the UniFrac metric.

### Taxonomy file

An optional reference taxonomy can be uploaded, which enables interactive visualization of the community composition as hierarchical trees.

## Start demo project

Before taking this tutorial, please upload an example dataset. The tutorial provides links to Calypso web pages. These links will not work if no data has been imported.

To start the demo project:

- Go to the Data Upload Page
- Press "Start Demo Project" to automatically upload an example data file, meta annotation file, taxonomy reference file and distance matrix.

As an example project, Calypso uses a 16S rDNA dataset previously published by Yatsunenko et al. In a cross-sectional study, Yatsunenko et al. analyzed the fecal microbiota of 531 individuals with respect to age, gender, geographic location, kinship and diet of the included infants (breast milk versus formula). Subjects ranged in age from 3 months to 83 years and were sampled from metropolitan regions in the United States, rural communities in Malawi and Amerindian villages in Venezuela. Only a subset of samples and taxa is included in the example data files, to reduce processing time when exploring Calypso with the example dataset.

Reference:

Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, Magris M, Hidalgo G, Baldassano RN, Anokhin AP, Heath AC, Warner B, Reeder J, Kuczynski J, Caporaso JG, Lozupone CA, Lauber C, Clemente JC, Knights D, Knight R, Gordon JI. Human gut microbiome viewed across age and geography. Nature. 2012 May 9;486(7402):222-7.

### Demo project counts file

The number of 16S sequences assigned to each bacterial taxon of the Yatsunenko et al. dataset was downloaded from the Gordon laboratory web-page at Washington University.

Download the example counts file.

### Demo project meta data file

Metadata for the Yatsunenko et al. dataset were downloaded from the Gordon laboratory web-page at Washington University. Metadata included age, geographic location, gender, family id and diet during infancy.

The Calypso annotation file was created in Excel using the downloaded metadata. Pair (3rd column) represents the family id of each subject. Primary group (4th column) represents sample location (rural or metropolitan). The secondary group (5th column) represents the age group of each subject (0, 1, 2-3, 5-17, > 17 years of age).

The following factors were included:

- Gender: Male/Female/NA (not available)
- BMI: numeric value or NA (not available)
- Age: numeric value or NA (not available)
- Location: Malawi/Venezuela/USA

Download the example meta data file.

## Upload own data files

The data upload wiki page provides detailed information on how to create and upload your own data and meta information files.

## Output

Publication-ready figures can be generated in PNG, PDF or SVG format. Resolution, width and height can be specified, as well as the colour palette. SVG is an XML-based vector graphics format, and SVG images can be searched and edited in vector graphics editors, such as Inkscape, GIMP, CorelDraw or Adobe Illustrator. This allows post-processing of generated figures, for example changing colors and font sizes or adding labels or other features.

## Main navigation menu

Calypso is accessible via a free web-server, which provides a user-friendly interface to advanced statistical methods and data-mining algorithms. The top menu provides links to the various data analysis pages and a help page.

## Summary

The Summary page (Sum) provides a basic descriptive overview of your data, in particular the number of sequence reads per sample.

Displayed are the number of reads per sample, the number of reads per taxon or OTU, and the number of samples in which each taxon/OTU has been detected.

**Tutorial:**

- Open the Summary page. Make sure that the Demo Project has been loaded first, as described above.
- Set Type to ReadsPerSample and press SelectMode
- Press DrawChart

The figure shows the distribution of the number of reads per sample. The number of reads is plotted on the x-axis, the number of samples on the y-axis.
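The three summary statistics listed above are simple aggregations of the counts matrix. The following Python sketch is an illustration and not Calypso's own code; the nested-dictionary layout, the function names and the counts are hypothetical.

```python
# Hypothetical counts: sample -> taxon -> number of assigned reads
counts = {
    "S1": {"Bacteroides": 120, "Prevotella": 0, "Blautia": 30},
    "S2": {"Bacteroides": 40, "Prevotella": 75, "Blautia": 0},
    "S3": {"Bacteroides": 3, "Prevotella": 10, "Blautia": 5},
}


def reads_per_sample(counts):
    """Total number of reads in each sample."""
    return {sample: sum(taxa.values()) for sample, taxa in counts.items()}


def reads_per_taxon(counts):
    """Total number of reads assigned to each taxon across all samples."""
    totals = {}
    for taxa in counts.values():
        for taxon, n in taxa.items():
            totals[taxon] = totals.get(taxon, 0) + n
    return totals


def samples_detected(counts):
    """Number of samples in which each taxon has at least one read."""
    detected = {}
    for taxa in counts.values():
        for taxon, n in taxa.items():
            detected[taxon] = detected.get(taxon, 0) + (1 if n > 0 else 0)
    return detected


print(reads_per_sample(counts))
print(reads_per_taxon(counts))
print(samples_detected(counts))
```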

## Rarefaction analysis

The coverage of the original microbial communities by metagenomic sequence data is estimated by rarefaction analysis. Microbial sequences are randomly drawn from each sample. For each subsample, the number of observed species is counted and plotted as a function of the number of sampled sequences. The slope of the rarefaction curve indicates if the underlying microbial community is well represented by the sequence data. A steep slope indicates that a large fraction of the species diversity remains to be discovered. If the curve becomes flatter to the right, a reasonable number of sequence reads has been obtained and more intensive sampling is likely to yield only few additional species [1].
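The subsampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Calypso's implementation: the function name `rarefaction_curve`, the number of repeats and the example community are hypothetical, and reads are drawn without replacement.

```python
import random


def rarefaction_curve(taxon_counts, depths, repeats=20, seed=0):
    """Mean number of observed taxa at each subsampling depth for one sample."""
    rng = random.Random(seed)
    # Expand the counts into one list entry per sequence read
    reads = [taxon for taxon, n in taxon_counts.items() for _ in range(n)]
    curve = []
    for depth in depths:
        # Draw `depth` reads without replacement and count the distinct taxa
        observed = [len(set(rng.sample(reads, depth))) for _ in range(repeats)]
        curve.append((depth, sum(observed) / repeats))
    return curve


# Hypothetical community with two dominant and three rarer taxa (100 reads)
community = {"A": 50, "B": 30, "C": 15, "D": 4, "E": 1}
curve = rarefaction_curve(community, depths=[1, 10, 50, 100])
for depth, taxa in curve:
    print(depth, taxa)
```

Plotting the second value against the first gives the rarefaction curve; at the full sequencing depth every taxon present in the sample is observed.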

**Tutorial**

- Open the Rarefaction Analysis page. Make sure that the Demo Project has been loaded first, as described above.
- Select a taxonomic level (Recommendation: OTU or species)
- Press DrawChart

The rarefaction curves obtained for the example dataset (OTU based) flatten to the right, indicating that the underlying microbial communities are well covered by the sequence data. However, since the curves are still increasing, some bacterial species have been missed and complete coverage of the microbial diversity would require deeper sequencing.

## Sample: Visualise microbial community composition

Microbial community composition is visualized quantitatively using heatmaps, bubble plots, scatter plots, stripcharts, barcharts or boxplots. The SamplePlots Page uses bubble plots, heatmaps and bar charts to visualise measurements by square size, colour code and bar height, respectively. When visualizing data in heatmaps, Calypso allows trimming of outliers, selection of a wide range of color palettes and adjustment of the color range center.

### Generate Boxplot

- Open the SamplePlots Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to OTU
- Set chart type to Boxplot
- Set order to "GroupS-GroupP-Pair" to order the samples by secondary group (age group), then by primary group and finally by family.
- Press "Select Mode"
- Choose a colour palette
- Set filter to 30 to include the 30 most abundant OTUs in the box plot
- Press "Draw Chart"
- To obtain high-quality figures for publication, set figure resolution, width and height.

The following figure was obtained for the top 30 OTUs of the demo project. The x-axis represents samples and y-axis relative OTU counts. Samples are colored and ordered by the secondary group (age group).

### Generate Barchart

- Open the SamplePlots Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to Family
- Set chart type to Barchart
- Set order to "GroupS-GroupP-Pair" to order the samples by secondary group (age group), then by primary group and finally by family.
- Press "Select Mode"
- Choose a colour palette
- Set filter to 10 to only show the top 10 taxa with the highest mean across all samples
- Press "Draw Chart"
- To obtain high-quality figures for publication, set figure resolution, width and height. Increase the width and/or height if labels are displayed only partially or overlap, or if the legend and chart overlap
- Colors are randomly assigned to each taxon by default. Press Draw Chart again to re-assign colours or choose a specific colour palette.

The following bar chart is obtained for the demo project. Bar charts are useful for presenting the abundance of the dominant bacterial groups. The figure below shows that the family Bifidobacteriaceae (orange) dominates the gut microbiota of infants, while Ruminococcaceae and Lachnospiraceae are more prevalent in adults.
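The filter used in the steps above keeps the taxa with the highest mean across all samples. The Python sketch below shows one way such a filter could work. It assumes the mean is taken over relative abundances, which the tutorial does not state; the function name `top_taxa` and the counts are hypothetical.

```python
def top_taxa(counts, n):
    """Return the n taxa with the highest mean relative abundance across samples."""
    mean_share = {}
    n_samples = len(counts)
    for taxa in counts.values():
        total = sum(taxa.values())
        if total == 0:
            continue  # an empty sample contributes a share of zero to every taxon
        for taxon, reads in taxa.items():
            mean_share[taxon] = mean_share.get(taxon, 0.0) + reads / total / n_samples
    # Sort by decreasing mean share; ties are broken alphabetically
    ranked = sorted(mean_share, key=lambda taxon: (-mean_share[taxon], taxon))
    return ranked[:n]


# Hypothetical family-level counts for two samples
counts = {
    "infant": {"Bifidobacteriaceae": 80, "Ruminococcaceae": 15, "Lachnospiraceae": 5},
    "adult": {"Bifidobacteriaceae": 60, "Ruminococcaceae": 10, "Lachnospiraceae": 30},
}
print(top_taxa(counts, 2))
```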

### Generate heatmap

- Set the taxonomic level to family
- Set chart type to Heatmap+
- Press *Select Mode*
- Select the BlueGoldRed colour palette
- Set filter to 0 to include all taxa
- Select the ReOrderSamples checkbox. If ReOrderSamples is selected, samples will be ordered by hierarchical clustering. Otherwise they will be ordered as selected in the drop down menu Order.
- Unselect the *scale data* checkbox. If *scale data* is selected, values will be scaled in range 0-1.
- Set image resolution to 200, width to 530 and height to 200. Width and/or height can be increased if labels are displayed only partially or overlap, or if the legend and chart overlap
- Press *Draw Chart*

Rows of the generated heatmap represent taxa, columns represent samples. Both are ordered by hierarchical clustering. Taxa abundances are presented in colour code, ranging from red (highly abundant) to blue (rare or absent). Values of factors are presented as a separate heatmap on top of the main heatmap.

The following figure is obtained for the example dataset. Samples cluster by age. The families Lachnospiraceae, Ruminococcaceae and Prevotellaceae dominate the gut microbiota of older subjects, whereas Bifidobacteriaceae is more abundant in infants.

#### Next:

- Set *Remove outliers, trim values by* to 40. All values above 40 will be trimmed to 40. This can be used to trim outliers.
- Press *Draw Chart*
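The two heatmap options used in this section, trimming values at a cap and scaling values to the range 0-1, are simple element-wise operations. The Python sketch below illustrates them on a single row of values. It is not Calypso's code; the function names are hypothetical, and whether Calypso scales per row or over the whole matrix is not stated in this tutorial, so per-row scaling is an assumption.

```python
def trim_values(values, cap):
    """Replace every value above `cap` by `cap` to limit the influence of outliers."""
    return [min(v, cap) for v in values]


def scale_01(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # a constant row carries no contrast
    return [(v - low) / (high - low) for v in values]


row = [3, 55, 40, 120]          # hypothetical abundances of one taxon
trimmed = trim_values(row, 40)  # values above 40 become 40
print(trimmed)
print(scale_01([0, 20, 40]))
```

Trimming before colour mapping keeps a single extreme value from compressing the rest of the colour range.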

#### Next:

- Set *Color centre* to 0.5. Color centre changes the distribution of assigned colours and allows shifting the colour palette. This can be used to increase the resolution for low data values. For example, if BlueGoldRed is chosen as colour palette, and Colour centre is set to 0.5, then gold is assigned to values of around 0.5; blue is assigned to values <0.5 and red is assigned to values >0.5 (for values in range 0-1).
- Press *Draw Chart*

The following heatmap is obtained for the example dataset:

### Details

## Compare taxa abundance across sample groups

Switch to the GroupPlots Page. Measurements across sample groups are compared by parametric and non-parametric statistical tests, including anova, nested anova, t-test, paired t-test, Bayesian t-test, Wilcoxon rank test and Mann–Whitney U test. Significantly different features are shown as a box plot, bar chart or strip chart.

#### Tutorial

- Open the GroupPlots Page. Make sure that the Demo Project has been loaded, as described above.
- Select a taxonomic level, e.g. genus
- Select a sample group (either primary or secondary group; these groups are defined in the meta information file uploaded by the user).
- Press *Select Mode*
- Set *plot type*. AnovaPlot compares measurements (taxa counts in the selected demo project) by anova and visualises taxa with significantly different measurements as a bar chart, with standard deviation as error bars. RankTest compares data values by a non-parametric rank test (Kruskal-Wallis). Significantly different features are visualised as a bar chart. Error bars depict standard deviation. Significance of differences is depicted as: * (p<0.05), ** (p<0.01) and *** (p<0.001)
- Set a *significance threshold*. Only features that are significantly different with a p-value below this threshold are shown
- Press *Draw Chart*

Among the highly abundant genera, *Ruminococcus, Prevotella, Clostridium, Bacteroides* and *Blautia* differ between individuals from the metropolitan (USA) and the rural populations (Malawi, Venezuela).

## Stats: Statistical comparison of sample groups

The Stats Page allows an in-depth statistical comparison of taxa abundances across sample groups. Taxa abundances are compared by parametric and non-parametric statistical tests, including anova, nested anova, t-test, paired t-test, Bayesian t-test (Cyber-T), negative binomial distribution (DESeq2), Wilcoxon rank test and Kruskal-Wallis test. P-values are adjusted for multiple testing by FDR, Benjamini-Hochberg or Bonferroni correction.

#### Tutorial

- Open the Stats Page. Make sure that the Demo Project has been loaded first, as described above.
- Set the taxonomic level to genus.
- Select a statistical test, e.g. *anova* or *rank test*. *Rank test* compares measurements by Wilcoxon rank test for two groups and Kruskal-Wallis test for more than two sample groups
- Set filter to 50. Only the top 50 taxa with the highest mean abundance across all samples are included.
- Press *Select Mode*
- Select *sample group* (either primary, secondary group or any of the provided environmental factors; these groups are defined in the meta information file uploaded by the user).
- Press *Do stats*
- Click on the table header to sort table, e.g. by p-value.

Results of the statistical analysis are presented as a table. Shown are p-values, Bonferroni corrected p-values, false discovery rate, Benjamini-Hochberg corrected p-values [2] and the mean in each group. The following table depicts only the 20 most significantly different genera in the gut microbiota between different locations.
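To make the multiple-testing columns concrete, the Python sketch below computes Bonferroni and Benjamini-Hochberg adjusted p-values using their standard textbook definitions. It illustrates the corrections themselves and is not taken from Calypso; the function names and p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by increasing p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-genus p-values
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni controls the chance of any false positive and is the more conservative of the two; Benjamini-Hochberg controls the expected proportion of false positives among the reported taxa.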

The distribution of p-values is presented as a histogram and a quantile-quantile (QQ) plot. In the lower left figure, the uniform (expected) p-value distribution is indicated by the red line 'Expected'. QQ plots characterize the extent to which the observed distribution of the test statistics follows the expected (null) distribution. This allows the detection of evidence for systematic bias.

### Random forest classification

Random forest classification can be applied to examine complex associations between microbial community composition and a study variable (e.g. location or age in the example dataset).

The following tutorial explains how to apply a random forest analysis to identify genera predictive of geographic location:

- Open the Stats Page. Make sure that the Demo Project has been loaded first, as described above.
- Select the genus level
- Set statistical test to "RandomForest"
- Press "Select Mode"
- Set filter to 20; to reduce computing/waiting time, only the top 20 genera will be included
- Set "Group by" to the environmental variable "Location".
- Press "Do stats"
- Click on the table header to sort the table by "Score (Mean Decrease Accuracy)"

Genera most predictive of geographic location are then shown at the top of the table.

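Random forest is implemented in R:

```r
library(randomForest)

# groups: the selected sample group; x: the data matrix
rf <- randomForest(as.factor(groups) ~ x, importance = TRUE, proximity = FALSE,
                   ntree = 10000, mtry = 20)
im <- as.data.frame(importance(rf))
score <- im$MeanDecreaseAccuracy  # score reported in the results table
```

where groups represents the selected sample group and x represents the data matrix.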

### Details Stats Page

## Multiv: Multivariate analysis

The Multivariate Page facilitates multivariate data visualisation and multivariate statistical testing. Multivariate statistics are powerful techniques that can identify complex associations between community composition and multiple factors. In Calypso, complex associations can be examined by the multivariate methods principal component analysis (PCA), redundancy analysis (RDA), canonical correspondence analysis (CCA), detrended correspondence analysis (DCA), non-metric multidimensional scaling (NMDS), hierarchical clustering, heatmaps, correlation networks and multivariate regression. Correlation networks visualize the positive and negative associations between taxa, between factors, and between taxa and factors.

To explore complex associations between community composition and multiple factors using multivariate statistics, set type to Heatmap+, RDA+ or CCA+. If RDA+ or CCA+ is chosen, all factors defined in the meta annotation file are included in one single coherent model. For each factor a p-value is computed, indicating if the factor is significantly associated with community composition (i.e. if the factor significantly explains variation in the counts data file (RDA+) or variation in sample distances (CCA+)).

The GUide to STatistical Analysis in Microbial Ecology (GUSTA ME) provides an excellent guide for the multivariate analysis of microbial community composition.

### Correlation heatmap

Pearson's correlations between factors and community composition are shown as a heatmap.

- Set level to OTU
- Set type to Heatmap+
- Press "Select Mode"
- Press "Draw Chart"

In the figure below, positive correlations are shown in red, negative correlations in blue. Rows represent factors and columns taxa (or OTUs).
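Each cell of such a heatmap is a Pearson correlation coefficient between one numeric factor and the abundance of one taxon across samples. The Python sketch below computes these values for a small example. It is an illustration only; the function name `pearson`, the factor and the abundances are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / (spread_x * spread_y)


# Hypothetical data: one numeric factor and two taxa measured in four samples
factors = {"Age": [1, 2, 30, 40]}
taxa = {
    "Bifidobacteriaceae": [60, 50, 5, 2],
    "Ruminococcaceae": [2, 5, 30, 35],
}

# Rows are factors, columns are taxa, as in the heatmap
heatmap = {
    factor: {taxon: pearson(values, abundance) for taxon, abundance in taxa.items()}
    for factor, values in factors.items()
}
print(heatmap)
```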

### Principal component analysis (PCA)

PCA allows data visualisation as a 2D plot, identifying sample clusters and determining potentially problematic samples (outliers), which may need to be excluded from downstream analysis (e.g. by setting the include flag to 0 in the meta data file). PCA ordinates samples in two dimensions according to the main variance in the dataset. PCA+ overlays environmental variables.

- Open the Multivariate Page. Make sure that the demo project has been loaded first, as described above.
- Set level to OTU
- Set type to PCA
- Press *Select mode*
- Set colour to blueYellowRed
- Set *Group/Colour by* to Location. Samples in the PCA plot will then be coloured according to their Location.
- Set hull to *Filled Spider*. In the PCA plot, samples of one group will be connected by lines as a spider plot.
- Press *Draw Chart*

The following PCA plot is obtained for the demo project. Samples cluster by location. Samples collected from metropolitan populations in the USA form a separate cluster from samples collected in rural populations from Malawi and Venezuela. Significance of observed clusters can be tested by CCA (next section).

### Canonical correspondence analysis (CCA)

The PCA plot presented above indicates that samples cluster by location. The statistical significance of the observed clustering can be tested by CCA.

- Set level to OTU
- Set type to CCA
- Press "Select mode"
- Set "Group/Colour by" to Location. A CCA will be run for the selected group.
- Set distance metric to Euclidian.
- Press "Draw Chart"

CCA is a multivariate method that is used to explore complex associations between measured variables and multiple explanatory variables (or confounding factors). CCA tests if variations in the data matrix can be explained by the selected sample group (the sample group selected under the drop down menu "Group/Colour by").

Two figures are generated. The first shows a 2D ordination plot, indicating how well samples can be separated according to the sample groups. The second plot provides a p-value, indicating if the sample location significantly explains variations in the sample distances, or in other words if samples cluster significantly by location.

The following result is obtained for the demo project.

According to these results, location significantly explains variations observed in the gut microbiota (p=0.001).

### Principal coordinates analysis (PCoA)

PCoA is a popular ordination method in ecology. Given a matrix of pair-wise distances between samples, PCoA visualizes these in a two-dimensional plot such that the Euclidean distance between two samples in the plot approximates their pair-wise distance in the original matrix as closely as possible. PCoA allows data visualisation as a 2D plot, identifying sample clusters and determining potentially problematic samples (outliers), which may need to be excluded from downstream analysis.

- Open the Multivariate Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to OTU
- Set type to PCoA
- Press "Select mode"
- Set distance metric to "Euclidian". Measurement profiles can be compared by a wide range of distance metrics, including Euclidean, Manhattan, inverse Pearson's correlation and Bray-Curtis index.
- Set "Group/Color by" to Secondary Group (age group), also set "Symbol by" to Secondary Group
- Select default in the color palette
- Press "Draw Chart"

The figure illustrates that the samples are separated by age along the x-axis.
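The input to a PCoA is a matrix of pair-wise sample distances. The Python sketch below shows three of the distance metrics named in the steps above, using their standard definitions, and builds such a matrix. It is an illustration rather than Calypso's code, and the function names and sample profiles are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared taxa."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def distance_matrix(profiles, metric):
    """Symmetric matrix of pair-wise distances between all samples."""
    return [[metric(p, q) for q in profiles] for p in profiles]


# Hypothetical count profiles of three samples over three taxa
profiles = [[10, 0, 5], [5, 5, 5], [0, 12, 3]]
for row in distance_matrix(profiles, bray_curtis):
    print(row)
```

The choice of metric matters: Euclidean and Manhattan distances are sensitive to differences in sequencing depth between samples, whereas Bray-Curtis relates the differences to the total counts of the two samples.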

### Redundancy analysis (RDA+)

RDA+ includes all factors defined in the meta annotation file.

- Set level to OTU
- Set type to RDA+
- Press "Select mode"
- Set "Group by" to "Primary Group" (location)
- Set colour to yellowblue2
- Set filter to 50; only the top 50 OTUs will be included to speed up processing/waiting time
- Press Draw Chart

RDA is a multivariate method that is used to explore complex associations between community composition and multiple factors. All factors defined in the meta annotation file are included in the multivariate analysis. Two figures are generated. The first shows a 2D ordination plot, indicating how well samples can be separated according to the defined factors. Samples will be coloured by the variable selected under "Group/Colour by". The p-values reported in the second figure indicate if each factor is significantly associated with variation in the data matrix (i.e. if the factor significantly explains variation in sample distances).

The following result is obtained for the example dataset. Location (Malawi, Venezuela, USA), age, BMI and gender are significantly associated with variation in gut microbial composition (p<0.05).

### Support vector machine classification (SVM)

The discriminatory power of the uploaded data to distinguish between two sample groups (e.g. cases vs control) can be examined using a Support Vector Machine evaluated by leave one out cross validation. Or in other words, SVM leave one out cross-validation (SVM LOOC) can be used to assess if microbial community composition is predictive of sample groups. The classification performance is described by overall accuracy, sensitivity and specificity.

For example, using the demo project, SVM LOOC can be employed to examine if the gut microbiota is predictive of age (e.g. infants vs adults) or location (e.g. rural vs metropolitan).

The selected sample group must have exactly two different values, e.g. case/control, protected/unprotected, male/female.

SVM LOOC works iteratively. In each step, one sample is excluded and an SVM is trained to discriminate between two classes. The trained SVM is then applied to predict the class of the excluded sample. This is repeated iteratively for each sample. Finally, the predicted class is compared with the known class of each sample to calculate the classification accuracy (percentage of correctly predicted class labels), sensitivity (TP/(TP+FN)) and specificity (TN/(TN+FP)), where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives.
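The leave-one-out loop and the three performance measures can be sketched as follows. To keep the Python example self-contained, a simple nearest-centroid classifier stands in for the SVM, so this shows the evaluation scheme and not Calypso's classifier. The function names and data are hypothetical, and specificity is computed with its usual definition TN/(TN+FP).

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier (NOT an SVM): assign the class with the closest mean profile."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(sorted(centroids), key=lambda c: squared_distance(centroids[c], sample))


def leave_one_out(x, y, positive):
    """Leave-one-out cross-validation for a two-class problem."""
    tp = fn = tn = fp = 0
    for i in range(len(x)):
        # Train on all samples except sample i, then predict sample i
        train_x, train_y = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        predicted = nearest_centroid_predict(train_x, train_y, x[i])
        if y[i] == positive:
            tp, fn = tp + (predicted == positive), fn + (predicted != positive)
        else:
            fp, tn = fp + (predicted == positive), tn + (predicted != positive)
    return {
        "accuracy": (tp + tn) / len(x),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical abundances of two OTUs; the last metropolitan sample looks rural
x = [[10, 1], [9, 2], [11, 1], [1, 10], [2, 9], [1, 11], [10, 2]]
y = ["rural"] * 3 + ["metro"] * 4
result = leave_one_out(x, y, positive="rural")
print(result)
```

Because each sample is predicted by a model that never saw it, the reported accuracy estimates how well the classifier would generalise to new samples.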

#### Tutorial

- Set level to OTU
- Set type to SVM LOOC
- Press "Select mode"
- Set "Group/Colour by" to "Primary group" to run an SVM LOOC for the primary group variable (rural vs metropolitan subjects).
- Set Filter to 20. SVM LOOC is a time consuming analysis.
- Press "Draw Chart"

The top 20 OTUs are able to predict the location with 88% accuracy. 92% of the samples collected in rural (Malawi, Venezuela) locations were correctly classified into this class (sensitivity = 0.92) and 88% of the samples obtained from metropolitan (USA) individuals were correctly classified into this class (specificity = 0.88).

### Non-metric multidimensional scaling (NMDS)

Non-metric multidimensional scaling produces an ordination based on a distance or dissimilarity matrix. NMDS attempts to represent, as closely as possible, the pairwise dissimilarity between objects in low-dimensional space. In Calypso, NMDS is implemented in R using the vegan metaMDS() function. Pair-wise sample distances are computed using the Bray-Curtis dissimilarity. See GUSTA ME for more details.

### Anosim

Anosim uses dissimilarity matrices to test if sample groups are significantly different (i.e. if they have different community profiles as measured by the selected distance metric) [3]. Anosim provides a single p-value indicating if community profiles are significantly different between sample groups. The p-value is calculated by comparing intra-group distances with between-group distances.
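The comparison of within-group and between-group distances can be illustrated with the standard ANOSIM statistic R = (r_B - r_W) / (M/2), where r_B and r_W are the mean ranks of between-group and within-group distances and M is the number of sample pairs. The Python sketch below follows this textbook definition with a permutation p-value. It is not Calypso's code; the function names, the permutation count and the data are hypothetical.

```python
import random


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def anosim_r(dist, groups):
    n = len(groups)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)


def anosim(dist, groups, permutations=999, seed=0):
    """Return the R statistic and a permutation p-value."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical samples on a line: two tight, well separated groups
positions = [0, 1, 2, 10, 11, 12]
groups = ["a", "a", "a", "b", "b", "b"]
dist = [[abs(p - q) for q in positions] for p in positions]
r_stat, p_value = anosim(dist, groups)
print(r_stat, p_value)
```

R close to 1 means that all between-group distances exceed the within-group distances, while R near 0 indicates no group structure. With only six samples few distinct label permutations exist, so the p-value cannot become very small.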

### Adonis (permutational manova (PERMANOVA))

Adonis is a multivariate technique analogous to MANOVA and describes if variation in community composition can be attributed to different experimental treatments or control variables (see vegan R package for more details). Adonis describes if the community composition is different between groups. Adonis-F describes if variance in community composition can be attributed to primary groups, secondary groups and pair. Adonis+ describes if variance in community composition can be attributed to the different environmental variables.

### DCA

Detrended correspondence analysis (DCA) is an ordination method widely used in ecology. The method is related to correspondence analysis but avoids the "arch effect". DCA is implemented in R using the vegan decorana() function. [4][5][6]

### DAPC

Discriminant Analysis of Principal Components (DAPC) is a multivariate method designed to identify and describe clusters of genetically related individuals (Jombart et al., BMC Genetics 2010). DAPC is implemented in R using the adegenet dapc() function.

### Permdisp2

Tests whether the multivariate dispersion (the spread of samples around their group centroid) differs significantly between groups. PERMDISP2 visualizes the distances of each sample to the group centroid in a PCoA and provides a p-value for the difference in dispersion between groups. Permdisp2 is implemented in R using the vegan betadisper() function.

### Network

Samples are represented as nodes. Samples with a similar community composition are connected by edges (similarity > Edge Min Similarity). Adjust the Edge Min Similarity parameter to increase or decrease the number of connections. The dissimilarity measure is selected via "Distance Method". Recommendation: Bray-Curtis.

## Div: Analysis of microbial diversity

Diversity of microbial communities can be visualized as boxplots, stripcharts, barcharts and rarefaction curves. Different diversity measures are integrated into Calypso. "Shannon index" measures the overall diversity of each community, including both number of present taxa/OTUs and evenness. "Richness" measures the number of present taxa/OTUs. "Evenness" measures how evenly abundant the present taxa/OTUs are.

- Open the Diversity Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to "OTU"
- Set index to Shannon and press "Draw Chart"

The figure shows an increased diversity with age. The p-value above the stripchart confirms that the change is significant.
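
As a rough illustration of the indices offered on this page, the following Python sketch computes richness, the Shannon index, an evenness value, and the expected richness at a common sequencing depth (the depth of the smallest sample). Calypso itself computes these in R. The Pielou form of evenness (H / ln S), the hypergeometric rarefaction formula, the toy counts, and all function names are assumptions made for this example.

```python
import math


def richness(counts):
    """Number of taxa/OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over present taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def evenness(counts):
    """Pielou evenness H / ln(S); assumed definition, 0 for a single taxon."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def rarefied_richness(counts, depth):
    """Expected number of taxa in a random subsample of `depth` reads."""
    total = sum(counts)
    all_draws = math.comb(total, depth)
    return sum(1 - math.comb(total - c, depth) / all_draws for c in counts if c > 0)


# Hypothetical read counts per taxon for two samples.
samples = {"S1": [50, 30, 20, 0], "S2": [400, 50, 30, 20]}
min_total = min(sum(c) for c in samples.values())
for name, counts in samples.items():
    print(name, richness(counts), round(shannon(counts), 3),
          round(evenness(counts), 3),
          round(rarefied_richness(counts, min_total), 3))
```

Rarefying to the smallest sample puts richness estimates on a common footing, because deeper-sequenced samples would otherwise tend to show more taxa.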

### Example

### Details

"Shannon index" measures overall diversity, including both number of present OTUs and evenness. "Richness" measures the number of present OTUs. "Evenness" measures how evenly abundant the present OTUs are. "AbundancePlot" displays the size distribution (relative number of assigned reads) of OTUs. Error bars in stripcharts visualize standard deviation. As species richness depends on sample size, richness is estimated by rarefaction analysis. Expected species richness is calculated from random subsamples of size minTotal from the community, where minTotal is the total number of reads of the smallest sample. Richness is computed using the rarefy() R function of the vegan package.

#### mcpHill

mcpHill simultaneously investigates several diversity measures by unifying them in one and the same mathematical family of indices. These families (Hill numbers) represent a variety of useful diversity indices. Families with q<0 emphasise rare species and families with q>2 emphasise abundant species (q represents the order of Hill numbers). The advantage of this procedure is that a researcher does not have to commit to a particular diversity measure but instead examines multiple indices simultaneously. P-values are corrected for multiple testing. The method is implemented in R using the mcpHill() function of the simboot package: mcpHill(data=counts,fact=group,boot=1000, mattype="Tukey"), where counts is the counts data matrix (number of sequence reads assigned to each taxon) and group represents the selected sample grouping. More details can be found in: Pallmann et al, Assessing group differences in biodiversity by simultaneously testing a user-defined selection of diversity indices, 2012.
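
How the order q shifts the emphasis between rare and abundant taxa can be seen by computing Hill numbers directly. The Python sketch below uses the common definition D_q = (sum p_i^q)^(1/(1-q)), with the limit exp(Shannon index) at q = 1. It does not reproduce the bootstrap-based group comparison performed by mcpHill(); the function name and the counts are made up for illustration.

```python
import math


def hill_number(counts, q):
    """Hill number (effective number of taxa) of order q."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if abs(q - 1.0) < 1e-12:
        # The limit for q -> 1 is the exponential of the Shannon index.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


# Hypothetical read counts for one sample.
counts = [50, 30, 15, 5]
profile = {q: hill_number(counts, q) for q in (-1, 0, 1, 2, 3)}
for q, value in profile.items():
    print(q, round(value, 3))
```

For q = 0 the result equals the richness, and for q = 2 it equals the inverse Simpson index, so a single profile over several values of q covers a range of familiar indices.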

## Regression: Multivariate linear regression

The Regression Page facilitates multivariate linear regression. Multivariate linear regression is a powerful technique that can identify complex associations between community composition and multiple factors. Multiple co-variates can be included. Additional information can be found here.

#### Identify taxon-factor associations by multivariate regression

- Open the Regression Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to *Genus*
- Set *Regress by* to *Taxa vs Envp* and press *Set Mode*
- Set filter to 50 to restrict the analysis to the top 50 genera with the highest mean across samples
- Press *Run Analysis*

The displayed table shows associations between multiple factors (as defined in the meta information file) and taxa abundance. For each taxon a regression model is fit, including the taxon as dependent variable and all factors as explanatory variables:

Taxon = fa1 + fa2 + fa3 …,

where fa1, fa2, ... are factors. P-values are shown for each factor-taxa combination, indicating the significance of associations. Click on the header to sort the displayed table, e.g. by p-value.

The following result table is calculated for the demo project. The table was sorted by p-values obtained for factor age group (by clicking on Age.p in the header of the table). Ten genera are significantly associated with age (p<0.05), indicated in blue. In total, five bacterial groups on genus level are still significantly associated with age group after correction for multiple testing by FDR (column Age.p.fdr, red box). Significant associations for other factors can be viewed by re-sorting the table.
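
The effect of the FDR correction on a column of p-values can be reproduced with a few lines of code. The Python sketch below assumes the Benjamini-Hochberg procedure (the method behind the "fdr" option of R's p.adjust()); whether Calypso uses exactly this variant is an assumption, and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genera.
raw = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(raw)
print([round(p, 4) for p in fdr])
```

A genus can pass the raw threshold of 0.05 and still fail after correction, which is why fewer genera remain significant in the FDR column than in the raw p-value column.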

#### Explore associations between taxa and selected factor by multivariate regression

- Open the Regression Page. Make sure that the Demo Project has been loaded first, as described above.
- Set *Regress by* to *Age* to explore associations between age and abundance of bacterial genera
- Press *Set Mode*
- Press *Run Analysis*
- Click on "P" in the generated table to order data by p-value

For each genus the results table lists the correlation *R* between age and abundance of that genus. A p-value is given, indicating the statistical significance of the observed correlation. Additionally, the table presents the mean abundance across all samples and the number of positive samples, where the genus has been detected (abundance >0).

The top genera with the highest abs(R) (absolute value of the correlation coefficient) are selected and the association of these genera with age is examined by multivariate regression. This model incorporates age as dependent variable and the top genera as explanatory variables: age ~ genus 1 + genus 2 + ….

Results are presented as tables and figures. For each included genus the p-value computed by multivariate regression is reported, indicating if the genus is significantly associated with age. Additionally, a scatter plot is shown, plotting the abundance of the genus (x-axis) versus the age (y-axis). The plots show that Bifidobacteria decreases with age, whereas Oscillospira increases.

#### Identify diversity-factor associations by multivariate regression

- Open the Regression Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to "OTU"
- Set "Regress by" to "Diversity vs Envp" and press "Set Mode"
- Set index to Shannon
- Set filter to 50 to restrict the analysis to the top 50 OTUs with the highest mean across samples
- Press "Run Analysis"

The table and figures display associations between the community diversity (Shannon index) and multiple factors (as defined in the meta information file). A regression model is fit, including the diversity as dependent variable and all factors as explanatory variables:

diversity = fa1 + fa2 + fa3 …, where fa1, fa2, ... are factors.

P-values indicate statistical significance of associations.

For numeric factors, scatterplots visualize the correlation of the factor with diversity. The p-value indicates the significance of the Pearson correlation between the community diversity and the factor. In the second scatterplot, the diversity index is controlled for all remaining factors (the controlled correlation is called "partial correlation"). For example, the association between diversity and gender is controlled for BMI, age and location (by linear regression).
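
The idea of controlling a correlation by linear regression can be shown with a single covariate: regress both variables on the covariate and correlate the residuals. The Python sketch below is a simplification (Calypso controls for all remaining factors at once, in R), and the data and function names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def residuals(y, x):
    """Residuals of a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


def partial_correlation(x, y, covariate):
    """Correlation of x and y after removing the linear effect of the covariate."""
    return pearson(residuals(x, covariate), residuals(y, covariate))


# Hypothetical values for six samples.
age = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
diversity = [2.0, 2.4, 2.1, 3.0, 3.2, 3.1]
bmi = [20.0, 22.0, 21.0, 25.0, 24.0, 27.0]
print(round(pearson(age, diversity), 3))
print(round(partial_correlation(age, diversity, bmi), 3))
```

Comparing the two printed values shows how much of the raw correlation is shared with the covariate.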

For discrete factors, boxplots along with an ANOVA-test describe the significance of the factor.

The table below indicates that age and location are significantly associated with Shannon diversity. The second scatterplot illustrates a positive correlation between age and Shannon index (corrected for the remaining factors gender, location). The correlation is significant (Pearson correlation, p=0.024).

### Details

### Examples

## Network: Network analysis

Network analysis visualises correlations (edges) between taxa (nodes). Positive correlations are shown in yellow, negative correlations in blue. The network analysis identifies co-occurring bacteria, mutually exclusive bacteria and clusters of co-occurring bacteria.

Correlations are measured by Pearson's or Spearman correlation. "Network+" also includes numeric environmental variables. Select "Layout By Correlation" to layout nodes based on a PCoA.

#### Tutorial

- Open the Network Page. Make sure that the Demo Project has been loaded first, as described above.
- Set Type to Network+ to do a network analysis of features and all factors defined in the uploaded meta information file.
- Set color to select a black or white background colour
- Correlation coefficient can be set to either Pearson's correlation or Spearman coefficient
- Press Draw Chart

Only correlations larger than Edge Min Similarity or smaller than -1 * Edge Min Similarity are presented as edges.
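
The edge rule can be sketched as follows in Python. The example computes Pearson or Spearman correlations between taxa (Spearman as Pearson on ranks) and keeps only pairs beyond the threshold, labelled with the colours used in the figure. The abundance values, the threshold of 0.8, and all function names are hypothetical, and the graph layout is not covered.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    ranks = []
    for v in values:
        smaller = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        ranks.append(smaller + (equal + 1) / 2)
    return ranks


def correlation_edges(abundance, min_similarity, method="pearson"):
    """Edges for taxon pairs with r > threshold (yellow) or r < -threshold (blue)."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        x, y = abundance[a], abundance[b]
        if method == "spearman":
            x, y = average_ranks(x), average_ranks(y)
        r = pearson(x, y)
        if r > min_similarity:
            edges.append((a, b, r, "yellow"))
        elif r < -min_similarity:
            edges.append((a, b, r, "blue"))
    return edges


# Hypothetical abundances of four genera across five samples.
abundance = {
    "Bacteroides": [1, 2, 3, 4, 5],
    "Prevotella": [2, 4, 6, 8, 11],
    "Bifidobacterium": [9, 7, 6, 3, 1],
    "Blautia": [3, 1, 4, 1, 5],
}
edges = correlation_edges(abundance, min_similarity=0.8)
for a, b, r, colour in edges:
    print(a, b, round(r, 3), colour)
```

Raising the threshold removes weaker edges, so the network becomes sparser and only the strongest co-occurrence and exclusion patterns remain.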

The following network is obtained for the demo project. Yellow edges indicate positive correlations and blue edges negative correlations. Light blue nodes describe features and red nodes factors. The figure indicates clustering according to location.

### Example

## BM: Biomarker discovery

Calypso provides simple yet powerful methods for biomarker discovery. Predictive biomarkers associated with two sample groups (e.g. cases and controls; responders and non-responders) are identified by t-test, Wilcoxon rank test, nested anova, logistic regression or the random forest classifier. The discriminatory power of biomarker candidates is described by the area under the ROC curve (AUC), odds ratio, delta (difference in means in units of standard deviation) or fold change.

Biomarker discovery is only possible for variables/factors with exactly two different values (e.g. cases and control).

#### Tutorial

- Open the Biomarker Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to *Genus*
- Set *Group by* to *Primary Group*
- Set filter to 30 to only include the top 30 features with highest mean value across samples
- Press *Draw Chart*

For each feature a p-value, adjusted p-value (FDR, Benjamini-Hochberg and Bonferroni), Area Under the ROC Curve (AUC with 95% upper and lower
confidence intervals), odds ratio (with 95% upper and lower confidence intervals), delta
(difference in mean divided by standard deviation), and fold change are calculated. Odds ratios
are visualized as forest plots. Together these values indicate if each feature is a potential
predictive biomarker for classifying samples into the two classes.
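
Three of these measures are simple enough to compute by hand, which can help when interpreting the result table. The Python sketch below computes the AUC as the share of case/control pairs in which the case has the higher value (ties count half), delta using a pooled standard deviation, and the fold change as a ratio of means. The choice of the pooled standard deviation, the toy data, and the function names are assumptions; confidence intervals and odds ratios are not covered.

```python
from statistics import mean, stdev


def auc(cases, controls):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    wins = 0.0
    for a in cases:
        for b in controls:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def delta(cases, controls):
    """Difference in means divided by the pooled standard deviation (assumed)."""
    n1, n2 = len(cases), len(controls)
    pooled_var = (((n1 - 1) * stdev(cases) ** 2 + (n2 - 1) * stdev(controls) ** 2)
                  / (n1 + n2 - 2))
    return (mean(cases) - mean(controls)) / pooled_var ** 0.5


def fold_change(cases, controls):
    """Ratio of the group means."""
    return mean(cases) / mean(controls)


# Hypothetical abundances of one genus in cases and controls.
cases = [5.0, 6.0, 7.0, 8.0]
controls = [2.0, 3.0, 4.0, 6.0]
print(auc(cases, controls), round(delta(cases, controls), 3),
      round(fold_change(cases, controls), 3))
```

An AUC of 0.5 means the feature separates the groups no better than chance, whereas values near 0 or 1 indicate strong separation in one direction or the other.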

If test is set to *LogisticRegression*, p-values are calculated by logistic regression, incorporating the
selected group as dependent variable and all factors as
explanatory variables (group ~ feature + factor 1 + factor 2 + factor 3 + ...). All factors defined in the meta annotation file are included.
AUC and odds ratio are also adjusted for all
factors. In detail: odds ratios are calculated by logistic regression, with all factors incorporated as explanatory variables.

The following plot shows the forest plot obtained for the demo project.

## Hierarchy: Use taxonomy reference for visualizations

This Calypso feature is only enabled when a taxonomy file has been uploaded.

### Generate taxonomic trees

#### Tutorial

- Open the Hierarchy Page. Make sure that the Demo Project has been loaded first, as described above.
- Set type of visualization to Dendogram
- Press "Select Mode"
- Set "Group by" to "Primary group"
- Press Draw Chart
- Click on a node to expand or hide subtrees
- Press "DownloadPNG" below the figure to save the image

### Krona plots

#### Tutorial

- Open the Hierarchy Page. Make sure that the Demo Project has been loaded first, as described above.
- Select *Krona* as visualization type
- Press *Select Mode*
- Set *Group by* to *Primary group*
- Press *Draw Chart*
- Click on *Click here* to open Krona in a new window.
- A new browser window will open displaying Krona pie charts of community composition
- Select *Metropolitan* in the selection list (left upper corner)
- Decrease the max depth by clicking on "-" to view the taxonomic level family
- Increase the font size by clicking on "+"
- Enable the *Collapse* checkbox to combine taxonomic levels, i.e. taxa with only one child will be merged
- Double-click *Firmicutes* in the pie chart to set this phylum as the root
- Click on *all* in the centre of the Krona plot to view the complete taxonomic hierarchy
- Click *Snapshot* to export the Krona plot
- Save the Krona Chart by right-clicking *Save Page As...*

More information about Krona visualizations can be found on the Krona website.

### Example

### Details Hierarchy Plots

## Paired: Paired analysis

The Pairwise Page facilitates paired statistical analysis. Paired analysis makes use of paired study designs, where several samples were taken from the same individual (e.g. before and after treatment) or from MZ twin pairs in a case/control study. Comparisons are done by paired t-test or paired Wilcoxon rank test.

### Details

## Norm: Examine effects of data normalisation/transformation

The Norm Page allows you to examine the effects of different normalisation and transformation methods on the data distribution. Raw data (as uploaded to Calypso) and transformed data are shown. This does not have any effect on the data used by other Calypso pages.

## FA: Factor Analysis

The FA Page facilitates factor analysis, a data reduction and structure detection method. Factor analysis is a statistical method used to reduce the number of variables and to identify structures in the relationships of variables. The aim is to reduce a large and complex dataset to a low number of factors without losing relevant information.

Factor analysis describes the variability among observed, correlated variables in terms of a lower number of unobserved variables called factors. Assume a dataset with hundreds or thousands of variables and only a few samples. Most likely many variables are highly correlated and contain redundant information. The basic idea of factor analysis is to combine multiple correlated variables into one representative factor. The original data is then described by the lower number of factors, which approximate the original variables.

Factor analysis is implemented in R using the nmf() function from the NMF package. The default algorithm is Brunet. The rank (number of factors) is set by the user.
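
To give an impression of what a non-negative matrix factorisation does, the Python sketch below factorises a small matrix V into W (samples x factors) and H (factors x variables) using the multiplicative Kullback-Leibler updates on which the Brunet algorithm is based. It is a bare-bones illustration without the refinements of the R NMF package; the matrix, the iteration count, and all names are hypothetical.

```python
import random


def nmf_kl(v, rank, iterations=200, seed=0):
    """Approximate v (n x m, non-negative) by w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    eps = 1e-9  # guards against division by zero
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]

    def ratio():
        # Element-wise v / (w h) for the current factors.
        return [[v[i][j] / (sum(w[i][k] * h[k][j] for k in range(rank)) + eps)
                 for j in range(m)] for i in range(n)]

    for _ in range(iterations):
        q = ratio()
        for k in range(rank):
            col_sum = sum(w[i][k] for i in range(n)) + eps
            for j in range(m):
                h[k][j] *= sum(w[i][k] * q[i][j] for i in range(n)) / col_sum
        q = ratio()
        for k in range(rank):
            row_sum = sum(h[k][j] for j in range(m)) + eps
            for i in range(n):
                w[i][k] *= sum(q[i][j] * h[k][j] for j in range(m)) / row_sum
    return w, h


# Hypothetical data: four samples, four variables, two underlying patterns.
v = [[5, 5, 1, 1], [4, 4, 1, 1], [1, 1, 3, 3], [1, 1, 6, 6]]
w, h = nmf_kl(v, rank=2)
approx = [[sum(w[i][k] * h[k][j] for k in range(2)) for j in range(4)]
          for i in range(4)]
for row in approx:
    print([round(x, 2) for x in row])
```

The rank plays the role of the number of factors: a larger rank reproduces the data more closely, while a smaller rank gives a more compact summary.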

## General options

The following table describes general Calypso options that are present on most analysis pages.

Field | Description |
---|---|
Figure Format | The file format of generated figures (PNG, PDF or SVG). |
Level | The taxonomic level (if provided: superkingdom, phylum, class, order, genus, species) or OTU |
Order | The order of samples in the generated figures. Samples can be ordered by their primary group, secondary group, pair, or label. |
Filter | Selects how many of the top features (highest mean across samples) are included. Set to 0 to include all taxa. |
Color | Color palette for plots |
Secondary Group | Used to filter samples by their secondary group. Select "All" to include all samples. |
Distance | Distance metric for computing pair-wise distances of samples. |
Resolution | Resolution of the generated plot in dpi. Allowed range: 20-1000 |
Width | Width of generated plot in mm. |
Height | Height of generated plot in mm. |

## Implementation

The Calypso web-frontend is implemented in Java using the JavaServer Faces architecture. Interactive views are facilitated by the JavaScript library D3.js. The backend is implemented in Perl and the R statistical programming language. No installation, configuration, registration or login is required. Data is kept private and cannot be viewed by other users. Uploaded data and calculated results are deleted after the user's session terminates.

## FAQ

### Error message: An internal error occurred, likely you either didn't upload a data and meta data file or your session has expired

Before using Calypso, you need to upload a data and meta data file. Select Home in the top menu and upload a data and meta data file or press "Start Demo Project". Make sure that you have activated JavaScript in your browser settings and allow cookies. If you receive this error message after pressing "Start Demo Project", your browser does not allow cookies.

You also receive this error message if your session has expired. Your session expires if you haven't used Calypso for more than 60 minutes.

### Figure labels are only partially displayed

Increase the figure width and height.

### CCA changes if "Color by" is changed, but CCA+ does not

CCA provides a p-value indicating whether variance observed in the community composition can be explained by the sample groups. The sample grouping can be selected by "Color by". CCA+ describes if variance in the community composition can be attributed to environmental variables. CCA+ does not include the sample groups in the statistical analysis. However, samples are colored by the grouping selected under "Color by".

### Figure labels overlap

Increase the figure width and height.

### Figure legend overlaps with chart

Increase the figure width and height.

### Figures are completely white

Solution: Reduce resolution of figure or increase figure width and height.

### Error message: Internal ERROR: null dataMatrix

Likely you forgot to upload a data and/or meta data file. Select Home in the top menu and upload a data and meta data file.

### Error message: j_id_id15:resolution: Validation Error: Value is not of the correct type

You have entered an incorrect value in one of the text fields. Valid value ranges are:

- Resolution: 20-1000
- Width: 20-10000
- Height: 20-10000
- Min proportion: 0-100, real numbers are supported, e.g. 0.2

### The figure in Network+ does not contain my environmental variables

Please check if your annotation file includes the environmental variables. Calypso treats column 7 and all following columns of the meta data file as environmental variables.

### Warning: no environmental variables provided

Please check if your meta data file includes the environmental variables. Calypso treats column 7 and all following columns of the meta data file as environmental variables. Without environmental variables, methods such as CCA+ and RDA+ cannot be executed.