Calypso Help
Calypso is an easy-to-use online software, allowing non-expert users to mine, interpret and compare taxonomic information from metagenomic or 16S rDNA datasets. The software is free for academic use.
Reference:
Zakrzewski M, Proietti C, Ellis J, Hasan S, Brion MJ, Berger B, Krause L (2016) Calypso: A User-Friendly Web-Server for Mining and Visualizing Microbiome-Environment Interactions. Bioinformatics [1].
Contents
- 1 Calypso Forum
- 2 Overview of Data-Mining and Statistical Analysis in Calypso
- 2.1 Data normalization and transformation
- 2.2 Quality control
- 2.3 Quantitative representation of microbial composition data
- 2.4 Cluster analysis and sample ordination
- 2.5 Identification of microbiome-environment associations
- 2.6 Identification of associations between the environment and abundance of individual taxa
- 2.7 Biomarker discovery
- 3 Tutorial
- 3.1 Input data
- 3.2 Start demo project
- 3.3 Upload own data files
- 3.4 Calypso output formats
- 3.5 Main navigation menu
- 3.6 Summary
- 3.7 Rarefaction analysis
- 3.8 Sample: Visualise microbial community composition
- 3.9 Group: Compare taxa abundance across sample groups
- 3.10 Visualize abundance of individual taxa
- 3.11 Stats: Statistical comparison of sample groups
- 3.12 Multivariate analysis
- 3.12.1 Correlation heatmap
- 3.12.2 Principal component analysis (PCA)
- 3.12.3 Canonical correspondence analysis (CCA)
- 3.12.4 Principal coordinates analysis (PCoA)
- 3.12.5 Redundancy analysis (RDA+)
- 3.12.6 Support vector machine classification (SVM)
- 3.12.7 Non-metric multidimensional scaling (NMDS)
- 3.12.8 Anosim
- 3.12.9 Adonis (permutational manova (PERMANOVA))
- 3.12.10 DCA
- 3.12.11 PLS
- 3.12.12 DAPC
- 3.12.13 Permdisp2
- 3.12.14 Network
- 3.13 Diversity: Analysis of microbial diversity
- 3.14 Regression: Multivariable linear regression
- 3.14.1 Identify microbiome-environment associations by multiple regression
- 3.14.2 Explore associations between taxa and selected explanatory variable by multiple regression
- 3.14.3 Identify associations between microbial diversity and multiple explanatory variables by multiple regression
- 3.14.4 Details
- 3.15 Analysis of longitudinal data (time series; repeated measures)
- 3.16 Network: Network analysis
- 3.17 mixMC: mixOmics microbial community studies
- 3.18 Biomarker discovery
- 3.19 Hierarchy: Use taxonomy reference for visualizations
- 3.20 FS: Feature Selection
- 3.21 Paired: Paired analysis
- 3.22 Norm: Examine effects of data normalisation/transformation
- 3.23 FA: Factor Analysis
- 3.24 General options
- 3.25 Implementation
- 3.26 FAQ
- 3.26.1 How to cite Calypso
- 3.26.2 Error message: An internal error occurred, likely you either didn't upload a data and meta data file or your session has expired
- 3.26.3 Figure labels are only partially displayed
- 3.26.4 CCA changes if "Color by" is changed, but CCA+ does not
- 3.26.5 Figure labels overlap
- 3.26.6 Figure legend overlaps with chart
- 3.26.7 Figures are completely white
- 3.26.8 Error message: Internal ERROR: null dataMatrix
- 3.26.9 Error message: j_id_id15:resolution: Validation Error: Value is not of the correct type
- 3.26.10 The figure in Network+ does not contain my environmental variables
- 3.26.11 Warning: no environmental variables provided
Calypso Forum
In addition to the information found on this help wiki, the Calypso User Group provides a public forum for asking questions, searching previous questions, and sharing tips regarding Calypso. Post to the forum if you have any questions regarding Calypso, including analysis methods, interpretation of results, parameters, data pre-processing, bug-reports, or suggestions for improvements.
Overview of Data-Mining and Statistical Analysis in Calypso
Data normalization and transformation
Calypso provides various transformation methods to account for the generally non-normal distribution of microbial community composition data. To render the data suitable for analysis by standard statistical procedures, community profiles can be transformed in Calypso by log, total sum normalization (TSS), asinh, square root, quantile normalization, and variance stabilization and normalization for microarray data (vsn). Sequencing based community profiling yields a special data type called “compositional data”, which is characterized by specific intrinsic properties that can deteriorate statistical analysis. In particular, the measured relative abundance of microbial taxa depends on the abundance of all other taxa. To remove the non-independence of relative bacterial abundance, Calypso facilitates data transformation by centered log ratio, which is one of the most widely used transformations for compositional data. The effects of transformation and normalization methods on the data distribution can be visualized in Calypso using boxplots and scatterplots.
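As a rough illustration of two of the transformations named above, the following Python sketch applies total sum normalization (TSS) and the centered log ratio (CLR) to a single sample. It is not Calypso's implementation; the function names, the example counts and the pseudocount of 0.5 (used to avoid taking the logarithm of zero) are assumptions made for this example.

```python
import math

def total_sum_scale(counts):
    """TSS: divide each count by the sample total so the values sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]

def centered_log_ratio(counts, pseudocount=0.5):
    """CLR: log of each value minus the mean log across all taxa of the sample.

    The pseudocount is an assumption of this sketch; it avoids log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

sample = [120, 30, 0, 850]  # hypothetical read counts for four taxa
tss = total_sum_scale(sample)
clr = centered_log_ratio(sample)
```

TSS values always sum to 1, which is exactly the compositional constraint described above. CLR values sum to 0 within a sample and express each taxon relative to the geometric mean of the sample.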
Quality control
Calypso enables the implementation of standardized quality control procedures for microbiome data. The distribution of sequences per sample can be plotted as a histogram and potentially problematic samples (outliers) can be detected by hierarchical clustering, Principal Components Analysis (PCA), or Principal Coordinate Analysis (PCoA). The coverage of the underlying microbial communities by metagenomic sequence reads can be estimated by rarefaction analysis. In a rarefaction analysis, microbial sequences are randomly drawn from each sample. For each subsample, the number of observed species is counted and plotted as a function of the number of sampled sequences. The slope of the rarefaction curve indicates if the underlying microbial community is well represented by the sequence data.
Quantitative representation of microbial composition data
Calypso provides a powerful toolbox for presenting microbial composition data quantitatively, including heatmaps, bubble plots, scatter plots, stripcharts, bar charts, and boxplots. Hierarchical relationships can be depicted using interactive Krona plots and taxonomic trees. Krona plots allow assessment of the hierarchical structure of microbial communities using interactive pie charts. Calypso implements a module for visualizing hierarchical relationships either as interactive dendrograms or interactive radial trees. Bar charts can be used to depict the abundance of each node (taxon) in each sample.
Cluster analysis and sample ordination
In Calypso, the unsupervised grouping of samples with similar community composition into clusters is achieved by hierarchical clustering, visualized either as a dendrogram or heatmap. Calypso provides powerful heatmaps for cluster identification, for visualizing microbial community composition, and for identifying associations between community composition and the environment. Heatmaps can be fine-tuned according to user preferences, for components such as the color palette, trimming of outliers, and the center value of the color palette. Meta information for each sample is displayed as color bars on top of heatmaps. Calypso provides several ordination methods that allow visualization of community composition data as 2D plots. Samples can be ordinated by the unsupervised methods principal component analysis (PCA), principal coordinates analysis (PCoA), detrended correspondence analysis (DCA), and non-metric multidimensional scaling (NMDS). All of these pattern-discovery methods are commonly used in microbial ecology for identifying clusters of samples with similar community composition and for identifying potentially problematic samples (outliers).
Identification of microbiome-environment associations
In Calypso, associations between microbial community composition and a single environmental variable can be identified by testing for homogeneity of variances (PERMDISP2), or by comparing intra-group and inter-group community distances using Anosim. Additionally, a wide range of multivariate methods is provided. These are powerful techniques for identifying complex environment-microbiota associations, in which differences in microbial composition can be attributed to multiple variables. Calypso implements the supervised multivariate methods redundancy analysis (RDA), canonical correspondence analysis (CCA), and permutational manova (PERMANOVA/Adonis). These methods test if variance in community composition can be explained by variance in multiple explanatory variables. A separate p-value is provided for each included variable. RDA is widely used in ecology and summarizes linear relationships between components of the community composition matrix and a set of explanatory variables. CCA and PERMANOVA (Adonis) rely on the comparison of community dissimilarities and can be run on UniFrac distances or using dissimilarity indices, such as the Bray-Curtis, Jaccard, Yue & Clayton, or Chao indices.
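To make the notion of a community dissimilarity concrete, here is a minimal Python sketch of two of the indices mentioned above. It is only an illustration and not the code Calypso runs: the function names are hypothetical, and the Jaccard variant shown is the presence/absence form (an assumption, since quantitative variants of the index also exist).

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles of equal length."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0

def jaccard(a, b):
    """Presence/absence Jaccard distance: 1 - shared taxa / taxa seen in either sample."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)
```

Both functions return 0 for identical samples and 1 for samples that share no taxa. Computing one of them for every pair of samples yields the kind of distance matrix that dissimilarity-based methods operate on.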
Identification of associations between the environment and abundance of individual taxa
Abundance of individual taxa can be compared in Calypso using parametric tests (Anova, nested Anova, t-test, paired t-test, Bayesian t-test) or non-parametric tests (Wilcoxon-rank test and Kruskal-Wallis test). Additionally, differences in taxa abundances can be identified using tests specifically developed for counts data: DESeq2, ANCOM, and ALDEx2. Calculated p-values are adjusted for multiple testing by Bonferroni correction and False-Discovery-Rate (FDR). Complex microbiome-environment interactions can also be examined using multiple linear regression. These are powerful techniques that facilitate identification of associations between individual taxa and multiple explanatory variables. Multivariable paired data can be analyzed by mixed effect regression, which incorporates the paired variable (e.g. subject, animal or cage) as a random effect and other explanatory variables (e.g. case/control or treatment) as fixed effects. These methods can distinguish between group-specific effects (e.g. case/control) and subject or cage-specific effects.
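The two multiple-testing adjustments mentioned above can be sketched in a few lines of Python. This is an illustration of the standard Bonferroni and Benjamini-Hochberg (FDR) procedures, not Calypso's own code; the function names are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Bonferroni controls the chance of any false positive and is therefore the stricter of the two; Benjamini-Hochberg controls the expected fraction of false positives among the taxa reported as significant.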
Biomarker discovery
Calypso provides methods for biomarker discovery, which identify bacterial taxa predictive of an outcome of interest (e.g. responders/non-responders, disease/healthy, high risk/low risk). The discriminatory power of microbial community profiles to distinguish between two biological conditions is characterized by a Support Vector Machine evaluated by leave-one-out cross-validation. The classification performance is described by overall accuracy, sensitivity and specificity. The discriminatory power of individual taxa can be assessed by area under the ROC curve (AUC), odds ratio, delta (difference in means in units of standard deviation), and fold change. Advanced feature selection methods in Calypso facilitate selection of the optimal subset of taxa predictive of an outcome of interest. These methods further allow the identification of relevant taxa associated with an explanatory variable. In Calypso, feature selection methods are based on the premise that many taxa are either redundant (highly correlated) or irrelevant, and can thus be removed without much loss of information. Calypso implements the widely used feature selection methods step-wise linear regression, LASSO regularized regression, and random forest. LASSO performs both feature selection and regularization to prevent overfitting. Taxa selected by stepwise regression or LASSO regularized regression are presented in Calypso as bar charts, where bars depict the importance of each taxon (Akaike information criterion [AIC] of the model if the taxon were dropped from the model, or absolute value of the t-statistic, respectively). Random forest identifies the subset of most relevant features by constructing a collection of decision trees. Variance is controlled by constructing trees incorporating only a random subset of the features, which in turn avoids overfitting.
In Calypso, the results of the random forest analysis are presented as a bar chart, where bars represent the relative importance of taxa, as estimated by random permutation.
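The per-taxon measures of discriminatory power listed above can be illustrated with a short Python sketch. It is not Calypso's implementation: the function names are hypothetical, the AUC is computed through its pairwise-comparison definition, and using the pooled standard deviation for delta is an assumption of this example.

```python
from statistics import mean, stdev

def auc(case_values, control_values):
    """Area under the ROC curve: the fraction of case/control pairs in which
    the case has the higher abundance (ties count as half)."""
    wins = 0.0
    for c in case_values:
        for k in control_values:
            if c > k:
                wins += 1
            elif c == k:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))

def fold_change(case_values, control_values):
    """Ratio of the mean abundance in cases to the mean abundance in controls."""
    return mean(case_values) / mean(control_values)

def delta(case_values, control_values):
    """Difference in means in units of the pooled standard deviation (assumption)."""
    n1, n2 = len(case_values), len(control_values)
    pooled_var = ((n1 - 1) * stdev(case_values) ** 2 +
                  (n2 - 1) * stdev(control_values) ** 2) / (n1 + n2 - 2)
    return (mean(case_values) - mean(control_values)) / pooled_var ** 0.5
```

An AUC of 0.5 means the taxon does not separate the two groups at all, while values close to 0 or 1 indicate strong separation in one direction or the other.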
Tutorial
Input data
As input Calypso requires a data matrix representing read counts (number of sequences assigned to each taxon or OTU) and an annotation file providing meta information for each sample. Optionally, a distance matrix and a reference taxonomy can be uploaded. Detailed information about the format of the input files can be found on the data upload page.
Data matrix file
The data matrix represents the frequency of each taxon in each sample. Usually the matrix represents the number of 16S rDNA sequences or the number of metagenomic reads assigned to each taxon (or OTU). Data can be pre-processed, e.g. normalised, rarefied or transformed. However, we recommend uploading raw (non-normalized) filtered read counts and normalizing the data in Calypso. Rarefaction analysis in Calypso only yields meaningful results if raw read counts are uploaded. Various file formats are supported, including the common biom format, which allows direct upload of pre-processed data generated by other analysis pipelines, such as QIIME, mothur, MG-RAST or MetaPhlAn. We recommend uploading data in biom format. The data upload wiki page provides detailed information on how to prepare, upload and normalise your own data files.
Metadata file
The metadata file provides meta information for each sample, including sample group (e.g. case/control). Additionally, multiple optional explanatory variables (also called environmental variables) can be provided, which are used for multivariate analysis and multiple regression. Explanatory variables can represent variables manipulated by the experimenter (e.g. case/control), confounding factors (e.g. age, gender, BMI), or other variables potentially associated with community composition. In Calypso, explanatory variables can be discrete (e.g. case/control, female/male, geography) or numeric (e.g. age, days after treatment, BMI, blood glucose level). Discrete variables must be defined using non-numeric values (e.g. male/female instead of 0/1). Numeric variables must only contain numbers and must not contain any non-numeric value (e.g. "unknown"). Missing values must be encoded as "NA". Complex associations between multiple explanatory variables and community composition can be inferred using multivariate statistical methods and multiple regression. The data upload wiki page provides detailed information on how to prepare, upload and normalise your own data files.
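Before uploading, it can help to check the explanatory variables against the rules above. The following Python sketch is one possible pre-upload check, not part of Calypso; the function names and the message wording are hypothetical.

```python
def is_number(text):
    """True if the text can be parsed as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def check_variable(values, kind):
    """Return a list of problems found in one explanatory variable.

    kind is "discrete" or "numeric"; "NA" marks a missing value.
    """
    if kind not in ("discrete", "numeric"):
        raise ValueError("kind must be 'discrete' or 'numeric'")
    problems = []
    for row, value in enumerate(values, start=1):
        if value == "NA":
            continue  # missing values are allowed in both kinds of variable
        if kind == "numeric" and not is_number(value):
            problems.append(f"row {row}: '{value}' is not a number")
        if kind == "discrete" and is_number(value):
            problems.append(f"row {row}: '{value}' is numeric; use a label instead")
    return problems
```

For example, `check_variable(["34", "unknown", "NA", "27.5"], "numeric")` reports the entry "unknown" in row 2, which would have to be replaced by "NA".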
Distance matrix
An optional matrix of pair-wise community distances can be uploaded, which facilitates data analysis using the UniFrac metric.
Taxonomy file
An optional reference taxonomy can be uploaded, which enables interactive visualization of the community composition as hierarchical trees.
Start demo project
Before taking this tutorial, please upload an example dataset. The tutorial provides links to Calypso web-pages. These links will not work if no data has been imported.
To start the demo project:
- Go to the Data Upload Page
- Press "Start Demo Project" to automatically upload an example data file, meta annotation file, taxonomy reference file and distance matrix.
As an example project, Calypso uses a 16S rDNA dataset previously published by Yatsunenko et al. In a cross-sectional study, Yatsunenko et al. analyzed the fecal microbiota of 531 individuals with respect to age, gender, geographic location, kinship and diet of the included infants (breast milk versus formula). Subjects had an age range of 3 months to 83 years and were sampled from metropolitan regions in the United States, rural communities in Malawi and Amerindian villages in Venezuela. Only a subset of samples and taxa were included in the example data files to reduce processing time when exploring Calypso using the example dataset.
Reference:
Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, Magris M, Hidalgo G, Baldassano RN, Anokhin AP, Heath AC, Warner B, Reeder J, Kuczynski J, Caporaso JG, Lozupone CA, Lauber C, Clemente JC, Knights D, Knight R, Gordon JI. Human gut microbiome viewed across age and geography. Nature. 2012 May 9;486(7402):222-7.
Demo project data matrix
The number of 16S sequences assigned to each bacterial taxon of the Yatsunenko et al. dataset was downloaded from the Gordon laboratory web-page at Washington University.
Download the example counts file.
Demo project meta data file
Metadata for the Yatsunenko et al. dataset were downloaded from the Gordon laboratory web-page at Washington University. Metadata included age, geographic location, gender, family id and diet during infancy.
The Calypso annotation file was created in Excel using the downloaded metadata. Pair (3rd column) represents the family id of each subject. Primary group (4th column) represents sample location (rural or metropolitan). The secondary group (5th column) represents the age group of each subject (0, 1, 2-3, 5-17, > 17 years of age).
The following explanatory variables were included:
- Gender: Male/Female/NA (not available)
- BMI: numeric value or NA (not available)
- Age: numeric value or NA (not available)
- Location: Malawi/Venezuela/USA
Download an example meta data file in V3 format or an example meta data file in V6 format.
Upload own data files
The data upload wiki page provides detailed information on how to prepare, upload and normalise your own data files.
Calypso output formats
Results in Calypso can be presented as figures, sortable tables, comma-separated text-files, interactive hierarchical trees, or Krona diagrams. Users can exclude samples, filter bacterial groups, select colors and change figure properties (resolution and dimensions). Publication-quality images can be generated in either PNG, PDF or SVG format. SVG is an XML-based vector graphics format, and images in this format can be edited in vector graphics editors, such as Inkscape. This allows post-processing of generated figures to change colors and font sizes, or to add labels or features.
Main navigation menu
Calypso is accessible via a free web-server, which provides a user-friendly interface to advanced statistical methods and data-mining algorithms. The top menu provides links to the various data analysis pages and a help page.
Summary
The Summary page (Sum) provides a basic descriptive overview of your data, in particular the number of sequence reads per sample.
Displayed are: The number of reads per sample, reads per taxon or OTU, and the number of samples in which each taxon/OTU has been detected.
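The three summaries listed above are simple aggregations of the data matrix. The Python sketch below shows how they could be computed for a tiny, made-up count table; the taxon names, sample names, counts and function names are all hypothetical and only serve the example.

```python
# Hypothetical count table: taxon -> read counts in samples S1, S2, S3
samples = ["S1", "S2", "S3"]
counts = {
    "Bacteroides": [120, 0, 45],
    "Prevotella": [0, 0, 300],
    "Blautia": [15, 60, 5],
}

def reads_per_sample(counts, samples):
    """Total number of reads in each sample (column sums)."""
    return {s: sum(row[j] for row in counts.values()) for j, s in enumerate(samples)}

def reads_per_taxon(counts):
    """Total number of reads assigned to each taxon (row sums)."""
    return {taxon: sum(row) for taxon, row in counts.items()}

def samples_detected(counts):
    """Number of samples in which each taxon has at least one read."""
    return {taxon: sum(1 for v in row if v > 0) for taxon, row in counts.items()}
```

Samples with an unusually low total in `reads_per_sample` are candidates for closer inspection during quality control.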
Tutorial:
- Open the Summary page. Make sure that the Demo Project has been loaded first, as described above.
- Set Type to ReadsPerSample and press SelectMode
- Press DrawChart
The figure shows the distribution of the number of reads per sample. The number of reads is plotted on the x-axis, the number of samples on the y-axis.
Rarefaction analysis
The coverage of the original microbial communities by metagenomic sequence data is estimated by rarefaction analysis. Microbial sequences are randomly drawn from each sample. For each subsample, the number of observed species is counted and plotted as a function of the number of sampled sequences. The slope of the rarefaction curve indicates if the underlying microbial community is well represented by the sequence data. A steep slope indicates that a large fraction of the species diversity remains to be discovered. If the curve becomes flatter to the right, a reasonable number of sequence reads has been obtained and more intensive sampling is likely to yield only a few additional species [2].
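The subsampling procedure described above can be sketched in Python as follows. This is a simplified illustration, not the routine Calypso uses: the function name, the step size and the fixed random seed are assumptions, and the subsamples are nested (each larger subsample extends the previous one) to keep the code short.

```python
import random

def rarefaction_curve(taxon_counts, step=100, seed=0):
    """Return (reads drawn, taxa observed) pairs for one sample."""
    # Expand the counts into one label per read, then shuffle once: every
    # prefix of the shuffled list is a random subsample without replacement.
    reads = [taxon for taxon, n in taxon_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    rng.shuffle(reads)
    curve = []
    seen = set()
    for i, taxon in enumerate(reads, start=1):
        seen.add(taxon)
        if i % step == 0 or i == len(reads):
            curve.append((i, len(seen)))
    return curve

# Hypothetical sample: two dominant taxa and two rare ones
sample = {"A": 500, "B": 300, "C": 5, "D": 1}
curve = rarefaction_curve(sample, step=100)
```

Plotting the second element of each pair against the first gives the rarefaction curve for the sample. Rare taxa such as "D" tend to appear only at larger subsample sizes, which is why the curve keeps rising slowly even after the dominant taxa have been found.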
Tutorial
- Open the Rarefaction Analysis page. Make sure that the Demo Project has been loaded first, as described above.
- Select a taxonomic level (Recommendation: OTU or species)
- Press DrawChart
The rarefaction curves obtained for the example dataset (OTU based) flatten to the right, indicating that the underlying microbial communities are well covered by the sequence data. However, since the curves are still increasing, some bacterial species have been missed and complete coverage of the microbial diversity would require deeper sequencing.
Sample: Visualise microbial community composition
Microbial community composition is visualized quantitatively using heatmaps, bubble plots, scatter plots, stripcharts, barcharts or boxplots. The SamplePlots Page uses bubble plots, heatmaps and bar charts to visualise measurements by square size, colour code and bar height, respectively. When visualizing data in heatmaps, Calypso allows trimming of outliers, selection of a wide range of color palettes and adjustment of the color range center.
Generate Boxplot
- Open the SamplePlots Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to OTU
- Set chart type to Boxplot
- Set order to "GroupS-GroupP-Pair" to order the samples by secondary group (age group), then by primary group and finally by family.
- Press "Select Mode"
- Choose a colour palette
- Set filter to 30 to include the 30 most abundant OTUs in the box plot
- Press "Draw Chart"
- To obtain high-quality figures for publication, set figure resolution, width and height.
The following figure was obtained for the top 30 OTUs of the demo project. The x-axis represents samples and y-axis relative OTU counts. Samples are colored and ordered by the secondary group (age group).
Generate Barchart
- Open the SamplePlots Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to Family
- Set chart type to Barchart
- Set order to "GroupS-GroupP-Pair" to order the samples by secondary group (age group), then by primary group and finally by family.
- Press "Select Mode"
- Choose a colour palette
- Set filter to 10 to only show the top 10 taxa with the highest mean across all samples
- Press "Draw Chart"
- To obtain high-quality figures for publication, set figure resolution, width and height. Increase the width and/or height if labels are displayed only partially or overlap, or if the legend and chart overlap
- Colors are randomly assigned to each taxon by default. Press Draw Chart again to re-assign colours or choose a specific colour palette.
The following bar chart is obtained for the demo project. Bar charts are useful for presenting the abundance of the dominant bacterial groups. The figure below shows that the family Bifidobacteriaceae (orange) dominates the gut microbiota of infants, while Ruminococcaceae and Lachnospiraceae are more prevalent in adults.
Generate heatmap
- Set the taxonomic level to family
- Set chart type to Heatmap+
- Press Select Mode
- Select the BlueGoldRed colour palette
- Set filter to 0 to include all taxa
- Select the ReOrderSamples checkbox. If ReOrderSamples is selected, samples will be ordered by hierarchical clustering. Otherwise they will be ordered as selected in the drop down menu Order.
- Unselect the scale data checkbox. If scale data is selected, values will be scaled to the range 0-1.
- Set image resolution to 200, width to 530 and height to 200. Width and/or height can be increased if labels are displayed only partially or overlap, or if the legend and chart overlap
- Press Draw Chart
Rows of the generated heatmap represent taxa, columns represent samples. Both are ordered by hierarchical clustering. Taxa abundances are colour-coded, ranging from red (highly abundant) to blue (rare or absent). Values of explanatory variables are presented as a separate heatmap on top of the main heatmap.
The following figure is obtained for the example dataset. Samples cluster by age. The families Lachnospiraceae, Ruminococcaceae and Prevotellaceae dominate the gut microbiota of older subjects, whereas Bifidobacteriaceae is more abundant in infants.
Next: Remove outliers
- Set "Remove outliers, trim values by" to 40. All values above 40 will be trimmed to 40. This can be used to trim outliers.
- Press Draw Chart
Next: Define color centre
- Set Color centre to 0.5. Color centre changes the distribution of assigned colours and allows shifting the colour palette. This can be used to increase the resolution for low data values. For example, if BlueGoldRed is chosen as colour palette and Color centre is set to 0.5, then gold is assigned to values of around 0.5, blue is assigned to values < 0.5 and red is assigned to values > 0.5 (for values in range 0-1).
- Press Draw Chart
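The two heatmap options used in the steps above can be pictured with a small Python sketch. It is only a guess at the general idea, not Calypso's implementation: the function names are hypothetical, and the piecewise-linear mapping of values onto a three-colour palette is an assumption made for illustration.

```python
def trim(values, limit):
    """Replace every value above the limit by the limit itself."""
    return [min(v, limit) for v in values]

def colour_position(value, centre, low=0.0, high=1.0):
    """Map a value to a position in [0, 1] on a three-colour palette.

    Position 0 is the first colour (e.g. blue), 0.5 the middle colour
    (e.g. gold) and 1 the last colour (e.g. red). The centre value is
    placed at 0.5, and both halves are stretched linearly (assumption).
    """
    value = max(low, min(high, value))
    if value <= centre:
        return 0.5 * (value - low) / (centre - low) if centre > low else 0.5
    return 0.5 + 0.5 * (value - centre) / (high - centre)
```

With a centre of 0.2, for instance, half of the palette is spent on values between 0 and 0.2, which is how a lower centre increases the colour resolution for low data values.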
The following heatmap is obtained for the example dataset:
Details
Group: Compare taxa abundance across sample groups
Switch to the Group Page. Measurements across sample groups are compared by parametric and non-parametric statistical tests, including anova, nested anova, t-test, paired t-test, Bayesian t-test, Wilcoxon rank test and Mann-Whitney U test. Significantly different taxa are shown as box plots, bar charts or strip charts. Details can be found here.
Tutorial
- Open the GroupPlots Page. Make sure that the Demo Project has been loaded, as described above.
- Select a taxonomic level, e.g. genus
- Select a sample group (either primary or secondary group; these groups are defined in the meta information file uploaded by the user).
- Press Select Mode
- Set plot type. AnovaPlot compares measurements (taxa counts in the selected demo project) by Anova and visualises taxa with significantly different measurements as a bar chart, with standard deviation as error bars.
- Set a significance threshold. Only taxa that are significantly different with a p-value below this threshold are shown
- Press Draw Chart
Significantly different taxa are shown as a bar chart (p<0.05, ANOVA). Standard error is depicted by error bars. Pair-wise comparisons are done by t-test and annotated as *: p<0.05, **: p<0.01, ***: p<0.001.
Among the highly abundant genera, Ruminococcus, Prevotella, Clostridium, Bacteroides and Blautia differ between individuals from the metropolitan (USA) and the rural (Malawi, Venezuela) populations.
Visualize abundance of individual taxa
Switch to the Taxa Page. The abundance of individual taxa across sample groups is visualised by box plots or stripcharts. Significance of differences can be tested using parametric or non-parametric statistical tests (select a test that is appropriate for your data).
Tutorial
- Open the Taxa Page. Make sure that the Demo Project has been loaded, as described above.
- Select a taxonomic level (e.g. phylum) and press Select Mode
- Select the plot type (stripchart or box plot) via the Type drop down menu
- Select a microbial taxon of interest via the Taxa drop down menu
- Press Draw Chart
Outliers can be removed from the plot by setting the Remove outliers option. If a value >0 is specified, all samples with an abundance above this value are excluded from the plot (they are however included in the calculation of the median/mean and quartiles).
Stats: Statistical comparison of sample groups
The Stats Page allows an in-depth statistical comparison of taxa abundances across sample groups. Taxa abundances can be compared by parametric and non-parametric statistical tests, including anova, nested anova, t-test, paired t-test, Bayesian t-test (Cyber-T), negative binomial distribution (DESeq2), ALDEx2, ANCOM, Wilcoxon rank test and Kruskal-Wallis test. Select the test that is appropriate for your data. P-values are adjusted for multiple testing by FDR, Benjamini-Hochberg or Bonferroni correction. Taxa associated with different biological conditions can further be identified using the linear discriminant analysis (LDA) effect size method (LEfSe). LEfSe determines the taxa most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and effect relevance.
Tutorial
- Open the Stats Page. Make sure that the Demo Project has been loaded first, as described above.
- Set the taxonomic level to genus.
- Select a statistical test, e.g. anova or rank test. Rank test compares measurements by Wilcoxon rank test for two groups and Kruskal-Wallis test for more than two sample groups
- Set filter to 50. Only the top 50 taxa with the highest mean abundance across all samples are included.
- Press Select Mode
- Select sample group (either primary, secondary group or any of the provided explanatory variables; these groups are defined in the meta information file uploaded by the user).
- Press Do stats
- Click on the table header to sort table, e.g. by p-value.
Results of the statistical analysis are presented as a table. Shown are p-values, Bonferroni corrected p-values, false discovery rate, Benjamini-Hochberg corrected p-values [3] and the mean in each group. The following table depicts only the 20 most significantly different genera in the gut microbiota between different locations.
The distribution of p-values is presented as a histogram and a quantile-quantile (QQ) plot. In the lower left figure, the uniform (expected) p-value distribution is indicated by the red line 'Expected'. QQ plots characterize the extent to which the observed distribution of the test statistics follows the expected (null) distribution. This allows the detection of evidence for systematic bias.
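The idea behind a QQ plot can be sketched in Python as follows. This example works on p-values and plots them on a -log10 scale, which is one common convention and an assumption here; it is not necessarily how Calypso draws its QQ plot, and the function name is hypothetical.

```python
import math

def qq_points(pvalues):
    """Return (expected, observed) pairs of -log10 p-values.

    Under the null hypothesis, p-values are uniformly distributed, so the
    rank-th smallest of m p-values is expected to be rank / (m + 1).
    P-values must be greater than 0.
    """
    m = len(pvalues)
    points = []
    for rank, p in enumerate(sorted(pvalues), start=1):
        expected = rank / (m + 1)
        points.append((-math.log10(expected), -math.log10(p)))
    return points
```

If the points lie along the diagonal, the observed p-values behave as expected under the null distribution. A curve that departs from the diagonal over its whole range, rather than only for the smallest p-values, hints at a systematic problem such as an inappropriate test or unmodelled confounding.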
Statistical methods for counts data
Calypso provides several statistical methods that were developed specifically for the analysis of count data, such as metagenomic, 16S, or RNAseq data. DESeq2 models read counts using the negative binomial distribution. ALDEx2 uses Bayesian methods to infer technical and statistical error. ALDEx2 accounts for the dependence of observations. For example, the relative abundance of dominant taxa measured by 16S rDNA sequencing is highly dependent, which can impair analysis using standard statistical methods. ANCOM has been developed for analysing microbial community composition. ANCOM also accounts for compositional constraints of metagenomic data to reduce false discoveries in detecting differentially abundant taxa.
These methods operate directly on counts data. To use these methods, please upload raw read counts (non-normalized, non-rarefied and without any prior transformation).
The output of ANCOM is a table showing the significantly differentially abundant microbial taxa.
ALDEx2 generates the following output:
- we.ep: the expected P value of the Welch’s t-test
- we.eBH: the expected value of the Benjamini-Hochberg corrected P value for the Welch’s t-test
- wi.ep: the expected P value of the Wilcoxon test
- wi.eBH: the expected value of the Benjamini-Hochberg corrected P value of the Wilcoxon test
Random forest classification
Random forest classification can be applied to examine complex associations between microbial community composition and a study variable (e.g. location or age in the example dataset).
The following tutorial explains how to apply a random forest analysis to identify genera predictive of geographic location:
- Open the Stats Page. Make sure that the Demo Project has been loaded first, as described above.
- Select the genus level
- Set statistical test to "RandomForest"
- Press "Select Mode"
- Set filter to 20; to reduce computing/waiting time, only the top 20 genera will be included
- Set "Group by" to "Location".
- Press "Do stats"
- Click on the table header to sort the table by "Score (Mean Decrease Accuracy)"
Genera most predictive of geographic location are then shown at the top of the table.
Random forest is implemented in R:
rf <- randomForest(as.factor(groups) ~ x, importance=T, proximity=F, ntree=10000, mtry=20)
im <- as.data.frame(importance(rf))
score <- im$MeanDecreaseAccuracy
where groups represents the selected sample group and x represents the data matrix.
Details Stats Page
Multivariate analysis
The Multivariate Page facilitates multivariate data visualisation and multivariate statistical testing. Multivariate statistics are powerful techniques that can identify complex associations between community composition and multiple explanatory variables. In Calypso, complex associations can be examined by the multivariate methods principal component analysis (PCA), redundancy analysis (RDA), canonical correspondence analysis (CCA), detrended correspondence analysis (DCA), non-metric multidimensional scaling (NMDS), hierarchical clustering, heatmaps, correlation networks and multivariable regression. Correlation networks visualize the positive and negative associations between taxa, between explanatory variables, and between taxa and explanatory variables.
To explore complex associations between community composition and multiple explanatory variables using multivariate statistics, set type to Heatmap+, RDA+ or CCA+. If RDA+ or CCA+ is chosen, all explanatory variables defined in the meta annotation file are included in one single coherent model. For each explanatory variable a p-value is computed, indicating if the variable is significantly associated with community composition (i.e. if the variable significantly explains variation in the counts data file (RDA+) or variation in sample distances (CCA+)).
Some methods (e.g. Anosim) require categorical variables. In these cases, Calypso will automaticall categorize numeric values using the "cluster" method of the discretize() function implemented in the R package arules. The function assigns values to categories by k-means clustering.
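The idea of categorizing a numeric variable by k-means can be sketched in a few lines. The block below is an illustration in Python, not the arules code that Calypso calls; the function name kmeans_1d, the choice of starting centres and the example ages are assumptions made for this sketch.

```python
def kmeans_1d(values, k, iterations=100):
    """Assign each numeric value to one of k categories by 1-D k-means."""
    ordered = sorted(values)
    # Assumed initialisation: spread the starting centres over the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: every value joins its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        # Update step: every centre moves to the mean of its members.
        new_centres = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centres.append(sum(members) / len(members) if members else centres[c])
        if new_centres == centres:
            break  # converged
        centres = new_centres
    return labels, centres


# Hypothetical ages, grouped into three categories.
ages = [1, 2, 3, 25, 27, 30, 61, 65, 70]
labels, centres = kmeans_1d(ages, k=3)
print(labels)
```

Each value ends up with an integer category label, which can then be used wherever a method requires sample groups.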
GUide to STasitical Analysis in Microbial Ecology (GUSTA ME) provides an excellent guide for the multivariate analysis of microbial community composition.
Correlation heatmap
Pearson's correlations between explanatory variables and community composition are shown as a heatmap.
- Set level to OTU
- Set type to Heatmap+
- Press "Select Mode"
- Press "Draw Chart"
In the figure below, positive correlations are shown in red, negative correlations in blue. Rows represent explanatory variables and columns taxa (or OTUs).
Principal component analysis (PCA)
PCA is an unsupervised ordination method that visualises the data as a 2D plot. It helps to identify sample clusters and potentially problematic samples (outliers), which may need to be excluded from downstream analysis (e.g. by setting the include flag to 0 in the meta data file). PCA ordinates samples in two dimensions according to the main variance in the dataset. PCA+ overlays explanatory variables.
- Open the Multivariate Page. Make sure that the demo project has been loaded first, as described above.
- Set level to OTU
- Set type to PCA
- Press Select mode
- Set colour to blueYellowRed
- Set Group/Colour by to Location. Samples in the PCA plot will then be coloured according to their Location.
- Set hull to Filled Spider. In the PCA plot, samples of one group will be connected by lines as a spider plot.
- Press Draw Chart
Hulls are generated using the ordihull() function from the vegan package with default parameters. The confidence level for ellipses is 0.95.
The following PCA plot is obtained for the demo project. Samples cluster by location. Samples collected from metropolitan populations in the USA form a separate cluster from samples collected in rural populations from Malawi and Venezuela. Significance of observed clusters can be tested by CCA (next section).
Canonical correspondence analysis (CCA)
The PCA plot presented above indicates that samples cluster by location. The statistical significance of the observed clustering can be tested by the supervised multivariate method Canonical Correspondence Analysis (CCA).
- Set level to OTU
- Set type to CCA
- Press "Select mode"
- Set "Group/Colour by" to Location. A CCA will be run for the selected group.
- Set distance metric to Euclidian.
- Press "Draw Chart"
CCA is a multivariate method that is used to explore complex associations between measured variables and multiple explanatory variables (or confounding factors). CCA tests if variations in the data matrix can be explained by the selected sample group (the sample group selected under the drop down menu "Group/Colour by").
Two figures are generated. The first shows a 2D ordination plot, indicating how well samples can be separated according to the sample groups. The second plot provides a p-value, indicating if the sample location significantly explains variations in the sample distances, or in other words if samples cluster significantly by location.
The following result is obtained for the demo project.
According to these results, location significantly explains variations observed in the gut microbiota (p=0.001).
Principal coordinates analysis (PCoA)
PCoA is one of the most popular ordination methods in ecology. PCoA is unsupervised. Given a matrix of pair-wise distances between samples, PCoA arranges the samples in a 2-dimensional plot so that the Euclidean distance between two samples in the plot approximates their pair-wise distance in the original matrix as closely as possible. PCoA helps to identify sample clusters and potentially problematic samples (outliers), which may need to be excluded from downstream analysis.
- Open the Multivariate Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to OTU
- Set type to PCoA
- Press "Select mode"
- Set distance metric to "Euclidian". Measurement profiles can be compared by a wide range of distance metrics, including Euclidian, Manhattan, inverse Pearson's correlation and Bray-Curtis index.
- Set "Group/Color by" to Secondary Group (age group), also set "Symbol by" to Secondary Group
- Select default in the color palette
- Press "Draw Chart"
The figure illustrates that the samples are separated by age along the x-axis.
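Three of the distance metrics named in the tutorial above can be written out directly. The following Python sketch is an illustration only and not Calypso's implementation; the function names and the example counts are assumptions. The resulting pair-wise matrix is the kind of input that a PCoA ordinates.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two abundance profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute abundance differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared taxa."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


def distance_matrix(samples, metric):
    """Pair-wise distances between all samples."""
    return [[metric(a, b) for b in samples] for a in samples]


# Hypothetical counts: three samples (rows), three taxa (columns).
samples = [[10, 0, 5], [5, 5, 5], [0, 10, 5]]
matrix = distance_matrix(samples, bray_curtis)
print(matrix[0][1], matrix[0][2])
```

Bray-Curtis ignores taxa that are absent from both samples, which is one reason it is commonly preferred for sparse count data.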
Redundancy analysis (RDA+)
The supervised multivariate method RDA+ includes all explanatory variables defined in the meta annotation file.
- Set level to OTU
- Set type to RDA+
- Press "Select mode"
- Set "Group by" to "Primary Group" (location)
- Set colour to yellowblue2
- Set filter to 50; only the top 50 OTUs will be included to speed up processing/waiting time
- Press Draw Chart
RDA is a multivariate method that is used to explore complex associations between community composition and multiple explanatory variables. All explanatory variables defined in the meta annotation file are included in the multivariate analysis. Two figures are generated. The first shows a 2D ordination plot, indicating how well samples can be separated according to the defined explanatory variables. Samples will be coloured by the variable selected under "Group/Colour by". The p-values reported in the second figure indicate if each explanatory variable is significantly associated with variation in the data matrix (i.e. if the variable significantly explains variation in sample distances).
The following result is obtained for the example dataset. Location (Malawi, Venezuela, USA), age, BMI and gender are significantly associated with variation in gut microbial composition (p<0.05).
Support vector machine classification (SVM)
The discriminatory power of the uploaded data to distinguish between two sample groups (e.g. cases vs control) can be examined using a Support Vector Machine evaluated by leave one out cross validation. Or in other words, SVM leave one out cross-validation (SVM LOOC) can be used to assess if microbial community composition is predictive of sample groups. The classification performance is described by overall accuracy, sensitivity and specificity.
For example, using the demo project SVM LOOC can be employed to examine if the gut microbiota is predictive of age (e.g. infants vs adults) or location (e.g. rural vs metropolitan).
The selected sample group must have exactly two different values, e.g. case/control, protected/unprotected, male/female.
SVM LOOC works iteratively. In each step, one sample is excluded and an SVM is trained to discriminate between the two classes. The trained SVM is then applied to predict the class of the excluded sample. This is repeated for each sample. Finally, the predicted class is compared with the known class of each sample to calculate the classification accuracy (percentage of correctly predicted class labels), sensitivity (TP/(TP+FN)) and specificity (TN/(TN+FP)).
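The leave-one-out loop and the three performance measures can be sketched as follows. This is a Python illustration, not Calypso's code, and it deliberately swaps the SVM for a much simpler nearest-centroid classifier so that the example stays self-contained. All function names and the toy data are assumptions.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier (NOT an SVM): predict the class with the closest mean profile."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda lab: squared_distance(sample, centroids[lab]))


def leave_one_out(features, labels, positive):
    """Hold out each sample once, train on the rest, and tally the predictions."""
    tp = tn = fp = fn = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted = nearest_centroid_predict(train_x, train_y, features[i])
        actual = labels[i]
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(features),
        "sensitivity": tp / (tp + fn),  # share of positives that were found
        "specificity": tn / (tn + fp),  # share of negatives that were found
    }


# Hypothetical abundances of two taxa in six samples.
features = [[9, 1], [8, 2], [8, 1], [1, 9], [2, 8], [7, 3]]
labels = ["rural", "rural", "rural", "metro", "metro", "metro"]
result = leave_one_out(features, labels, positive="rural")
print(result)
```

In this toy dataset the last metropolitan sample resembles the rural profiles and is misclassified when held out, which lowers specificity but not sensitivity.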
Tutorial
- Set level to OTU
- Set type to SVM LOOC
- Press "Select mode"
- Set "Group/Colour by" to "Secondary group", to run a SVM LOOC for the secondary group variable (Rural vs Metropolitan subjects).
- Set Filter to 20. SVM LOOC is a time consuming analysis.
- Press "Draw Chart"
The top 20 OTUs are able to predict the location with 88% accuracy. 92% of the samples collected in rural (Malawi, Venezuela) locations were correctly classified into this class (sensitivity = 0.92) and 88% of the samples obtained from metropolitan (USA) individuals were correctly classified into this class (specificity = 0.88).
Non-metric multidimensional scaling (NMDS)
The unsupervised non-metric multidimensional scaling produces an ordination based on a distance or dissimilarity matrix. NMDS attempts to represent, as closely as possible, the pairwise dissimilarity between objects in low-dimensional space. In Calypso, NMDS is implemented in R using the vegan metaMDS() function. Pair-wise sample distances are computed using the Bray-Curtis dissimilarity. See GUSTA ME for more details.
Anosim
Anosim uses dissimilarity matrices to test if sample groups are significantly different (i.e. if they have different community profiles as measured by the selected distance metric). [4] Anosim provides a single p-value indicating if community profiles are significantly different between sample groups. The p-value is calculated by comparing intra-group distances with between-group distances.
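The comparison of intra-group and between-group distances can be made concrete with the standard ANOSIM R statistic, which is the difference between the mean rank of between-group and within-group dissimilarities, divided by N(N-1)/4 for N samples. The Python sketch below is an illustration under assumptions (function names, the seed and the toy distances are made up for the example) and is not the code Calypso runs.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def anosim_r(dist, groups):
    """R = (mean between-group rank - mean within-group rank) / (N(N-1)/4)."""
    pairs = list(combinations(range(len(groups)), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    return (sum(between) / len(between) - sum(within) / len(within)) / (len(pairs) / 2)


def anosim(dist, groups, permutations=999, seed=0):
    """Permutation p-value: how often do shuffled labels give an R at least as large?"""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical samples placed on a line; two tight, well separated groups.
positions = [0, 1, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]
r_value, p_value = anosim(dist, ["a", "a", "b", "b"])
print(r_value, p_value)
```

An R close to 1 means that samples within a group are more similar to each other than to samples from other groups; an R near 0 means that the grouping explains little.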
Adonis (permutational manova (PERMANOVA))
Adonis is a multivariate technique analogous to MANOVA and describes if variation in community composition can be attributed to different experimental treatments or control variables (see vegan R package for more details). Adonis describes if the community composition is different between groups. Adonis-F describes if variance in community composition can be attributed to primary groups, secondary groups and pair. Adonis+ describes if variance in community composition can be attributed to the different explanatory variables.
DCA
Detrended correspondence analysis (DCA) is an ordination method widely used in ecology. The method is related to correspondence analysis but avoids the "arch effect". DCA is implemented in R using the vegan decorana() function. [5][6][7].
PLS
Partial least squares regression is a supervised multivariate method used in Calypso to identify taxa associated with multiple explanatory variables.
DAPC
Discriminant Analysis of Principal Components is a multivariate method designed to identify and describe clusters of related individuals (Jombart et al., BMC Genetics 2010). DAPC is implemented in R using the adegenet dapc() function.
Permdisp2
Tests if the multivariate dispersion of samples (the spread of community composition within each group) differs significantly between groups. PERMDISP2 visualizes the distances of each sample to the group centroid in a PCoA and provides a p-value for the difference in dispersion between groups. Permdisp2 is implemented in R using the vegan betadisper() function.
Network
Samples are represented as nodes, and samples with a similar community composition are connected with edges (similarity > Edge Min Similarity). Set the Edge Min Similarity parameter to increase/decrease the number of connections. The dissimilarity measure can be set via "Distance Method". Recommendation: Bray-Curtis.
Diversity: Analysis of microbial diversity
Diversity of microbial communities can be visualized as boxplots, stripcharts, barcharts and rarefaction curves. Calypso provides multiple metrics for measuring microbial alpha diversity, including Shannon index, evenness, richness, Simpson index, Chao 1, and Fisher’s Alpha. "Shannon index" measures the overall diversity of a community, including both number of present taxa/OTUs and evenness. "Richness" measures the number of present taxa/OTUs. "Evenness" measures how evenly abundant the present taxa/OTUs are. Community richness is estimated by rarefaction analysis to account for differences in sample sizes. Complex associations between microbial diversity and multiple explanatory variables are identified by multiple linear regression. Calypso further supports diversity analysis using mcpHill, which simultaneously investigates several diversity measures by unifying them in one and the same mathematical family of indices.
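The three basic indices described above can be computed from a single vector of counts. The sketch below is a Python illustration, not Calypso's implementation. It uses the natural logarithm for the Shannon index and Pielou's definition of evenness (Shannon index divided by the logarithm of richness); both choices are assumptions, since the exact formulas used by Calypso are not stated here.

```python
import math


def richness(counts):
    """Number of taxa/OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over the taxa that are present."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def evenness(counts):
    """Pielou's evenness (assumed definition): H divided by ln(richness)."""
    present = richness(counts)
    return shannon(counts) / math.log(present) if present > 1 else 0.0


# Hypothetical read counts for two samples with the same richness.
even_sample = [25, 25, 25, 25, 0]
skewed_sample = [97, 1, 1, 1, 0]
for sample in (even_sample, skewed_sample):
    print(richness(sample), shannon(sample), evenness(sample))
```

Both samples contain four taxa, but the sample dominated by a single taxon has a much lower Shannon index and evenness.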
Tutorial
- Open the Diversity Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to "OTU"
- Set index to Shannon and press "Draw Chart"
The figure shows an increased diversity with age. The p-value above the stripchart confirms that the change is significant.
Details
"Shannon index" measures overall diversity, including both number of present OTUs and evenness. "Richness" measures the number of present OTUs. "Evenness" measures how evenly abundant the present OTUs are. "AbundancePlot" displays the size distribution (relative number of assigned reads) of OTUs. Error bars in stripcharts visualize standard deviation. "Anova plot" will mark significant differences by lines (with or without asterisks). For datasets with a low number of study groups, significance is indicated by asterisks (* for P<0.05, ** for P<0.01 and *** for P<0.001). For datasets with many groups, only the lines are shown, without asterisks.
As species richness depends on sample size, richness is estimated by rarefaction analysis. Expected species richness is calculated from random subsamples of size minTotal from the community, where minTotal is the total number of reads of the smallest sample.
Richness is computed using the rarefy() R function of the vegan package.
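The expected richness of a random subsample can be computed without drawing any subsamples, using Hurlbert's formula, on which vegan's rarefy() is based: a taxon with c reads in a sample of N reads is missing from a subsample of n reads with probability C(N-c, n) / C(N, n). The Python sketch below illustrates the formula; the function name and the example counts are assumptions.

```python
from math import comb


def rarefied_richness(counts, depth):
    """Expected number of taxa in a random subsample of `depth` reads."""
    total = sum(counts)
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    # For each present taxon, add the probability that it appears in the subsample.
    return sum(1 - comb(total - c, depth) / comb(total, depth) for c in counts if c > 0)


# Hypothetical read counts; minTotal is the read total of the smallest sample.
samples = {"s1": [50, 30, 15, 5], "s2": [10, 6, 4, 0]}
min_total = min(sum(counts) for counts in samples.values())
for name, counts in samples.items():
    print(name, rarefied_richness(counts, min_total))
```

The smallest sample keeps its observed richness, while larger samples are scaled down to the richness expected at the same sequencing depth.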
mcpHill
mcpHill simultaneously investigates several diversity measures by unifying them in one and the same mathematical family of indices. These families (Hill numbers) represent a variety of useful diversity indices. Families with q<0 emphasise rare species and families with q>2 emphasise abundant species (q represents the order of Hill numbers). The advantage of this procedure is that a researcher does not have to commit to a particular diversity measure but instead examines multiple indices simultaneously. P-values are corrected for multiple testing. The method is implemented in R using the mcpHill() function of the simboot package: mcpHill(data=counts,fact=group,boot=1000, mattype="Tukey"), where counts is the counts data matrix (number of sequence reads assigned to each taxon) and group represents the selected sample grouping. More details can be found in: Pallmann et al, Assessing group differences in biodiversity by simultaneously testing a user-defined selection of diversity indices, 2012.
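How a single order q ties familiar indices together can be seen from the definition of the Hill number itself. The Python sketch below only illustrates that definition; it does not reproduce the simultaneous testing performed by mcpHill(), and the function name and counts are assumptions.

```python
import math


def hill_number(counts, q):
    """Hill number of order q for one sample (effective number of taxa)."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if abs(q - 1) < 1e-12:
        # The limit q -> 1 is the exponential of the Shannon index.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1 / (1 - q))


# Hypothetical read counts for one sample.
counts = [50, 30, 20]
for q in (0, 1, 2):
    print(q, hill_number(counts, q))
```

Order 0 returns richness, order 1 the exponential of the Shannon index, and order 2 the inverse Simpson index, so increasing q shifts the weight towards abundant taxa.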
Regression: Multivariable linear regression
The Regression Page facilitates multivariable linear regression. Multivariable linear regression is a powerful technique that can identify complex associations between community composition and multiple explanatory variables. Multiple co-variates can be included. Additional information can be found here.
Identify microbiome-environment associations by multiple regression
- Open the Regression Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to Genus
- Set Regress by to Taxa vs Envp and press Set Mode
- Set filter to 50 to restrict the analysis to the top 50 genera with the highest mean across samples
- Press Run Analysis
The displayed table shows associations between multiple explanatory variables (as defined in the meta information file) and taxa abundance. For each taxon a regression model is fit, including the taxon as dependent variable and all explanatory variables as independent variables:
Taxon = fa1 + fa2 + fa3 …,
where fa1, fa2, ... are explanatory variables. P-values are shown for each variable-taxa combination, indicating the significance of associations. Click on the header to sort the displayed table, e.g. by p-value.
The following result table is calculated for the demo project. The table was sorted by p-values obtained for age group (by clicking on Age.p in the header of the table). Ten genera are significantly associated with age (p<0.05), indicated in blue. In total, five bacterial groups on genus level are still significantly associated with age group after correction for multiple testing by FDR (column Age.p.fdr, red box). Significant associations for other explanatory variables can be viewed by re-sorting the table.
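Correction for multiple testing by FDR is commonly done with the Benjamini-Hochberg procedure. The Python sketch below shows that procedure on made-up p-values; that the Age.p.fdr column is computed in exactly this way is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genera.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

A genus can be significant before correction (p<0.05) and lose significance afterwards, which is why fewer genera remain in the FDR column than in the raw p-value column.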
Explore associations between taxa and selected explanatory variable by multiple regression
- Open the Regression Page. Make sure that the Demo Project has been loaded first, as described above.
- Set Regress by to Age to explore associations between age and abundance of bacterial genera
- Press Set Mode
- Press Run Analysis
- Click on "P" in the generated table to order data by p-value
For each genus the results table lists the correlation R between age and abundance of that genus. A p-value is given, indicating the statistical significance of the observed correlation. Additionally, the table presents the mean abundance across all samples and the number of positive samples, where the genus has been detected (abundance >0).
The top genera with the highest abs(R) (absolute value of the correlation coefficient) are selected and the association of these genera with age is examined by multivariable regression. This model incorporates age as dependent variable and the top genera as explanatory variables: age ~ genus 1 + genus 2 + ….
Results are presented as tables and figures. For each included genus the p-value computed by multivariable regression is reported, indicating if the genus is significantly associated with age. Additionally, a scatter plot is shown, plotting the abundance of the genus (x-axis) versus the age (y-axis). The plots show that Bifidobacteria decreases with age, whereas Oscillospira increases.
Identify associations between microbial diversity and multiple explanatory variables by multiple regression
- Open the Regression Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to "OTU"
- Set "Regress by" to "Diversity vs Envp" and press "Set Mode"
- Set index to Shannon
- Set filter to 50 to restrict the analysis to the top 50 OTUs with the highest mean across samples
- Press "Run Analysis"
The table and figures display associations between the community diversity (Shannon index) and multiple explanatory variables (as defined in the meta information file). A regression model is fit, including the diversity as dependent variable and all explanatory variables as independent variables:
diversity = fa1 + fa2 + fa3 …, where fa1, fa2, ... are explanatory variables.
P-values indicate statistical significance of associations.
For numeric variables, scatterplots visualize the correlation of the variable with diversity. The p-value indicates the significance of the Pearson correlation between the community diversity and the factor. In the second scatterplot, the diversity index is controlled for all remaining explanatory variables (the controlled correlation is called "partial correlation"). For example the association between diversity and gender is controlled for BMI, age and location (by linear regression).
For discrete variables, boxplots along with an ANOVA-test describe the significance of the variable.
The table below indicates that age and location are significantly associated with Shannon diversity. The second scatterplot illustrates a positive correlation between age and Shannon index (corrected for the remaining explanatory variables gender, location). The correlation is significant (Pearson correlation, p=0.024).
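A partial correlation can be obtained by regressing both quantities on the controlling variable and correlating the residuals. The Python sketch below illustrates this with a single controlling variable, which is a simplification: Calypso controls for all remaining explanatory variables. The function names and the toy numbers are assumptions.

```python
import math


def residuals(y, x):
    """Residuals of the least-squares line y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def pearson(x, y):
    """Pearson's correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mean_x) ** 2 for a in x)
                           * sum((b - mean_y) ** 2 for b in y))


def partial_correlation(diversity, variable, covariate):
    """Correlation of diversity and variable after removing the covariate's effect."""
    return pearson(residuals(diversity, covariate), residuals(variable, covariate))


# Hypothetical data: both series depend on the covariate plus a shared signal.
covariate = [0, 0, 1, 1]
variable = [1, -1, 2, 0]
diversity = [3, -3, 5, -1]
print(pearson(diversity, variable), partial_correlation(diversity, variable, covariate))
```

The raw and the partial correlation differ whenever the controlling variable influences both quantities, which is why the two scatterplots can tell different stories.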
Details
Analysis of longitudinal data (time series; repeated measures)
Calypso provides powerful methods for the analysis of data with repeated measures, i.e. studies where several samples were collected from the same subject or environment (e.g. longitudinal studies). Data with repeated measures is analysed using mixed effect regression models (also called multilevel models).
Tutorial: Identify differentially abundant taxa
- Open the Time Series Page. Make sure that the Demo Project has been loaded first, as described above.
- Set Type to Taxa and press Draw Chart
Differentially abundant taxa are identified by fitting linear mixed effect regression models of the form: taxa abundance = time point + individual, where time point is included as fixed effect and individual as random effect. The time point variable can be selected via the Time variable drop down menu. The individual is defined in the third column of the meta data file (pair variable). P-values (time.p) are adjusted by Bonferroni correction and FDR. Coefficients are shown in column time.c.
Tutorial: Box plot of taxa abundance
- Open the Time Series Page. Make sure that the Demo Project has been loaded first, as described above.
- Set Type to Taxa
- Select a taxon in the Taxa drop down menu and press Draw Chart
The box plot shows the abundance of the selected taxon at each time point. Samples collected from the same individual are connected with a line.
Tutorial: Diversity analysis of repeated measures
- Open the Time Series Page. Make sure that the Demo Project has been loaded first, as described above.
- Set Type to Diversity and press Draw Chart
The box plot shows the microbial diversity at each time point. Samples from the same individual are connected by a line. A linear mixed effect regression model is fit of the form: diversity = time point + individual, where time point is included as fixed effect and individual as random effect. The time point variable can be selected via the Time variable drop down menu. The individual is defined in the third column of the meta data file (pair variable).
Network: Network analysis
Calypso enables network analysis for identifying co-occurring bacteria, mutually exclusive bacteria and clusters of co-occurring bacteria. Taxa and explanatory variables are represented as nodes, taxa abundance as node size, and edges represent positive and negative associations. Nodes can be colored by the phylum or family of the represented bacterial taxon. Alternatively, nodes (taxa) can be colored based on their association with selected environmental variables. Taxa abundances are associated with environmental variables using Pearson’s correlation. Nodes are then colored based on the strength of the association with each selected environmental variable. Networks are generated by first computing associations between taxa using the Pearson’s correlation index or Spearman’s rho. The resulting pairwise correlations are converted into dissimilarities and then used to ordinate nodes in a two dimensional plot by PCoA. In this way, correlating nodes are placed in close proximity and anti-correlating nodes are placed at distant locations in the network. Subsequently, nodes of positively or negatively correlating taxa are connected with yellow/green and blue edges, respectively. Networks are presented as static images or dynamic plots. Dynamic networks are provided under the Hierarchy tab. These have been realized using the Javascript D3.js library.
Nodes can be connected with edges based on three different criteria:
- Nodes are connected by an edge if their absolute Pearson’s correlation or Spearman’s rho is above a selected threshold. Alternatively, the significance of Pearson’s or Spearman’s correlation can be measured and correlations with p<0.05 are presented by an edge.
- LASSO regression. Iteratively, the abundance of each taxon is regressed on all remaining taxa using LASSO regression and the most relevant associations are identified. Relevance is assessed by 10-fold cross-validation. Significantly associated taxa are connected by edges. Edge width represents the coefficients learned by the final regression model.
- We have implemented a new ensemble method based on multiple similarity/dissimilarity measures. This method combines Bray-Curtis dissimilarities with Pearson’s correlation and Spearman’s rho. For each similarity/dissimilarity measure significance of correlation is computed. P-values for Bray-Curtis dissimilarities are computed by 1000-fold permutation. P-values obtained for the multiple similarity/dissimilarity measures are then combined using the Simes method and corrected for multiple testing by False Discovery Rate (FDR). Significant associations (FDR < 0.05) are presented as edges.
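The Simes method used in the ensemble criterion combines several p-values for the same taxon pair into one: the p-values are sorted, and the combined value is the minimum of n * p(i) / i over the ranks i. The Python sketch below illustrates this step only; the function name and the three example p-values are assumptions.

```python
def simes(p_values):
    """Combine p-values for one hypothesis with the Simes method."""
    n = len(p_values)
    ordered = sorted(p_values)
    return min(n * p / rank for rank, p in enumerate(ordered, start=1))


# Hypothetical p-values for one taxon pair from three measures
# (e.g. Bray-Curtis permutation test, Pearson, Spearman).
combined = simes([0.01, 0.04, 0.03])
print(combined)
```

The combined p-values of all taxon pairs would then be corrected for multiple testing before edges are drawn.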
"Network" generates a correlation network of taxa only. "Network+" also includes numeric explanatory variables. Select "Layout By Correlation" to lay out nodes based on a PCoA.
Tutorial 1: Network with nodes colored based on association to environmental variables
- Open the Network Page. Make sure that the Demo Project has been loaded first, as described above.
- Set Type to Network
- Set Level to "OTU"
- Press Draw Chart
- Un-select "Show node labels"
- Set red to "Location.USA", blue to "Location.Malawi" and yellow to "Location.Venezuela"
- Press "Draw Chart"
The following network is obtained for the demo project. Edges indicate positive correlations. The figure indicates clustering of taxa according to geographic location.
Tutorial 2: Network with black background and positive and negative edges
- Open the Network Page. Make sure that the Demo Project has been loaded first, as described above.
- Set Type to Network
- Set Level to "OTU"
- Press Draw Chart
- Un-select "Show node labels"
- Un-select "Show only positive correlations as edges"
- Open advanced options and set "Background color" to black
- Set red to "Location.USA", blue to "Location.Malawi" and yellow to "Location.Venezuela"
- Press "Draw Chart"
The following network is obtained for the demo project. Yellow edges indicate positive correlations and blue edges negative correlations. The figure indicates clustering of taxa according to geographic location.
Tutorial 3: Network with nodes colored by Phylum
- Open the Network Page. Make sure that the Demo Project has been loaded first, as described above.
- Set Type to Network
- Set Level to OTU
- Press Draw Chart
- Un-select Show node labels
- Set Color nodes based on to Phylum
- Press Draw Chart
- To show both negative and positive correlations, un-select Show only positive correlations as edges and press Draw Chart
Open network in Cytoscape
Generated networks can be saved in text format and opened in network visualisation software, such as Cytoscape.
Tutorial
- Generate network in Calypso as described in the above tutorial
- Save network in text format (GML) by right clicking on "Download network in GML format (can be opened in Cytoskape)" (right mouse click "save as …")
- Make sure the saved file has suffix ".gml". Depending on your browser, you may need to change suffix ".gml.txt" to ".gml" in File Explorer (Windows), Finder (Mac) or shell (Linux).
Open network in Cytoscape:
- Download and install Cytoscape
- Start Cytoscape
- Press "Import Network from File" and load the saved network in GML format
- In the "Layout" menu, press "Apply Preferred Layout"
The following network was generated for the demo project in Cytoscape:
mixMC: mixOmics microbial community studies
mixMC is a multivariate framework which takes into account the sparsity and compositionality of microbiome data. mixMC aims to identify specific associations between microbial communities and explanatory variables, such as habitat. It builds on the hypothesis that multivariate methods can help identify microbial communities that modulate and influence biological systems as a whole [8] [9]. The framework is based on the mixOmics R package.
To use mixMC in Calypso, please upload either raw (non-normalized) or rarefied counts data. Please do not upload TSS or CSS normalized counts data. However, data can be normalized in Calypso, if desired. Regardless of the selected normalization method in Calypso, the mixMC module will use the raw uploaded data after filtering.
The mixMC framework supports the analysis of repeated measurement data using multivariate projection-based approaches such as Principal Components Analysis (PCA) (data mining and exploration) and sparse Partial Least Squares Discriminant Analysis (sPLS-DA) (for identification of indicator species characterising each biological condition).
Reference
Le Cao, K.A., Costello, M.E., Lakis, V.A., Bartolo, F., Chua, X.Y., Brazeilles, R. and Rondeau, P., 2016. mixMC: a multivariate statistical framework to gain insight into Microbial Communities. bioRxiv, p.044206.
In Calypso, the mixMC module will first normalize taxonomic counts by TSS.
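Total sum scaling (TSS) divides each count by the total number of reads in its sample, and the centred log-ratio (CLR) transform named in the logratio argument of the sPLS-DA commands in this section subtracts the mean log value from each log-transformed proportion. The Python sketch below illustrates both steps and is not the mixOmics code; the function names and the small offset added before taking logarithms are assumptions.

```python
import math


def tss(counts):
    """Total sum scaling: convert raw counts of one sample to proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(proportions, offset=1e-6):
    """Centred log-ratio transform; `offset` (assumed value) avoids log(0)."""
    logs = [math.log(p + offset) for p in proportions]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical raw counts for one sample.
raw_counts = [120, 30, 0, 50]
proportions = tss(raw_counts)
transformed = clr(proportions)
print(proportions)
print(transformed)
```

After TSS the values of a sample sum to 1, and after CLR they sum to 0, which removes the dependence on sequencing depth.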
PCA
The mixMC PCA is run in R using the following parameters:
pca(X = counts.TSS, ncomp = 10, logratio = 'ILR').
Or for experiments with repeated measurements, i.e. experiments where multiple samples were collected from each individual, e.g. at different body parts:
pca(X = counts.TSS, ncomp = 10, logratio = 'ILR', multilevel = individual)
Where counts.TSS are TSS normalised taxonomic (or OTU) counts and individual are the individuals as defined in the third column of the metadata file.
sPLS-DA
When run in Calypso, the parameters of the sPLS-DA are estimated using the following R command:
tune = tune.splsda(counts.TSS, Y, ncomp = nlevels(Y) - 1, folds = 10, nrepeat = 30)
choice.keepX = tune$choice.keepX
choice.ncomp = length(choice.keepX)
The sPLS-DA is then run in Calypso using one of the following R commands:
1) res.spls<-splsda(X = counts.TSS, Y = Y, logratio='CLR', ncomp = choice.ncomp, keepX = choice.keepX)
2) Or for repeated measurement designs: res.spls <- splsda(X = counts.TSS, Y = Y, multilevel = individual, logratio='CLR', ncomp = choice.ncomp, keepX = choice.keepX)
Where counts.TSS are TSS normalised counts, Y is the selected biological condition and individual are the individuals as defined in the third column of the metadata file.
The performance of the model is evaluated using the following command:
perf <- perf(res.spls, validation = 'Mfold', folds = 5, nrepeat = 10)
plot(perf)
If repeated measurement data is analysed, individuals with only one single sample are excluded from the analysis.
Results of the sPLS-DA are presented in three figures and a table. The sPLS-DA selects the most discriminative taxa/OTUs that best characterize the selected biological conditions. The contribution plot and table show the taxa/OTUs associated with each biological condition. Additionally the performance of the sPLS model is shown. Networks represent the relevant associations between microbial OTUs and the selected biological conditions. More information for the network plot can be found here.
mixMC results for the example data set
Biomarker discovery
Calypso provides simple yet powerful methods for biomarker discovery. Predictive biomarkers associated with two sample groups (e.g. cases and controls; responders and non-responders) are identified by t-test, Wilcoxon rank test, nested anova, logistic regression or the random forest classifier. The discriminatory power of biomarker candidates is described by the area under the ROC curve (AUC), odds ratio, delta (difference in means in units of standard deviation) or fold change.
Biomarker discovery is only possible for variables with exactly two different values (e.g. cases and control).
Tutorial
- Open the Biomarker Page. Make sure that the Demo Project has been loaded first, as described above.
- Set level to Genus
- Set Group by to Primary Group
- Set filter to 30 to only include the top 30 taxa with highest mean value across samples
- Press Draw Chart
For each taxon a p-value, adjusted p-values (FDR, Benjamini-Hochberg and Bonferroni), the Area Under the ROC Curve (AUC with 95% upper and lower confidence intervals), the odds ratio (with 95% upper and lower confidence intervals), delta (difference in means divided by standard deviation), and fold change are calculated. Odds ratios are visualized as forest plots. Together these values indicate if taxa are potential predictive biomarkers for classifying samples into the two classes.
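Three of the reported measures can be computed directly from the abundances of one taxon in the two groups. The Python sketch below is an illustration and not Calypso's implementation: the AUC is computed as the share of case/control pairs in which the case has the higher abundance, and delta uses the pooled standard deviation, which is an assumption because the exact standard deviation used is not stated. Function names and abundances are made up.

```python
import math
from statistics import mean, variance


def auc(cases, controls):
    """Area under the ROC curve: probability that a case exceeds a control (ties count half)."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in cases for b in controls)
    return wins / (len(cases) * len(controls))


def fold_change(cases, controls):
    """Ratio of the mean abundances of the two groups."""
    return mean(cases) / mean(controls)


def delta(cases, controls):
    """Difference in means in units of the pooled standard deviation (assumed definition)."""
    pooled_var = (((len(cases) - 1) * variance(cases)
                   + (len(controls) - 1) * variance(controls))
                  / (len(cases) + len(controls) - 2))
    return (mean(cases) - mean(controls)) / math.sqrt(pooled_var)


# Hypothetical abundances of one genus in cases and controls.
cases = [3, 4, 5]
controls = [1, 2, 3]
print(auc(cases, controls), fold_change(cases, controls), delta(cases, controls))
```

An AUC of 0.5 indicates no discriminatory power, while values close to 0 or 1 indicate that the taxon separates the two groups well.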
If test is set to LogisticRegression, p-values are calculated by logistic regression corrected for all explanatory variables defined in the meta information file: group ~ taxa + v1 + v2 + v3 + ...., where v1, v2, ... are explanatory variables. AUC and odds ratios are also adjusted for all explanatory variables. In detail: odds ratios are calculated by logistic regression including all explanatory variables defined in the meta information file.
The following plot shows the forest plot obtained for the demo project.
Hierarchy: Use taxonomy reference for visualizations
This Calypso feature is only enabled when a taxonomy file has been uploaded.
Generate taxonomic trees
Tutorial
- Open the Hierarchy Page. Make sure that the Demo Project has been loaded first, as described above.
- Set type of visualization to Dendogram
- Press "Select Mode"
- Set "Group by" to "Primary group"
- Press Draw Chart
- Click on node to expand or hide subtrees
- Press "DownloadPNG" below the figure to save the image
Krona plots
Tutorial
- Open the Hierarchy Page. Make sure that the Demo Project has been loaded first, as described above.
- Select Krona as visualization type
- Press Select Mode
- Set Group by to Primary group
- Press Draw Chart
- Click on "Click here to open Krona in a new window".
- A new browser window will open displaying Krona pie charts of community composition
- Select Metropolitan in the selection list (left upper corner)
- Decrease the max depth by clicking on "-" to view the taxonomic level family
- Increase the font size by clicking on "+"
- Enable the Collapse checkbox to combine taxonomic levels, i.e. taxa with only one child will be merged
- Double-click Firmicutes in the pie chart to set this phylum as the root
- Click on all in the centre of the Krona plot to view the complete taxonomic hierarchy
- Click Snapshot to export the Krona plot
- Save the Krona chart by right-clicking and selecting Save Page As...
More information about Krona visualizations can be found on the Krona website.
Example
Details Hierarchy Plots
FS: Feature Selection
The optimal subset of taxa predictive of an outcome of interest can be determined using standard feature selection methods. Feature selection methods identify relevant taxa associated with an explanatory variable or confounding factor. These methods are based on the premise that many taxa are either redundant (highly correlated) or irrelevant, and can thus be removed without much loss of information. Calypso implements the standard, widely used feature selection methods: step-wise linear regression, LASSO regularized regression, and random forest. LASSO performs both feature selection and regularization to prevent overfitting. Random forest identifies the subset of most relevant features by constructing a collection of decision trees. Variance is controlled by constructing trees incorporating only a random subset of the features, which in turn avoids overfitting.
Available methods:
- Step-wise regression: step-wise regression implemented using the step() function (R stats package)
- LASSO regularised regression: LASSO performs both feature selection and regularisation to prevent overfitting. The method is implemented via the cv.glmnet() function from the R glmnet package.
- Random Forest: Feature selection by random forest, implemented via the randomForest() function from the R randomForest package
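To show how LASSO performs feature selection, the sketch below implements coordinate descent with soft-thresholding in plain Python. It is an illustration under stated assumptions, not what cv.glmnet() does internally: the penalty `lam` is fixed by hand rather than chosen by cross-validation, the predictors and outcome are assumed to be centred, and all names and data are made up.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso(x_cols, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1.
    x_cols is a list of centred predictor columns, y a centred outcome."""
    n = len(y)
    p = len(x_cols)
    beta = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(x_cols):
            # Residual of the model without predictor j
            resid = [y[i] - sum(beta[k] * x_cols[k][i] for k in range(p) if k != j)
                     for i in range(n)]
            rho = sum(col[i] * resid[i] for i in range(n)) / n
            norm = sum(c * c for c in col) / n
            beta[j] = soft_threshold(rho, lam) / norm
    return beta

# Made-up data: the outcome depends on taxon_a only
taxon_a = [-2.0, -1.0, 0.0, 1.0, 2.0]
taxon_b = [1.0, -1.0, 0.0, -1.0, 1.0]
outcome = [2.0 * a for a in taxon_a]

beta = lasso([taxon_a, taxon_b], outcome, lam=0.5)
print(beta)
```

The coefficient of the irrelevant predictor is set exactly to zero, which is how LASSO removes features, while the coefficient of the relevant predictor is shrunk below its unpenalised value of 2.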
Tutorial Feature Selection by Step-Wise Regression
- Open the Feature Selection Page. Make sure that the Demo Project has been loaded first, as described above.
- Select Step-wise Regression
- Press Select Mode
- Press Draw Chart
A linear regression model is selected in an iterative approach (forward stepwise regression). The shown figure provides details of the final, selected regression model. Shown are the number of taxa that were included in the analysis (included features), the number of selected taxa (selected features), and the AIC and Area Under the Curve (AUC) of the final (selected) model.
The AIC (Akaike information criterion) is a measure of the relative quality of the regression model and is used to compare and select models. Lower AIC values indicate relatively better models. AIC estimates the information lost when a given model is used to represent the data. It deals with the trade-off between the goodness of fit of the model and the complexity of the model [10].
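As a small illustration of how AIC trades goodness of fit against complexity, the Python sketch below compares an intercept-only model with a one-predictor model on made-up data. It uses the least-squares form of AIC, n*ln(RSS/n) + 2k, which equals the general definition up to an additive constant. The data and function names are hypothetical, and the sketch does not reproduce the step() procedure.

```python
import math

def gaussian_aic(residuals, n_params):
    """AIC of a least-squares model, up to an additive constant:
    n * ln(RSS / n) + 2 * k."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Made-up abundance of one taxon (x) and an outcome (y)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]

mean_y = sum(y) / len(y)
intercept, slope = fit_line(x, y)

aic_null = gaussian_aic([b - mean_y for b in y], 1)
aic_line = gaussian_aic([b - (intercept + slope * a) for a, b in zip(x, y)], 2)
print(aic_null, aic_line)
```

The model with the predictor has the lower AIC even though it uses one more parameter, so a stepwise procedure would keep the predictor.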
The selected taxa, ordered by their importance, are shown in the centre of the figure. Bars depict the importance of each taxon, i.e. the AIC of the model if that taxon was dropped from the model. A large relative increase in AIC indicates that the taxon is important for the model. The AIC of the model including all selected taxa is shown as the top bar and as a dashed red line.
The box plot in the lower left of the figure examines the quality of the selected model, i.e. how well the model represents the original data. For the outcome of interest, the plot shows the original value versus the value predicted by the model.
Tutorial Feature Selection by Lasso Regularized Regression
- Open the Feature Selection Page. Make sure that the Demo Project has been loaded first, as described above.
- Select Lasso Regression
- Press Select Mode
- Press Draw Chart
An optimal linear regression model is selected by LASSO regularised regression. The shown figure provides details of the selected regression model. Shown are the number of taxa that were included in the analysis (included features), the number of selected taxa (selected features), and the AIC and Area Under the Curve (AUC) of the final (selected) model.
The selected taxa, ordered by their importance, are shown in the centre of the figure. Bars depict the importance of each taxon (absolute t-statistic).
The box plot in the lower left of the figure examines the quality of the selected model, i.e. how well the model represents the original data. For the outcome of interest, the plot shows the original value versus the value predicted by the model.
Tutorial Feature Selection by Random Forest
- Open the Feature Selection Page. Make sure that the Demo Project has been loaded first, as described above.
- Select the Random Forest method and press Select Mode
- Press Draw Chart
The taxa selected by random forest are shown as a bar plot. Bars represent the importance of each taxon (estimated by permutation, as returned by the importance() function from the R randomForest package).
Paired: Paired analysis
The Pairwise Page facilitates paired statistical analysis. Paired analysis makes use of paired study designs, where several samples were taken from the same individual (e.g. before and after treatment) or from monozygotic (MZ) twin pairs in a case/control study. Comparisons are done by paired t-test or paired Wilcoxon rank test.
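The idea behind a paired comparison can be sketched in a few lines of Python: the test is applied to the within-pair differences rather than to the two groups separately. The values and the function name are made up for illustration; the sketch returns only the t-statistic and degrees of freedom, not a p-value.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t-statistic and degrees of freedom, computed on within-pair differences."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Made-up abundance of one taxon in five individuals, before and after treatment
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [12.0, 15.0, 10.0, 13.0, 17.0]

t_stat, df = paired_t(before, after)
print(t_stat, df)
```

Because each individual serves as its own control, variation between individuals does not inflate the standard error, which is what makes paired designs more sensitive.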
Details
Norm: Examine effects of data normalisation/transformation
The Norm Page allows you to examine the effects of different normalisation and transformation methods on the data distribution. Raw data (as uploaded to Calypso) and transformed data are shown. This does not have any effect on the data used by other Calypso pages.
FA: Factor Analysis
The FA Page facilitates factor analysis, a data reduction and structure detection method. Factor analysis is a statistical method used to reduce the number of variables and to identify structures in the relationships of variables. The aim is to reduce a large and complex dataset to a low number of factors without losing relevant information.
Factor analysis describes the variability among observed, correlated variables in terms of a lower number of unobserved variables called factors. Assume a dataset with hundreds or thousands of variables and only a few samples. Most likely many variables are highly correlated and contain redundant information. The basic idea of factor analysis is to combine multiple correlated variables into one representative factor. The original data is then described by the lower number of factors, which approximate the original variables.
Factor analysis is implemented in R using the nmf() function from the NMF package. The default algorithm is Brunet. The rank (number of factors) is set by the user.
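The following plain-Python sketch illustrates the kind of factorisation performed here: a non-negative matrix V (taxa by samples) is approximated by the product of two smaller non-negative matrices W and H using multiplicative updates that reduce the Kullback-Leibler divergence, the update scheme on which the Brunet algorithm is based. It is a simplified illustration, not the NMF package's code: it uses a single random start, a fixed number of iterations, made-up data and hypothetical function names.

```python
import math
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kl_divergence(v, wh, eps=1e-12):
    """Generalised Kullback-Leibler divergence between V and its approximation WH."""
    total = 0.0
    for row_v, row_wh in zip(v, wh):
        for x, y in zip(row_v, row_wh):
            if x > 0:
                total += x * math.log(x / (y + eps))
            total += y - x
    return total

def nmf_kl(v, rank, n_iter=200, seed=0, eps=1e-12):
    """Approximate V (n x m) by W (n x rank) times H (rank x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # Update H with W fixed
        wh = matmul(w, h)
        ratio = [[v[i][j] / (wh[i][j] + eps) for j in range(m)] for i in range(n)]
        w_sums = [sum(w[i][k] for i in range(n)) for k in range(rank)]
        h = [[h[k][j] * sum(w[i][k] * ratio[i][j] for i in range(n)) / (w_sums[k] + eps)
              for j in range(m)] for k in range(rank)]
        # Update W with H fixed
        wh = matmul(w, h)
        ratio = [[v[i][j] / (wh[i][j] + eps) for j in range(m)] for i in range(n)]
        h_sums = [sum(h[k][j] for j in range(m)) for k in range(rank)]
        w = [[w[i][k] * sum(ratio[i][j] * h[k][j] for j in range(m)) / (h_sums[k] + eps)
              for k in range(rank)] for i in range(n)]
    return w, h

# Made-up count matrix: rows are taxa, columns are samples
v = [[5.0, 4.0, 0.0, 1.0],
     [4.0, 5.0, 1.0, 0.0],
     [0.0, 1.0, 5.0, 4.0],
     [1.0, 0.0, 4.0, 5.0]]

w, h = nmf_kl(v, rank=2)
print(kl_divergence(v, matmul(w, h)))
```

Each column of W describes one factor as a weighted combination of taxa, and each row of H gives the contribution of that factor to every sample.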
General options
The following table describes general Calypso options that are present on most analysis pages.
Field | Description |
---|---|
Figure Format | The file format of generated figures (PNG, PDF or SVG). |
Level | The taxonomic level (if provided: superkingdom, phylum, class, order, genus, species) or OTU |
Order | The order of samples in the generated figures. Samples can be ordered by their primary group, secondary group, pair, or label. |
Filter | Selects how many of the most abundant taxa (highest mean across samples) are included. Set to 0 to include all taxa. |
Color | Color palette for plots |
Secondary Group | Used to filter samples by their secondary group. Select "All" to include all samples. |
Distance | Distance metric for computing pair-wise distances of samples. |
Resolution | Resolution of the generated plot in dpi. Allowed range: 20-1000 |
Width | Width of generated plot in mm. |
Height | Height of generated plot in mm. |
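The behaviour of the Filter option in the table above can be sketched as follows. This is an illustration with made-up abundances and a hypothetical function name, not Calypso's code: taxa are ranked by their mean value across samples, the top n are kept, and 0 keeps all taxa.

```python
def top_taxa(abundance, n):
    """Return the n taxa with the highest mean abundance across samples.
    abundance maps a taxon name to its per-sample values; n = 0 keeps all taxa."""
    ranked = sorted(abundance,
                    key=lambda taxon: sum(abundance[taxon]) / len(abundance[taxon]),
                    reverse=True)
    return ranked if n == 0 else ranked[:n]

# Made-up relative abundances in three samples
abundance = {
    "Bacteroides": [30.0, 25.0, 35.0],
    "Prevotella": [5.0, 10.0, 3.0],
    "Lactobacillus": [12.0, 15.0, 9.0],
    "Escherichia": [1.0, 0.5, 2.0],
}

print(top_taxa(abundance, 2))
```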
Implementation
The Calypso web-frontend is implemented in Java using the JavaServer Faces architecture. Interactive views are facilitated by the JavaScript library D3.js. The backend is implemented in Perl and the R statistical programming language. No installation, configuration, registration or login is required. Data is kept private and cannot be viewed by other users. Uploaded data and calculated results are deleted after the user's session terminates.
FAQ
How to cite Calypso
Zakrzewski M, Proietti C, Ellis J, Hasan S, Brion MJ, Berger B, Krause L (2016) Calypso: A User-Friendly Web-Server for Mining and Visualizing Microbiome-Environment Interactions. Bioinformatics. Accepted on 15/11/2016.
If you find this software useful, please help us by sharing Calypso on social media (Facebook, Twitter, Google+, etc).
Error message: An internal error occurred, likely you either didn't upload a data and meta data file or your session has expired
Before using Calypso, you need to upload a data and meta data file. Select Home in the top menu and upload a data and meta data file or press "Start Demo Project". Make sure that JavaScript is enabled in your browser settings and that cookies are allowed. If you receive this error message after pressing "Start Demo Project", your browser does not allow cookies.
You also receive this error message if your session has expired. Your session expires if you haven't used Calypso for more than 60 minutes.
Figure labels are only partially displayed
Increase the figure width and height.
CCA changes if "Color by" is changed, but CCA+ does not
CCA provides a p-value indicating whether the variance observed in the community composition can be explained by the sample groups. The sample grouping is selected under "Color by". CCA+ tests whether variance in the community composition can be attributed to environmental variables and does not include the sample groups in the statistical analysis. Samples are nevertheless colored by the grouping selected under "Color by".
Figure labels overlap
Increase the figure width and height.
Figure legend overlaps with chart
Increase the figure width and height.
Figures are completely white
Solution: Reduce resolution of figure or increase figure width and height.
Error message: Internal ERROR: null dataMatrix
Likely you forgot to upload a data and/or meta data file. Select Home in the top menu and upload a data and meta data file.
Error message: j_id_id15:resolution: Validation Error: Value is not of the correct type
You have entered an incorrect value in one of the text fields. Valid value ranges are: Resolution: 20-1000; Width: 20-10000; Height: 20-10000; Min proportion: 0-100 (real numbers are supported, e.g. 0.2).
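If you prepare parameter values outside Calypso, they can be checked against the ranges above before they are entered. The sketch below is a hypothetical helper, not part of Calypso; the names `LIMITS` and `is_valid` are made up, and accepting real numbers for every field is an assumption (the text only confirms it for Min proportion).

```python
# Valid ranges as listed above (inclusive bounds assumed)
LIMITS = {
    "Resolution": (20, 1000),
    "Width": (20, 10000),
    "Height": (20, 10000),
    "Min proportion": (0, 100),
}

def is_valid(field, text):
    """Return True if text is a number inside the allowed range of the field."""
    low, high = LIMITS[field]
    try:
        value = float(text)
    except ValueError:
        return False
    return low <= value <= high

print(is_valid("Resolution", "300"))      # inside 20-1000
print(is_valid("Resolution", "abc"))      # not a number
print(is_valid("Min proportion", "0.2"))  # real numbers are accepted
```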
The figure in Network+ does not contain my environmental variables
Please check that your annotation file includes the environmental variables. Calypso treats column 7 and all following columns of the meta data file as environmental variables.
Warning: no environmental variables provided
Please check that your meta data file includes the environmental variables. Calypso treats column 7 and all following columns of the meta data file as environmental variables. Without environmental variables, methods such as CCA+ and RDA+ cannot be executed.
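To check quickly which columns of a meta data file would be treated as environmental variables, the header can be inspected as in the sketch below. It assumes a tab-delimited file with a header row, which may not match your file. The column names are placeholders invented for the example and the function name is hypothetical.

```python
import csv
import io

def environmental_columns(meta_text, first_env_column=7):
    """Return the header names from column 7 (1-based) onwards.
    Assumption: tab-delimited text whose first row is a header."""
    reader = csv.reader(io.StringIO(meta_text), delimiter="\t")
    header = next(reader)
    return header[first_env_column - 1:]

# Placeholder header: six leading columns followed by two environmental variables
meta_text = "col1\tcol2\tcol3\tcol4\tcol5\tcol6\tpH\tAge\n" \
            "s1\ta\tb\tc\td\te\t6.8\t34\n"

print(environmental_columns(meta_text))
```

An empty result means the file has no columns beyond the sixth, which corresponds to the warning above.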