# Calypso Stats Details

## StatsPage

Available tests: parametric tests (select Anova, "Nested Anova" or anova+) and nonparametric tests (select RankTest).

P-values are adjusted for multiple testing by Bonferroni correction (P.adj) and by false discovery rate (FDR q-value, Benjamini-Hochberg procedure). Click on a column header to sort the data table.
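As a rough illustration of the two adjustments, the sketch below uses base R's `p.adjust`. The p-values are made up, and whether Calypso calls `p.adjust` internally is an assumption; the column names simply mirror the labels used in the data table.

```r
# Hypothetical raw p-values, one per tested feature (illustration only).
p <- c(0.001, 0.008, 0.039, 0.041, 0.27)

# Bonferroni: each p-value is multiplied by the number of tests, capped at 1.
p_adj <- p.adjust(p, method = "bonferroni")

# Benjamini-Hochberg false discovery rate (q-values).
q_fdr <- p.adjust(p, method = "BH")

result <- data.frame(p = p, P.adj = p_adj, FDR = q_fdr)
print(result)
```

Bonferroni is the more conservative of the two: in this example only the first two features stay below 0.05 after Bonferroni correction, while the FDR q-values of the third and fourth features remain close to that threshold.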

The p-value distribution is also plotted as a histogram and compared to the expected distribution in a Q-Q plot. A uniform p-value distribution (red "Expected" line) indicates that the low p-values are observed by chance only. A distribution with more low p-values than expected by chance indicates that at least some of the observed differences are real rather than chance findings (as in the figure shown below).
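The comparison against the uniform expectation can be sketched in base R as follows. The p-values are simulated, and the bin width and plotting details are assumptions made for illustration, not necessarily what Calypso draws.

```r
set.seed(1)
# Simulated p-values: 900 null features (uniform) plus 100 features with
# a real effect (skewed towards 0). Purely illustrative data.
p <- c(runif(900), rbeta(100, 0.5, 10))

# Histogram counts in 20 bins; under the null each bin expects n / 20.
h <- hist(p, breaks = seq(0, 1, by = 0.05), plot = FALSE)
expected_per_bin <- length(p) / 20

# Q-Q coordinates against the uniform distribution.
observed <- sort(p)
expected <- ppoints(length(p))

# To draw the plots:
# plot(h); abline(h = expected_per_bin, col = "red")
# plot(expected, observed); abline(0, 1, col = "red")

# An excess in the first bin suggests more low p-values than chance predicts.
excess_low_p <- h$counts[1] > expected_per_bin
excess_low_p
```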

### RankTest

If RankTest is selected, a Wilcoxon test is performed when two groups are compared; otherwise a Kruskal-Wallis test is performed.
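The selection rule can be sketched with base R's `wilcox.test` and `kruskal.test`. The helper name `rank_test_p` and the data are hypothetical, and the unpaired Wilcoxon rank-sum variant is assumed here.

```r
# Hypothetical helper: choose the rank test from the number of groups.
rank_test_p <- function(values, groups) {
  groups <- factor(groups)
  if (nlevels(groups) == 2) {
    wilcox.test(values ~ groups)$p.value   # two groups: Wilcoxon rank-sum
  } else {
    kruskal.test(values ~ groups)$p.value  # more than two: Kruskal-Wallis
  }
}

# Made-up abundances of one feature in nine samples.
abundance <- c(1.2, 1.5, 1.1, 3.4, 3.9, 3.6, 6.1, 5.8, 6.4)

two   <- rank_test_p(abundance[1:6], rep(c("A", "B"), each = 3))
three <- rank_test_p(abundance, rep(c("A", "B", "C"), each = 3))
```

With only three samples per group, the two-group comparison cannot reach a p-value below 0.1 even when the groups are perfectly separated, which is worth keeping in mind for small designs.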

### Anova

Data values are compared across sample groups by analysis of variance (ANOVA), which tests for differences among the group means.
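A minimal one-way ANOVA for a single feature might look like the sketch below. The data frame and its column names are invented for illustration; only the use of `aov` is taken from the other sections of this page.

```r
# Hypothetical abundances of one feature in three sample groups.
d <- data.frame(
  abundance = c(4.1, 3.8, 4.4, 5.9, 6.3, 6.1, 8.0, 7.7, 8.2),
  group     = factor(rep(c("control", "low", "high"), each = 3))
)

fit <- aov(abundance ~ group, data = d)

# The first row of the ANOVA table holds the F test for the group effect.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value
```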

### NestedAnova

Select "Nested Anova" for nested designs (see the Handbook of Biological Statistics). In this case, the primary group (3rd column of the meta data file) is used as the group (e.g. treatment) and the secondary group (5th column of the meta data file) as the subgroup (nested variable, e.g. animal cage). The following test is run in R: aov(abundance(taxa) ~ primaryGroup/secondaryGroup).
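To make the nesting operator concrete, here is a small sketch with invented data: two treatments, two cages per treatment and three animals per cage. The variable names follow the formula above; the values and design size are assumptions.

```r
# Hypothetical design: 2 treatments, 2 cages per treatment, 3 animals per cage.
d <- data.frame(
  abundance      = c(5.1, 4.8, 5.3, 5.6, 5.9, 5.4, 7.9, 8.3, 8.1, 7.2, 7.5, 7.0),
  primaryGroup   = factor(rep(c("control", "treated"), each = 6)),
  secondaryGroup = factor(rep(c("cage1", "cage2", "cage3", "cage4"), each = 3))
)

# primaryGroup/secondaryGroup expands to
# primaryGroup + primaryGroup:secondaryGroup (cage nested within treatment).
fit <- aov(abundance ~ primaryGroup / secondaryGroup, data = d)
summary(fit)
```

The resulting table has one row for the treatment effect and one for the cages within treatments, so variation between cages is separated from variation between individual animals.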

### anova+

Multi-factor ANOVA to identify associations between features and multiple factors. Numeric factors are categorized automatically (e.g. a variable in the range 0-200 is divided into the categories 0-50, 50-100, 100-150 and 150-200). An ANOVA is run separately for each feature, with the feature as the dependent variable and each factor as an explanatory variable. The following test is run in R: aov(feature ~ factor1 + factor2 + ...). The ANOVA is run without interaction terms.
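The binning and the main-effects model can be sketched as follows. The metadata, the factor names `dose` and `sex`, and the use of `cut` with fixed breaks are assumptions for illustration; how Calypso picks its category boundaries is not specified here.

```r
set.seed(42)
# Hypothetical metadata: one numeric factor (0-200) and one categorical factor.
meta <- data.frame(
  dose = runif(40, 0, 200),
  sex  = factor(rep(c("f", "m"), times = 20))
)
feature <- rnorm(40, mean = 10)  # made-up abundances of a single feature

# Bin the numeric factor into four equal-width categories.
meta$dose_cat <- cut(meta$dose, breaks = c(0, 50, 100, 150, 200),
                     include.lowest = TRUE)

# Main effects only, no interaction terms.
fit <- aov(feature ~ dose_cat + sex, data = meta)
summary(fit)
```

In practice this model would be fitted once per feature, for example inside a loop over the columns of the abundance table.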

### Negative binomial distribution (DESeq2)

Statistical testing based on the negative binomial distribution is implemented using the DESeq2 software. DESeq2 was developed for RNA-seq data, but can also be used for community composition data. To use DESeq2, raw taxonomic counts have to be uploaded.

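The method is implemented in R, using the DESeq2 Bioconductor package:

```r
# Fit the negative binomial model and run Wald tests.
des <- DESeq(dds, test = "Wald", fitType = "parametric")
# Extract raw p-values with Cook's distance outlier filtering disabled.
p <- results(des, cooksCutoff = FALSE)$pvalue
```

Here, dds is a DESeqDataSet object created from the count data via the DESeqDataSetFromMatrix function.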

### Multivariate Random Forest

Complex associations between multiple factors and measurements can be identified by the multivariate Random Forest. Scores indicate the strength of the associations. The score is the mean decrease in accuracy of the forest when the values of the respective feature are permuted, i.e. when its information is removed. The following function is called in R: randomForest(group~factor1+factor2+factor3+...,importance=T,proximity=F,ntree=10000,mtry=20)