Calypso File Upload
- 1 Input files
- 2 Example input files
- 3 Meta data file
- 4 Data file
- 5 Calypso format
- 5.1 Calypso V3 format
- 5.2 Calypso OTU table format
- 5.3 Details Calypso V3 format and Calypso OTU table format
- 5.4 Example data matrix in Calypso v3 format
- 5.5 Example data matrix in Calypso OTU table with tax information format (comma separated)
- 5.6 Mothur Summary.tax files
- 6 Distance matrix
- 7 Taxonomy file
- 8 Data upload
- 9 Data normalisation and transformation
- 10 Filter rare taxa
- 11 Filter Taxa
Input files
As input, Calypso requires a data matrix providing taxonomic information (counts file) and an annotation file providing metadata for each sample. Input files can be in tab-separated or comma-separated format and can be created in spreadsheet programs such as Excel. The data matrix can also be in the common biom format or in the Mothur Summary.tax format. The data upload page provides several normalization and data transformation methods, including centered log ratio, log and square root transformation.
Additionally, an optional taxonomy and a matrix of pair-wise community distances can be uploaded. For example, this facilitates data analysis using the UniFrac metric.
Example input files
Example input files can be downloaded from here:
Meta data file
The metadata file is a simple text file in comma- or tab-separated format that provides meta-information for each sample. Meta-information includes: (i) individual identifiers, which are used for paired analysis (e.g. paired t-test) if several samples were collected from the same individual (or environment) during a longitudinal study; (ii) sample groups (e.g. case/control or geography); (iii) an Include column specifying whether a sample should be included in or excluded from data analysis, which allows outliers or problematic samples to be excluded without modifying any data files; and (iv) multiple optional explanatory (or environmental) variables, which are used in multivariate data analysis. These can represent variables manipulated by the experimenter (e.g. case/control), potential confounding factors (e.g. age, BMI) or other factors that are potentially associated with community composition. Both numeric and categorical variables are supported.
The meta data file can be created in Excel. Rows represent samples, columns meta information. All samples present in the data file must be defined in the meta data file.
Calypso supports four different metadata formats: the simple, basic, V6 and V3 formats.
Simple format
The first column must list the ids of all samples. Additionally, one or more columns of metadata must be provided (study groups, biological condition, treatment etc).
Basic format
The first column must list the ids of all samples, the second column provides sample labels, and the third column takes the values 1 and 0 to specify which samples are included in (1) and which samples are excluded from (0) all analyses. Additionally, one or more columns of metadata must be provided (study groups, biological condition, treatment etc).
V6 and V3 formats
The V6 format is a simplified version of the V3 format and will be the default format in the future. Please make sure to select the correct format when uploading the metadata file.
More details can be found here:
"NA" is a reserved value and depicts missing values. Missing values must be given as NA.
Meta data file format
- The first row of the meta data file is a header row. The names of each column can be defined by the user, but the order of the columns must follow the above specification.
- Fields must not be quoted
- After creating and saving the file in a spreadsheet program such as Excel, open the file in Notepad or WordPad to check the format (e.g. ensure that values are not quoted and that the file is in either comma, tab or semicolon separated format).
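The format rules above can also be checked programmatically before upload. The following is a minimal sketch, not part of Calypso: the function name `check_metadata` and the exact wording of the messages are assumptions made for illustration. It only tests the rules stated above (a header row, unquoted fields, NA for missing values) plus a consistent number of fields per row.

```python
import csv
import io

def check_metadata(text, delimiter=","):
    """Return a list of problems found in a metadata table given as text."""
    problems = []
    # Fields must not be quoted, so any double quote is reported.
    if '"' in text:
        problems.append("fields must not be quoted")
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # ignore blank lines
    if len(rows) < 2:
        problems.append("need a header row and at least one sample row")
        return problems
    header, samples = rows[0], rows[1:]
    if len(header) < 2:
        problems.append("need a sample id column plus at least one metadata column")
    for index, row in enumerate(samples, start=1):
        if len(row) != len(header):
            problems.append(
                f"sample row {index}: expected {len(header)} fields, found {len(row)}")
        # Missing values must be written as NA, never left empty.
        if any(field.strip() == "" for field in row):
            problems.append(f"sample row {index}: empty field, use NA for missing values")
    return problems
```

An empty result list means that none of these checks failed; it does not guarantee that Calypso will accept the file.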
Data file
The data file provides the number of 16S or metagenomic sequences assigned to each taxon or operational taxonomic unit (OTU). Various file formats are supported, including the common biom format, which allows direct upload of pre-processed files generated by other analysis pipelines, such as QIIME, mothur (Schloss, et al., 2009), MG-RAST (Meyer, et al., 2008) or MetaPhlAn (Segata, et al., 2012). A data matrix in biom format can be uploaded directly into Calypso. Biom and QIIME mapping files can also be converted using our Converter.
Calypso also supports upload of Mothur OTU tables. More details can be found below.
For users preprocessing their data with QIIME, we recommend uploading the QIIME generated biom files and the QIIME generated distance files. For QIIME 2 users, we recommend uploading the QIIME 2 generated biom file and additionally the QIIME 2 generated .tsv file. Alternatively, QIIME 2 users can format their data in the Calypso OTU table with QIIME 2 .tsv file format and additionally upload the QIIME 2 generated .tsv file.
Calypso format
Data files can also be formatted as a simple n x m matrix in text format (Calypso v3 format or Calypso OTU table format). Text files can be comma, tab or semicolon separated. Rows represent taxa, columns represent samples, and values represent the number of sequences assigned to each taxon. The file can be created in spreadsheet programs such as Excel.
Calypso V3 format
Simple text file providing read counts for observed taxa. Taxonomic assignments obtained for multiple taxonomic ranks (e.g. phylum, family, genus, OTU) can be combined in a single counts file.
Calypso OTU table format
Calypso OTU table with tax information is a simplified version of the Calypso v3 format and only provides taxonomic counts on OTU level. Taxonomic counts for all other levels (species, genus, family, order, class and phylum) are automatically generated by Calypso.
Details Calypso V3 format and Calypso OTU table format
The first row is a header line, providing the ids of all included samples. These ids must match the sample ids defined in the meta data file. The header line has the following format: The first column defines the taxonomic rank (e.g. OTU, species, genus) and the second column must be named "Header" (Calypso identifies header lines using this "Header" tag). The following columns represent sample ids. Example header line:
Genus, Header, SA1, SA2, SA3, SA4
The following rows provide the number of sequences assigned to each taxon. The first column defines the taxonomic rank (e.g. OTU, species or genus), the second column defines the taxon and the following columns give the number of sequences of each sample assigned to that taxon. Example row:
Genus, Bifidobacterium, 349, 467, 1092, 12
In the Calypso V3 format, multiple taxonomic ranks can be combined in a single data matrix file. For each rank a separate header line has to be supplied. Example:
Family, Header, SA1, SA2, SA3, SA4
Family, Lachnospiraceae, 3849, 2374, 1287, 384
Family, Prevotellaceae, 848, 383, 4582, 234, 32
Family, Bifidobacteriaceae, 4743, 2748, 3822, 384
Genus, Header, SA1, SA2, SA3, SA4
Genus, Enterococcus, 123, 3843, 2485, 234
Genus, Streptococcus, 747, 422, 857, 58
Genus, Clostridium, 458, 12, 284, 864, 78
Genus, Eubacterium, 84, 12, 45, 23, 192
Genus, Unclassified, 348, 384, 1343, 485
Unclassified is a special tag for the number of reads that could not be assigned to any of the listed classes (taxa, OTUs, etc). Unclassified is excluded from some types of analysis, e.g. rarefaction analysis of community diversity.
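To make the role of the "Header" tag concrete, the following sketch reads a Calypso V3 style matrix and groups the count rows by taxonomic rank. It is an illustration only and not Calypso's own parser: the function name `parse_calypso_v3` and the returned data structure are hypothetical, and rejecting rows whose number of counts differs from the number of sample ids is an assumption of this sketch.

```python
def parse_calypso_v3(lines, delimiter=","):
    """Group count rows by rank, using "Header" rows to pick up the sample ids."""
    tables = {}
    for raw in lines:
        fields = [field.strip() for field in raw.split(delimiter)]
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        rank, name, values = fields[0], fields[1], fields[2:]
        if name == "Header":
            # A header line starts a new table for this rank.
            tables[rank] = {"samples": values, "counts": {}}
            continue
        if rank not in tables:
            raise ValueError(f"no header line seen for rank {rank!r}")
        samples = tables[rank]["samples"]
        if len(values) != len(samples):
            raise ValueError(
                f"{rank} {name}: {len(values)} counts for {len(samples)} samples")
        tables[rank]["counts"][name] = [int(value) for value in values]
    return tables


example = [
    "Genus, Header, SA1, SA2, SA3, SA4",
    "Genus, Bifidobacterium, 349, 467, 1092, 12",
]
tables = parse_calypso_v3(example)
```

Here `tables["Genus"]["samples"]` holds the four sample ids and `tables["Genus"]["counts"]["Bifidobacterium"]` holds the four counts of the example row.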
After creating and saving the file in a spreadsheet program such as Excel, open the file in Notepad or WordPad to check the format (e.g. ensure that values are not quoted and that the file is in either comma, tab or semicolon separated format).
The Calypso OTU table with tax information format is a simplified version of the Calypso v3 format and only provides taxonomic counts for the OTU level. Taxonomic counts for all other levels (species, genus, family, order, class and phylum) are automatically generated by Calypso.
Example data matrix in Calypso v3 format
Example data matrix in Calypso OTU table with tax information format (comma separated)
OTU, Header, h278B.2, USygt45.T2, USygt45.T1, Amz7adltF, AmzC13babyF, Amz21chld, USygt36.T1, USygt36.T2, h165S
OTU, k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Peptococcaceae;g__;s__ 405, 520804, 20584, 114912, 612, 163, 1588, 34316, 96692, 254
OTU, k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhodospirillales;f__Rhodospirillaceae;g__uncultured;s__ 573, 61841, 134015, 4666, 8566, 649, 22989, 118644, 22878, 52945
OTU, k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfurellales;f__Desulfurellaceae;g__H16;s__uncultured bacterium 705, 68, 121828, 23050, 129, 91, 9, 281687, 184293, 1179
OTU, k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__ 768, 5079, 45258, 496, 26161, 4430, 58961, 16404, 98261, 120895
OTU, k__Bacteria;p__Saccharibacteria;c__uncultured bacterium;o__;f__;g__;s__ 375, 100774, 91010, 42211, 2532, 124, 17693, 45053, 93943, 57494
OTU, k__Bacteria;p__Bacteroidetes;c__Sphingobacteriia;o__Sphingobacteriales;f__Sphingobacteriaceae;g__Mucilaginibacter;s__ 478, 181189, 61491, 1, 16421, 346, 70059, 16, 6, 70121
OTU, k__Bacteria;p__Microgenomates;c__Candidatus Daviesbacteria;o__uncultured bacterium;f__;g__;s__ 413, 9227, 160161, 272497, 6877, 581, 15663, 13936, 70144, 51665
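How counts at higher ranks can be derived from OTU-level lineages is sketched below. This is not Calypso's implementation: the function name `aggregate_by_rank`, the choice to collect empty ranks under "Unclassified" and the small counts in the usage example are assumptions made for illustration. Lineage strings are expected without the trailing OTU identifier.

```python
def aggregate_by_rank(otu_counts, prefix):
    """Sum OTU-level counts over all OTUs that share a taxon at one rank.

    otu_counts maps a lineage string to a list of per-sample counts;
    prefix is a rank tag such as "p__" (phylum) or "f__" (family).
    """
    totals = {}
    for lineage, counts in otu_counts.items():
        taxon = "Unclassified"  # assumption: label used when the rank is empty
        for part in lineage.split(";"):
            part = part.strip()
            if part.startswith(prefix) and len(part) > len(prefix):
                taxon = part[len(prefix):]
        if taxon in totals:
            totals[taxon] = [a + b for a, b in zip(totals[taxon], counts)]
        else:
            totals[taxon] = list(counts)
    return totals


# Lineages taken from the example above; the counts are made up.
otus = {
    "k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__": [5, 1],
    "k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfurellales;f__Desulfurellaceae;g__H16;s__": [2, 3],
    "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Peptococcaceae;g__;s__": [4, 0],
}
phyla = aggregate_by_rank(otus, "p__")
```

With these inputs, the two Proteobacteria OTUs are summed into a single phylum-level row.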
Mothur Summary.tax files
Calypso supports Mothur output files. More details can be found here.
Example detailed Summary.tax file:
taxlevel rankID taxon daughterlevels total F11Fcsw F12Fcsw F13Fcsw F14Fcsw F21Fcsw ...
0 0 Root 1 27517 699 666 925 1301 707 ...
1 0.1 Bacteria 10 27517 699 666 925 1301 707 ...
2 0.1.1 "Acidobacteria" 1 270 61 52 46 75 2 ...
3 0.1.1.1 Holophagae 1 270 61 52 46 75 2 ...
4 0.1.1.1.1 Holophagales 1 270 61 52 46 75 2 ...
5 0.1.1.1.1.1 Holophagaceae 1 270 61 52 46 75 2 ...
6 0.1.1.1.1.1.1 Holophaga 0 270 61 52 46 75 2 ...
2 0.1.2 "Actinobacteria" 1 258 44 31 55 79 1 ...
...
Example simple Summary.tax file:
taxon total A B C
"k__Bacteria";"p__Actinobacteria";"c__Actinobacteria";"o__Bifidobacteriales";"f__Bifidobacteriaceae";"g__Bifidobacterium";"s__"; 1 0 1 0
"k__Bacteria";"p__Actinobacteria";"c__Actinobacteria";"o__Bifidobacteriales";"f__Bifidobacteriaceae";"g__Bifidobacterium";"s__adolescentis"; 1 0 1 0
"k__Bacteria";"p__Actinobacteria";"c__Actinobacteria";"o__Bifidobacteriales";"f__Bifidobacteriaceae";"g__Bifidobacterium";"s__longum"; 1 0 1 0
...
Distance matrix
Optionally, a distance matrix can be imported into Calypso and subsequently used in distance-based methods, such as PCoA.
The matrix contains pairwise sample distances in comma, tab or semicolon separated format. Both row names and column names must be defined. A distance matrix can for example be calculated using UniFrac.
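A quick sanity check of a distance matrix before upload could look like the sketch below. The function name `read_distance_matrix` is hypothetical. The requirements that rows and columns list the same samples in the same order, that the matrix is symmetric and that the diagonal is zero are general properties of a distance matrix assumed here, not rules stated by Calypso.

```python
import csv
import io

def read_distance_matrix(text, delimiter=","):
    """Parse a labelled square distance matrix and run basic consistency checks."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    col_names = [name.strip() for name in rows[0][1:]]  # first cell is the corner
    row_names = [row[0].strip() for row in rows[1:]]
    if row_names != col_names:
        raise ValueError("row names and column names must list the same samples")
    matrix = [[float(value) for value in row[1:]] for row in rows[1:]]
    n = len(col_names)
    if any(len(line) != n for line in matrix):
        raise ValueError("matrix is not square")
    for i in range(n):
        if matrix[i][i] != 0.0:
            raise ValueError(f"distance of {col_names[i]} to itself is not 0")
        for j in range(i + 1, n):
            if abs(matrix[i][j] - matrix[j][i]) > 1e-9:
                raise ValueError(
                    f"matrix is not symmetric for {col_names[i]}/{col_names[j]}")
    return col_names, matrix


# Hypothetical three-sample matrix in comma separated format.
text = ",SA1,SA2,SA3\nSA1,0,0.4,0.7\nSA2,0.4,0,0.5\nSA3,0.7,0.5,0\n"
names, matrix = read_distance_matrix(text)
```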
Example distance matrix
Taxonomy file
Calypso allows the user to select a reference taxonomy, which enables interactive visualizations of the community composition. The user can either select one of the taxonomies already incorporated into the Calypso software or upload a custom taxonomy. Calypso supports custom taxonomies in the file formats used by RDP or Greengenes. Taxa present in the data matrix file but absent from the taxonomy file are ignored during import and in taxonomy-based visualisations.
RDP uses Bergey's taxonomy. The custom file is in an XML format, as shown in the example below.
<file>RDPExampleFile</file>
<TreeNode name="Root" taxid="0" rank="rootrank" parentTaxid="-1" leaveCount="1" genusIndex="-1"></TreeNode>
<TreeNode name="Bacteria" taxid="1" rank="domain" parentTaxid="0" leaveCount="1" genusIndex="-1"></TreeNode>
<TreeNode name="&quot;Firmicutes&quot;" taxid="61" rank="phylum" parentTaxid="1" leaveCount="1" genusIndex="-1"></TreeNode>
<TreeNode name="&quot;Lactobacillales&quot;" taxid="160" rank="order" parentTaxid="62" leaveCount="1" genusIndex="-1"></TreeNode>
<TreeNode name="Streptococcaceae" taxid="203" rank="family" parentTaxid="160" leaveCount="2" genusIndex="-1"></TreeNode>
<TreeNode name="Streptococcus" taxid="206" rank="genus" parentTaxid="203" leaveCount="1" genusIndex="2"></TreeNode>
<TreeNode name="Lactococcus" taxid="204" rank="genus" parentTaxid="203" leaveCount="1" genusIndex="273"></TreeNode>
Calypso requires the attributes:
- name: name of the taxon, which should be equivalent to the taxa name defined in the data matrix file,
- taxid: a taxonomy ID for the taxon,
- rank: either rootrank, domain, phylum, class, order, family or genus,
- parentTaxid: taxonomy ID of the parent of the taxon in the taxonomic hierarchy.
The attribute values must always be enclosed in double quotation marks.
The taxonomy file in Greengenes format is a text file that lists the taxonomy id and the lineage of the taxon in each line as exemplified below.
1043951 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Propionibacteriaceae; g__Propionibacterium; s__acnes
4449524 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae; g__Prevotella; s__
4464084 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
748636 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Microbacteriaceae; g__Mycetocola; s__
553810 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__aeruginosa
93366 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__; s__
Each line must start with a number (tax id) and contain superkingdom (k__), phylum (p__), class (c__), order (o__), family (f__), genus (g__) and species (s__), which must all be separated by a semicolon followed by a space ("; "). If no classification is available for a rank, the rank must still be listed with an empty taxon name (e.g. "g__;").
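The line format described above can be validated with a few lines of code. The following is a sketch under the stated rules, not the parser used by Calypso; the names `RANK_TAGS` and `parse_greengenes_line` are hypothetical, and accepting any whitespace between the tax id and the lineage is an assumption.

```python
import re

# Rank tags in the order required by the format.
RANK_TAGS = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]

def parse_greengenes_line(line):
    """Split one taxonomy line into its numeric tax id and a rank-to-name mapping."""
    match = re.match(r"^(\d+)\s+(.*)$", line.strip())
    if not match:
        raise ValueError("line must start with a numeric taxonomy id")
    tax_id, lineage = match.group(1), match.group(2)
    parts = lineage.split("; ")  # ranks are separated by semicolon plus space
    if len(parts) != len(RANK_TAGS):
        raise ValueError(f"expected {len(RANK_TAGS)} ranks, found {len(parts)}")
    taxa = {}
    for tag, part in zip(RANK_TAGS, parts):
        if not part.startswith(tag):
            raise ValueError(f"expected a rank starting with {tag!r}, found {part!r}")
        taxa[tag] = part[len(tag):].strip()  # empty string if unclassified
    return int(tax_id), taxa


tax_id, taxa = parse_greengenes_line(
    "93366 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; "
    "f__Bacillaceae; g__; s__"
)
```

For this line, the genus and species entries are empty strings because no classification is available at those ranks.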
Example taxonomy file
Data upload
Data is uploaded via the Data upload page. Various normalization and data transformation methods are provided, including centered log ratio, log, asinh and square root transformation, and quantile normalization.
Quantile normalization is implemented using the normalize.quantiles() function from the R preprocessCore package. The function is run with default parameters.
Data normalisation and transformation
Calypso provides various transformation methods to account for the generally non-normal distribution of microbial community composition data. To render the data suitable for analysis by standard statistical procedures, community profiles can be transformed by log, asinh and square root transformation. Sequencing-based community profiling yields a special data type called "compositional data", which has intrinsic properties that can compromise statistical analysis: the elements of a compositional data vector are non-negative and sum to a constant. To remove the non-independence of relative bacterial abundances, Calypso supports data transformation by centered log ratio, which is one of the most widely used transformations for compositional data.
Additionally, count data can be normalised by total sum scaling (TSS), the most commonly used approach. TSS divides read counts by the total number of reads in each sample and thereby converts read counts to relative abundances.
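The two operations described above can be written down in a few lines. This is a generic sketch of total sum scaling and the centered log ratio, not Calypso's code: the function names are hypothetical, and adding a pseudocount of 0.5 before taking logarithms is an assumption made here to handle zero counts (how Calypso treats zeros is not stated).

```python
import math

def total_sum_scaling(counts):
    """Divide each read count of one sample by the sample's total read count."""
    total = sum(counts)
    return [count / total for count in counts]

def centered_log_ratio(counts, pseudocount=0.5):
    """Centered log ratio of one sample: log value minus the mean of the log values."""
    logs = [math.log(count + pseudocount) for count in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


sample = [10, 30, 60]            # hypothetical read counts of three taxa
relative = total_sum_scaling(sample)
clr = centered_log_ratio(sample)
```

The scaled values of a sample sum to 1, while the centered log ratio values sum to 0, which is what removes the constant-sum constraint.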
The Scale option standardizes the data: for each feature (row of the data matrix), the values are scaled to mean=0 and variance=1 using the R scale() function.
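For readers more familiar with Python than R, the scaling of a single feature can be sketched as follows. The function name `scale_feature` is hypothetical; the sketch uses the sample standard deviation (divisor n - 1), which is what R's scale() uses by default.

```python
import statistics

def scale_feature(values):
    """Center one feature (row) to mean 0 and scale it to variance 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, divisor n - 1
    return [(value - mean) / sd for value in values]


scaled = scale_feature([2, 4, 6])  # hypothetical values of one taxon in three samples
```

A feature with identical values in all samples has a standard deviation of 0 and cannot be scaled this way; the sketch raises a ZeroDivisionError in that case.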
Filter rare taxa
Rare taxa, which likely represent sequencing errors, are excluded during data import. The threshold can be set using the "Filter rare taxa" parameter. The default value is 0.5%, meaning that only taxa to which more than 0.5% of sequences are assigned in at least one sample are imported.
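The filtering rule can be illustrated with a short sketch. It is not Calypso's implementation: the function name `filter_rare_taxa` is hypothetical, and computing the percentage relative to the total read count of each sample in the given table is an assumption.

```python
def filter_rare_taxa(counts, threshold_percent=0.5):
    """Keep taxa that exceed the threshold percentage in at least one sample.

    counts maps a taxon name to a list of per-sample read counts.
    """
    if not counts:
        return {}
    n_samples = len(next(iter(counts.values())))
    # Total number of reads per sample, summed over all taxa.
    totals = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    kept = {}
    for taxon, row in counts.items():
        if any(total > 0 and 100.0 * count / total > threshold_percent
               for count, total in zip(row, totals)):
            kept[taxon] = row
    return kept


# Hypothetical counts for three taxa in two samples (1000 reads each).
table = {"A": [990, 995], "B": [10, 4], "C": [0, 1]}
kept = filter_rare_taxa(table)
```

Taxon B is kept because it reaches 1% in the first sample, even though it stays below the threshold in the second; taxon C never exceeds 0.5% and is dropped.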
Filter Taxa
Cyanobacteria and Chloroplasts can be excluded from the counts file using the "Filter Taxa" drop-down menu.