Calypso File Upload

Revision as of 04:42, 26 March 2018 by Calypso (Talk | contribs) (Example input files)


Input files

As input Calypso requires a data matrix providing taxonomic information (counts file) and an annotation file providing metadata for each sample. Input files can be in tab separated or comma separated format and can be created in spreadsheet programs such as Excel. The data matrix can also be in the common biom format or in the Mothur Summary.tax format. The data upload page provides several normalization and data transformation methods, including centered log ratio, log or square root transformation.

Additionally, an optional taxonomy and a matrix of pair-wise community distances can be uploaded. For example, this facilitates data analysis using the UniFrac metric.

Example input files

Example input files can be downloaded from here:

Example Data Matrix in Calypso V3 format,

Example Data Matrix in Calypso OTU Table format with tax information,

Example Metadata file in v3 format,

Example Metadata file in v6 format,

Example Distance Matrix

Meta data file

The metadata file is a simple text file in comma- or tab-separated format that provides meta-information for each sample. Meta-information includes: (i) individual identifiers, which are used for paired analysis (e.g. paired t-test) if several samples were collected from the same individual (or environment) during a longitudinal study; (ii) sample groups (e.g. case/control or geography); (iii) an Include column specifying whether samples should be included in or excluded from data analysis, which allows outliers or problematic samples to be excluded without modifying any data files; and (iv) multiple optional explanatory (or environmental) variables, which are used in multivariate data analysis. These can represent variables manipulated by the experimenter (e.g. case/control), potential confounding factors (e.g. age, BMI) or other factors that are potentially associated with community composition. Both numeric and categorical variables are supported.


The meta data file can be created in Excel. Rows represent samples and columns represent meta-information. All samples present in the data file must be defined in the meta data file.

Calypso supports four meta data formats: simple, basic, V6 and V3.

Simple format

The first column must list the ids of all samples. Additionally, one or more columns of metadata must be provided (study groups, biological condition, treatment etc).

Basic format

The first column must list the ids of all samples, the second column provides sample labels, and the third column takes the values 1 and 0 to specify which samples are included in (1) and which are excluded from (0) all analyses. Additionally, one or more columns of metadata must be provided (study groups, biological condition, treatment etc).

V6 and V3 formats

The V6 format is a simplified version of the V3 format and will be the default format in the future. Please make sure to select the correct format when uploading the metadata file.

More details can be found here:

Metadata V3 format

Metadata V6 format



Missing values

"NA" is a reserved value and depicts missing values. Missing values must be given as NA.

Meta data file format

  • The first row of the meta data file is a header row. The names of each column can be defined by the user, but the order of the columns must follow the above specification.
  • Fields must not be quoted
  • After creating and saving the file in a spreadsheet program such as Excel, open the file in a plain-text editor such as Notepad or WordPad to check the format (e.g. ensure that values are not quoted and that the file is in either comma, tab or semicolon separated format).
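
The format rules above can also be checked programmatically before upload. The following is a minimal Python sketch, assuming a metadata file in the basic format; the function name check_basic_metadata and the column names in the example are hypothetical and not part of Calypso.

```python
import csv
import io

def check_basic_metadata(text, delimiter=","):
    """Return a list of problems found in a basic-format metadata file."""
    problems = []
    if '"' in text:
        problems.append("fields must not be quoted")
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    # Basic format: id, label, include flag, then one or more metadata columns.
    if len(header) < 4:
        problems.append("expected id, label, include and at least one metadata column")
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
            continue
        if row[2].strip() not in ("0", "1"):
            problems.append(f"line {line_no}: include column must be 0 or 1")
        if any(value.strip() == "" for value in row):
            problems.append(f"line {line_no}: empty field, use NA for missing values")
    return problems

# Hypothetical example file: the header names are free, the column order is not.
example = (
    "SampleID,Label,Include,Group,Age\n"
    "SA1,Infant 1,1,case,2\n"
    "SA2,Infant 2,1,control,NA\n"
    "SA3,Infant 3,0,case,3\n"
)
print(check_basic_metadata(example))
```

An empty list means that none of the checked rules were violated; it does not guarantee that Calypso will accept the file.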

Data file

The data file provides the number of 16S or metagenomic sequences assigned to each taxon or operational taxonomic unit (OTU). Various file formats are supported, including the common biom format, which allows direct upload of pre-processed files generated by other analysis pipelines, such as QIIME, mothur (Schloss, et al., 2009), MG-RAST (Meyer, et al., 2008) or MetaPhlAn (Segata, et al., 2012). A data matrix in biom format can be uploaded directly into Calypso. Biom and QIIME mapping files can also be converted using our Converter.

Calypso also supports upload of Mothur OTU tables. More details can be found below.

QIIME Users

For users preprocessing their data with QIIME, we recommend uploading the QIIME generated biom files and the QIIME generated distance files. For QIIME 2 users, we recommend uploading the QIIME 2 generated biom file and additionally the QIIME 2 generated .tsv file. Alternatively, QIIME 2 users can format their data in the Calypso OTU table with QIIME 2 .tsv file format and additionally upload the QIIME 2 generated .tsv file.

Calypso format

Data files can also be formatted as a simple n x m matrix in text format (Calypso v3 format or Calypso OTU table format). Text files can be comma, tab or semicolon separated. Rows represent taxa, columns represent samples, and values represent the number of sequences assigned to each taxon. The file can be created in spreadsheet programs such as Excel.

Calypso V3 format

Simple text file providing read counts for observed taxa. Taxonomic assignments obtained for multiple taxonomic ranks (e.g. phylum, family, genus, OTU) can be combined in a single counts file.

Calypso OTU table format

The Calypso OTU table with tax information format is a simplified version of the Calypso v3 format and only provides taxonomic counts at the OTU level. Taxonomic counts for all other levels (species, genus, family, order, class and phylum) are automatically generated by Calypso.

Details of the Calypso V3 format and Calypso OTU table format

The first row is a header line, providing the ids of all included samples. These ids must match the sample ids defined in the meta data file. The header line has the following format: The first column defines the taxonomic rank (e.g. OTU, species, genus) and the second column must be named "Header" (Calypso identifies header lines using this "Header" tag). The following columns represent sample ids. Example header line:

Genus, Header, SA1, SA2, SA3, SA4

The following rows provide the number of sequences assigned to each taxon. The first column defines the taxonomic rank (e.g. OTU, species or genus), the second column defines the taxon and the following columns give the number of sequences of each sample assigned to that taxon. Example row:

Genus, Bifidobacterium, 349, 467, 1092, 12


In the Calypso V3 format, multiple taxonomic ranks can be combined in a single data matrix file. For each rank a separate header line has to be supplied. Example:

Family, Header, SA1, SA2, SA3, SA4
Family, Lachnospiraceae, 3849, 2374, 1287, 384
Family, Prevotellaceae, 848, 383, 4582, 234
Family, Bifidobacteriaceae, 4743, 2748, 3822, 384

Genus, Header, SA1, SA2, SA3, SA4
Genus, Enterococcus, 123, 3843, 2485, 234
Genus, Streptococcus, 747, 422, 857, 58
Genus, Clostridium, 458, 12, 284, 864
Genus, Eubacterium, 84, 12, 45, 23
Genus, Unclassified, 348, 384, 1343, 485

Unclassified is a special tag for the number of reads that could not be assigned to any of the listed classes (taxa, OTUs, etc.). Unclassified reads are excluded from some types of analysis, e.g. rarefaction analysis of community diversity.
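
Because every data row must supply exactly one count per sample id in the header line of its rank, it can help to check a file before upload. The following Python sketch parses the layout described above and reports rows whose number of counts does not match the header; the function name parse_v3_matrix is hypothetical, and tolerating blank lines between rank blocks is an assumption of this sketch, not a documented Calypso rule.

```python
def parse_v3_matrix(text, delimiter=","):
    """Parse V3-style text into {rank: {taxon: {sample_id: count}}}."""
    tables = {}
    samples_by_rank = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines between rank blocks are skipped
        fields = [field.strip() for field in line.split(delimiter)]
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected rank, taxon and counts")
        rank, name, values = fields[0], fields[1], fields[2:]
        if name == "Header":
            # A header line starts a new block for this rank.
            samples_by_rank[rank] = values
            tables[rank] = {}
            continue
        if rank not in samples_by_rank:
            raise ValueError(f"line {line_no}: no header line for rank {rank!r}")
        samples = samples_by_rank[rank]
        if len(values) != len(samples):
            raise ValueError(
                f"line {line_no}: {len(values)} counts for {len(samples)} samples"
            )
        tables[rank][name] = {s: int(v) for s, v in zip(samples, values)}
    return tables

example = """Genus, Header, SA1, SA2, SA3, SA4
Genus, Bifidobacterium, 349, 467, 1092, 12
"""
tables = parse_v3_matrix(example)
print(tables["Genus"]["Bifidobacterium"]["SA3"])  # count for sample SA3
```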

After creating and saving the file in a spreadsheet program such as Excel, open the file in a plain-text editor such as Notepad or WordPad to check the format (e.g. ensure that values are not quoted and that the file is in either comma, tab or semicolon separated format).

The Calypso OTU table with tax information format is a simplified version of the Calypso v3 format and only provides taxonomic counts for the OTU level. Taxonomic counts for all other levels (species, genus, family, order, class and phylum) are automatically generated by Calypso.

Example data matrix in Calypso v3 format

Calypso data matrix.PNG


Example data matrix in Calypso OTU table with tax information format (comma separated)

OTU, Header, h278B.2, USygt45.T2, USygt45.T1, Amz7adltF, AmzC13babyF, Amz21chld, USygt36.T1, USygt36.T2, h165S
OTU, k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Peptococcaceae;g__;s__ 405, 520804, 20584, 114912, 612, 163, 1588, 34316, 96692, 254
OTU, k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhodospirillales;f__Rhodospirillaceae;g__uncultured;s__ 573, 61841, 134015, 4666, 8566, 649, 22989, 118644, 22878, 52945
OTU, k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfurellales;f__Desulfurellaceae;g__H16;s__uncultured bacterium 705, 68, 121828, 23050, 129, 91, 9, 281687, 184293, 1179
OTU, k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__ 768, 5079, 45258, 496, 26161, 4430, 58961, 16404, 98261, 120895
OTU, k__Bacteria;p__Saccharibacteria;c__uncultured bacterium;o__;f__;g__;s__ 375, 100774, 91010, 42211, 2532, 124, 17693, 45053, 93943, 57494
OTU, k__Bacteria;p__Bacteroidetes;c__Sphingobacteriia;o__Sphingobacteriales;f__Sphingobacteriaceae;g__Mucilaginibacter;s__ 478, 181189, 61491, 1, 16421, 346, 70059, 16, 6, 70121
OTU, k__Bacteria;p__Microgenomates;c__Candidatus Daviesbacteria;o__uncultured bacterium;f__;g__;s__ 413, 9227, 160161, 272497, 6877, 581, 15663, 13936, 70144, 51665
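
How Calypso derives counts for the higher taxonomic levels from such an OTU table is not described on this page. One plausible way to picture it is to truncate each lineage string at the desired rank and sum the counts of all OTUs that share the truncated lineage. The Python sketch below illustrates this idea under that assumption; the function name collapse_to_rank is hypothetical, and the example uses three rows from the table above, shortened to their first two samples.

```python
RANK_PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]

def collapse_to_rank(otu_counts, prefix):
    """Sum OTU counts over all OTUs sharing the lineage down to `prefix`."""
    depth = RANK_PREFIXES.index(prefix) + 1
    totals = {}
    for lineage, counts in otu_counts.items():
        # Keep only the leading ranks, e.g. "k__Bacteria;p__Firmicutes".
        key = ";".join(lineage.split(";")[:depth])
        if key in totals:
            totals[key] = [a + b for a, b in zip(totals[key], counts)]
        else:
            totals[key] = list(counts)
    return totals

# Three OTUs from the example table, first two samples only.
otu_counts = {
    "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Peptococcaceae;g__;s__ 405": [520804, 20584],
    "k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhodospirillales;f__Rhodospirillaceae;g__uncultured;s__ 573": [61841, 134015],
    "k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfurellales;f__Desulfurellaceae;g__H16;s__uncultured bacterium 705": [68, 121828],
}
phylum_counts = collapse_to_rank(otu_counts, "p__")
for lineage, counts in phylum_counts.items():
    print(lineage, counts)
```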

Mothur Summary.tax files

Calypso supports Mothur output files. More details can be found here.

Example detailed Summary.tax file:

taxlevel	 rankID	 taxon	 daughterlevels	 total	F11Fcsw	F12Fcsw	F13Fcsw	F14Fcsw	F21Fcsw	...	
0	0	Root	1	27517	699	666	925	1301	707	...	
1	0.1	Bacteria	10	27517	699	666	925	1301	707  ...	
2	0.1.1	"Acidobacteria"	1	270	61	52	46	75	2	...	
3	0.1.1.1	Holophagae	1	270	61	52	46	75	2	...	
4	0.1.1.1.1	Holophagales	1	270	61	52	46	75	2	...	
5	0.1.1.1.1.1	Holophagaceae	1	270	61	52	46	75	2	...	
6	0.1.1.1.1.1.1	Holophaga	0	270	61	52	46	75	2	...	
2	0.1.2	"Actinobacteria"	1	258	44	31	55	79	1	...
...

Example simple Summary.tax file:

taxon	total	A	B	C
"k__Bacteria";"p__Actinobacteria";"c__Actinobacteria";"o__Bifidobacteriales";"f__Bifidobacteriaceae";"g__Bifidobacterium";"s__";	1	0	1	0
"k__Bacteria";"p__Actinobacteria";"c__Actinobacteria";"o__Bifidobacteriales";"f__Bifidobacteriaceae";"g__Bifidobacterium";"s__adolescentis";	1	0	1	0
"k__Bacteria";"p__Actinobacteria";"c__Actinobacteria";"o__Bifidobacteriales";"f__Bifidobacteriaceae";"g__Bifidobacterium";"s__longum";	1	0	1	0
...

Distance matrix

Optionally, a distance matrix can be imported into Calypso, which can be subsequently utilized in distance-based methods, such as PCoA.

The matrix contains pairwise sample distances in comma, tab or semicolon separated format. Both row names and column names must be defined. A distance matrix can for example be calculated using UniFrac.
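
As an illustration of the expected layout, the following Python sketch computes a pairwise distance matrix and writes it with both row and column names. UniFrac requires a phylogenetic tree, so the sketch substitutes the Bray-Curtis dissimilarity as a stand-in; the function names are hypothetical, and leaving the top-left corner cell empty is an assumption that should be compared against the example distance matrix.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two count vectors."""
    total = sum(x) + sum(y)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(x, y)) / total

def distance_table(counts_by_sample):
    """Build rows (including header row) of a labelled square distance matrix."""
    samples = list(counts_by_sample)
    rows = [[""] + samples]  # assumption: empty corner cell, then column names
    for s in samples:
        distances = [
            f"{bray_curtis(counts_by_sample[s], counts_by_sample[t]):.4f}"
            for t in samples
        ]
        rows.append([s] + distances)  # row name followed by distances
    return rows

def to_text(rows, delimiter="\t"):
    """Join rows into unquoted delimiter-separated text."""
    return "\n".join(delimiter.join(row) for row in rows) + "\n"

# Hypothetical counts: one vector of taxon counts per sample.
counts_by_sample = {"SA1": [1, 3], "SA2": [3, 1]}
print(to_text(distance_table(counts_by_sample)))
```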

Example distance matrix

Calypso distance matrix.PNG

Taxonomy file

Calypso allows users to select a reference taxonomy, which enables interactive visualizations of the community composition. The user can either select a taxonomy already incorporated into the Calypso software or upload a custom taxonomy. Calypso supports custom taxonomies in the file formats used by RDP or Greengenes. Taxa present in the data matrix file but absent from the taxonomy file will be ignored during the import and in taxonomy based visualisations.

RDP uses Bergey's taxonomy. The custom file is in an XML format, as shown in the example below.

       <file>RDPExampleFile</file>
       <TreeNode name="Root" taxid="0" rank="rootrank" parentTaxid="-1" leaveCount="1" genusIndex="-1"></TreeNode>
       <TreeNode name="Bacteria" taxid="1" rank="domain" parentTaxid="0" leaveCount="1" genusIndex="-1"></TreeNode>
       <TreeNode name="&quot;Firmicutes&quot;" taxid="61" rank="phylum" parentTaxid="1" leaveCount="1" genusIndex="-1"></TreeNode>
       <TreeNode name="&quot;Lactobacillales&quot;" taxid="160" rank="order" parentTaxid="62" leaveCount="1" genusIndex="-1"></TreeNode>
       <TreeNode name="Streptococcaceae" taxid="203" rank="family" parentTaxid="160" leaveCount="2" genusIndex="-1"></TreeNode>
       <TreeNode name="Streptococcus" taxid="206" rank="genus" parentTaxid="203" leaveCount="1" genusIndex="2"></TreeNode>
       <TreeNode name="Lactococcus" taxid="204" rank="genus" parentTaxid="203" leaveCount="1" genusIndex="273"></TreeNode>

Calypso requires the attributes:

  • name: name of the taxon, which should be equivalent to the taxa name defined in the data matrix file,
  • taxid: a taxonomy ID for the taxon,
  • rank: either rootrank, domain, phylum, class, order, family or genus,
  • parentTaxid: taxonomy ID of the parent of the taxon in the taxonomic hierarchy.

The attribute values must always be enclosed in double quotation marks.
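
A TreeNode line with the four required attributes can be generated programmatically. The Python sketch below is one way to do so under stated assumptions: the helper names attr and tree_node are hypothetical, and whether Calypso accepts lines that omit leaveCount and genusIndex has not been verified here. A taxon name that itself contains double quotes is written with &quot; so that the attribute stays enclosed in double quotation marks and the line remains well-formed XML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def attr(value):
    """Return an XML attribute value enclosed in double quotes."""
    return '"' + escape(str(value), {'"': "&quot;"}) + '"'

def tree_node(name, taxid, rank, parent_taxid):
    """Format one TreeNode line with the four attributes listed above."""
    return (
        f"<TreeNode name={attr(name)} taxid={attr(taxid)} "
        f"rank={attr(rank)} parentTaxid={attr(parent_taxid)}></TreeNode>"
    )

line = tree_node('"Firmicutes"', 61, "phylum", 1)
print(line)

# Round trip: a standard XML parser recovers the quoted taxon name.
parsed = ET.fromstring(line)
print(parsed.get("name"))
```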

The taxonomy file in Greengenes format is a text file that lists the taxonomy id and the lineage of the taxon in each line as exemplified below.

       1043951 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Propionibacteriaceae; g__Propionibacterium; s__acnes
       4449524 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae; g__Prevotella; s__
       4464084 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
       748636  k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Microbacteriaceae; g__Mycetocola; s__
       553810  k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__aeruginosa
       93366   k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__; s__

Each line must start with a number (tax id) and contain superkingdom (k__), phylum (p__), class (c__), order (o__), family (f__), genus (g__) and species (s__), which must all be separated by a semicolon followed by a space "; ". If no classification is available for a rank, the rank must still be listed with an empty taxon name (e.g. "g__;").
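
These rules translate into a simple line check. The following Python sketch encodes them as a regular expression; the function name is_valid_greengenes_line is hypothetical, and the check only covers the rules stated above, not any further validation Calypso may perform.

```python
import re

RANKS = ["k", "p", "c", "o", "f", "g", "s"]
# Each rank is "<letter>__<name>", where the name may be empty;
# ranks are joined by a semicolon followed by a single space.
LINEAGE = "; ".join(f"{rank}__[^;]*" for rank in RANKS)
GREENGENES_LINE = re.compile(rf"^\d+\s+{LINEAGE}$")

def is_valid_greengenes_line(line):
    """Check tax id, all seven ranks in order, and the '; ' separator."""
    return GREENGENES_LINE.match(line.strip()) is not None

good = "93366   k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__; s__"
bad = "93366   k__Bacteria;p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__; s__"
print(is_valid_greengenes_line(good))
print(is_valid_greengenes_line(bad))  # missing space after the first semicolon
```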

Example taxonomy file


Data upload

Data is uploaded via the Data upload page. Various normalization and data transformation methods are provided, including centered log ratio, log, asinh, square root transformation and quantile normalization.

Quantile normalization is implemented using the normalize.quantiles() function from the R preprocessCore package. The function is run with default parameters.
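
To illustrate the idea behind quantile normalization, the following Python sketch forces all samples to share the same distribution: values are sorted within each sample, averaged across samples at each rank position, and the averages are assigned back in the original order. It is a simplified illustration, not a reimplementation of the R function mentioned above; in particular, tied values are not treated specially here, and the function name quantile_normalize is hypothetical.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length sample columns."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the i-th smallest value across all samples.
    rank_means = [
        sum(col[i] for col in sorted_cols) / len(columns) for i in range(n)
    ]
    result = []
    for col in columns:
        # Positions of the values in ascending order (ties broken by position).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result.append(out)
    return result

# Two hypothetical samples with three features each.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
print(normalized)
```

After normalization, every sample contains the same set of values; only their assignment to features differs.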

Data normalisation and transformation

Calypso provides various transformation methods to account for the generally non-normal distribution of microbial community composition data. To render the data suitable for analysis by standard statistical procedures, community profiles can be transformed by log, asinh or square root transformation. Sequencing based community profiling yields a special data type called “compositional data”, whose intrinsic properties can compromise statistical analysis: the elements of a compositional data vector are non-negative and sum to a constant, so they are not independent of each other. To remove this non-independence of relative bacterial abundances, Calypso supports transformation by centered log ratio, which is one of the most widely used transformations for compositional data.
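
For reference, the centered log ratio of a sample vector x is log(x_i) minus the mean of log(x) over all taxa in that sample. The Python sketch below illustrates this; because the logarithm of zero is undefined, it adds a pseudocount to every count, which is an assumption of this sketch. How Calypso handles zero counts is not stated on this page, and the function name clr is hypothetical.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log ratio transform of one sample's taxon counts."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log equals dividing by the geometric mean.
    return [value - mean_log for value in logs]

# Hypothetical counts of five taxa in one sample.
sample = [123, 747, 458, 84, 348]
transformed = clr(sample)
print(transformed)
print(sum(transformed))  # CLR values of a sample sum to (approximately) zero
```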

Additionally, counts data can be normalised by total-sum normalization (TSS), the most commonly used approach. TSS divides read counts by the total number of reads in each sample and thereby converts read counts to appropriately scaled ratios.
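
As a small worked example of TSS, the Python sketch below divides each count of a sample by that sample's total. The function name tss and the handling of samples without any reads are choices made for this illustration only.

```python
def tss(counts):
    """Total-sum scaling: convert one sample's counts to proportions."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)  # assumption: an empty sample stays all zero
    return [c / total for c in counts]

# Hypothetical counts of five taxa in one sample (1760 reads in total).
sample = [123, 747, 458, 84, 348]
proportions = tss(sample)
print(proportions)
```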

The Scale option standardizes the data: for each feature (row of the data matrix), the values are scaled to mean=0 and variance=1 using the R scale() function.
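
The effect on a single feature can be sketched in Python as follows. Like R's scale(), the sketch divides by the sample standard deviation (denominator n - 1). Returning zeros for a constant feature is a choice made for this sketch; R would yield NaN values in that case. The function name scale_row is hypothetical.

```python
import statistics

def scale_row(values):
    """Scale one feature (row) to mean 0 and variance 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        return [0.0] * len(values)  # deviation from R, which returns NaN here
    return [(v - mean) / sd for v in values]

scaled = scale_row([1, 2, 3])
print(scaled)
```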

Filter rare taxa

Rare taxa, which likely represent sequencing errors, are excluded during data import. The threshold can be set using the "Filter rare taxa" parameter. The default value is 0.5%, meaning that only taxa to which more than 0.5% of sequences are assigned in at least one sample are imported.
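
The filtering rule can be sketched in Python as follows, assuming that the percentage is taken relative to the total number of reads of each sample in the uploaded table. That interpretation, the function name filter_rare_taxa and the example counts are assumptions of this sketch.

```python
def filter_rare_taxa(counts, threshold_percent=0.5):
    """Keep taxa exceeding the threshold in at least one sample.

    counts maps each taxon to a list with one read count per sample.
    """
    if not counts:
        return {}
    n_samples = len(next(iter(counts.values())))
    # Total reads per sample across all taxa.
    totals = [sum(row[i] for row in counts.values()) for i in range(n_samples)]
    kept = {}
    for taxon, row in counts.items():
        if any(
            total > 0 and 100.0 * count / total > threshold_percent
            for count, total in zip(row, totals)
        ):
            kept[taxon] = row
    return kept

# Hypothetical table with two samples of 1000 reads each.
counts = {"A": [990, 995], "B": [9, 4], "C": [1, 1]}
kept = filter_rare_taxa(counts)
print(list(kept))
```

Taxon B is kept because it reaches 0.9% in the first sample, whereas taxon C never exceeds 0.1% and is dropped.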

Filter Taxa

Cyanobacteria and Chloroplasts can be excluded from the counts file using the "Filter Taxa" drop-down menu.