GMine File Upload

Input files

As input, GMine requires an n x m data matrix and an annotation (meta data) file that provides metadata for each sample. Input files can be tab, comma or semicolon separated and can be created in spreadsheet programs such as Excel. The data upload page provides several normalization and data transformation methods, including quantile normalization, variance stabilization and calibration for microarray data (vsn), centered log ratio, log, asinh and square root transformation.

Example input files

Example input files can be downloaded from here:


Example Data File

Example Meta Data (simple format)

Example Meta Data (basic format)

Example Meta Data (v6 format)

Example Meta Data (v3 format)

Meta data file

The meta data file can be in comma, tab, or semicolon separated format and can be created in Excel. Rows represent samples, columns meta information. All samples present in the data matrix file have to be defined in the meta data file.

GMine supports four meta data formats: simple, basic, V6 and V3.

Simple format

The first column must list the ids of all samples. Additionally, one or more columns of metadata must be provided (study groups, biological condition, treatment etc). Example Meta Data (simple format)

Basic format

The first column must list the ids of all samples, the second column provides sample labels, and the third column takes the values 1 and 0 to specify which samples are included (1) in and which are excluded (0) from all analyses. Additionally, one or more columns of metadata must be provided (study groups, biological condition, treatment etc.). Example Meta Data (basic format)

Meta data file in V3 and V6 format

The meta data file in v3 format must have at least the following 6 columns in exactly this order:

Format of the meta data file

Column | Name | Description
1 | Sample id | The identifier of each sample. Sample ids must match the sample ids of the uploaded data matrix.
2 | Label | A unique sample label. These labels are shown in generated figures instead of the sample id.
3 | Pair | Individual or animal. This information is used for paired analysis if several samples were taken from the same individual, e.g. at different time points during a longitudinal study or at different locations from the same individual. Give every sample a different pair id (e.g. 1, 2, 3, ...) if each sample was taken from a different individual.
4 | Primary group | The main sample group, e.g. case/control or treated/untreated. It is used for most univariate analyses and to colour figures.
5 | Secondary group | A secondary sample group, e.g. case/control, geography or sampling time point. The secondary group can also be used to colour figures and for univariate analysis. In some analyses the secondary group has a defined meaning, see details below. For example, it is used as the nested variable when comparing sample groups by nested ANOVA. For paired analysis of longitudinal data, it is interpreted as the sampling time point.
6 | Include | Takes the values 0 and 1 and indicates whether a sample is included (1) in or excluded (0) from the analysis. This column can be used to exclude samples from subsequent analysis without modifying the data matrix. For example, problematic samples can be excluded simply by setting their value to 0.
7, 8, 9, ... | Optional factors | These factors are used in multivariate analysis, e.g. multivariate regression, redundancy analysis or canonical correspondence analysis.
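As an illustration of the column order described above, the following minimal sketch writes a small tab separated meta data file in V3 layout with unquoted fields. The header names, sample ids, group values and the two optional factors (Age, Gender) are hypothetical; only the order of the first six columns is taken from the specification above.

```python
import csv
import io

# Hypothetical header names: the user may choose the names,
# but the order of the first six columns is fixed.
HEADER = ["SampleID", "Label", "Pair", "PrimaryGroup",
          "SecondaryGroup", "Include", "Age", "Gender"]

# Hypothetical samples: two individuals (Pair 1 and 2),
# each sampled at two time points (T1, T2).
ROWS = [
    ["S1", "P1_T1", 1, "case", "T1", 1, 54, "female"],
    ["S2", "P1_T2", 1, "case", "T2", 1, 54, "female"],
    ["S3", "P2_T1", 2, "control", "T1", 1, 61, "male"],
    ["S4", "P2_T2", 2, "control", "T2", 0, 61, "male"],  # Include = 0: excluded
]

def build_v3_metadata(header, rows):
    """Return the file content as tab separated text without quoted fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t",
                        quoting=csv.QUOTE_NONE, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

metadata_text = build_v3_metadata(HEADER, ROWS)
print(metadata_text)
```

Writing the text to a file (for example with open(..., "w")) gives a file that can be inspected in a text editor before upload.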


The V6 format is a simplified version of the V3 format. Example Meta Data (v6 format) Example Meta Data (v3 format)


Missing values

Note that NA is a reserved value that denotes a missing value. Missing values must be given as NA.

Optional factors

Factors are used in multivariate analysis to examine complex associations. If primary and/or secondary groups should be included in the multivariate analysis, these fields have to be specified again as factors.

Factors can be numeric and/or categorical. Example factors are BMI, gender, age, tissue, batch, blood sugar level, temperature, time of sampling, sample location, or pH. Categorical variables must contain non-numeric characters (e.g. T1, T2, ...). Categorical variables must not be encoded numerically (e.g. 1, 2, 3, ...).
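The rule about categorical factors can be checked before upload. The sketch below is one possible way to do this, not part of GMine: it flags a column whose non-missing values all parse as numbers and recodes it by adding a letter prefix. The function names and the prefix "T" are hypothetical.

```python
def is_numeric_column(values):
    """Return True if every non-missing value parses as a number."""
    observed = [value for value in values if value != "NA"]
    if not observed:
        return False
    for value in observed:
        try:
            float(value)
        except ValueError:
            return False
    return True

def recode_categorical(values, prefix):
    """Add a letter prefix so that the codes contain non-numeric characters.

    Missing values (NA) are left unchanged.
    """
    return [value if value == "NA" else f"{prefix}{value}" for value in values]

# Hypothetical categorical factor that was encoded numerically.
time_points = ["1", "2", "NA", "3"]

if is_numeric_column(time_points):
    time_points_recoded = recode_categorical(time_points, "T")
else:
    time_points_recoded = time_points

print(time_points_recoded)
```

Truly numeric factors such as age or BMI should of course stay numeric; the check is only meant for columns that are intended to be categorical.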

Example: Assume a case/control study in which gender and age are potential confounding factors. To explore if case/control status controlled for gender and age explains variation in the data matrix, define case/control status, gender and age as factors in the meta data file. Case/control status can additionally be defined as primary group. Then, run a multivariate analysis, such as RDA++, CCA++ or multivariate regression.

Meta data file format

  • The first row of the meta data file is a header row. The names of each column can be set by the user, but the order of the columns must follow the above specification.
  • Fields must not be quoted.
  • After creating and saving the file in a spreadsheet program such as Excel, open the file in Notepad or WordPad to check the format (e.g. ensure that values are not quoted and that the file is comma, tab or semicolon separated).
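The manual check in a text editor can be complemented by a small script. The following sketch, under the assumption that the file content has already been read into a string, guesses the delimiter from the header row, reports quoted fields and reports rows whose number of fields differs from the header. The function name and the example strings are hypothetical, and GMine itself may apply different or additional checks.

```python
def check_delimited_text(text):
    """Return (delimiter, problems) for the content of a delimited text file."""
    problems = []
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return None, ["file is empty"]
    if any('"' in line for line in lines):
        problems.append("quoted fields found")
    header = lines[0]
    # Guess the delimiter as the candidate that occurs most often in the header.
    counts = {d: header.count(d) for d in ("\t", ",", ";")}
    delimiter = max(counts, key=counts.get)
    if counts[delimiter] == 0:
        problems.append("no tab, comma or semicolon found in header row")
        return None, problems
    width = len(header.split(delimiter))
    # Rows are numbered from 1 (header), ignoring blank lines.
    for row_number, line in enumerate(lines[1:], start=2):
        if len(line.split(delimiter)) != width:
            problems.append(f"row {row_number} does not have {width} fields")
    return delimiter, problems

good = "SampleID;Group\nS1;case\nS2;control\n"
bad = 'SampleID,Group\n"S1",case\nS2\n'

print(check_delimited_text(good))
print(check_delimited_text(bad))
```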

Example meta data file in V6 format

[Figure: DSMeta.png]

Data matrix

The data matrix provides measurements for multiple features, for example antibody response against antigens, gene expression values, signal intensities or methylation levels.

The data matrix is a simple n x m matrix in text format. The file can either be comma, tab or semicolon separated. Rows represent features (e.g. genes), columns represent samples. The file can be created in spreadsheet programs such as Excel.

A header line must be provided as first row listing the ids of all included samples. These ids must match the sample ids defined in the meta data file. The first column represents the feature names, e.g. gene names, protein ids, or probe ids.

Additionally, an optional column providing meta information for each feature can be added. This column must be named "META".

After creating and saving the file in a spreadsheet program such as Excel, open the file in Notepad or WordPad to check the format (e.g. ensure that values are not quoted and that the file is comma, tab or semicolon separated).
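Because every sample in the data matrix has to be defined in the meta data file, it can be useful to compare the two files before upload. The sketch below lists the sample ids from the data matrix header that are missing from the first column of the meta data file. It assumes that both files use the same delimiter and that the first header cell of the data matrix labels the feature column; the function name and the example content are hypothetical.

```python
def samples_missing_from_metadata(matrix_text, metadata_text, delimiter="\t"):
    """Return the data matrix sample ids that are not defined in the meta data."""
    matrix_header = matrix_text.splitlines()[0].split(delimiter)
    # Skip the feature name column and the optional META column.
    matrix_samples = [name for name in matrix_header[1:] if name != "META"]
    # The first column of the meta data file holds the sample ids;
    # the first row is the header row and is skipped.
    metadata_ids = {
        line.split(delimiter)[0]
        for line in metadata_text.splitlines()[1:]
        if line.strip()
    }
    return [sample for sample in matrix_samples if sample not in metadata_ids]

# Hypothetical file contents.
matrix_text = "Feature\tMETA\tS1\tS2\tS3\nGeneA\tkinase\t1.2\t3.4\t0.7\n"
metadata_text = "SampleID\tGroup\nS1\tcase\nS2\tcontrol\n"

print(samples_missing_from_metadata(matrix_text, metadata_text))
```

An empty result means that every sample in the data matrix has a row in the meta data file.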

Example data matrix

[Figure: DSDeta.png]

Data upload

Data is uploaded via the Data upload page. Various normalization and data transformation methods are provided, including quantile normalization, variance stabilization and calibration for microarray data (vsn), centered log ratio, log, asinh or square root transformation.

Data normalization and transformation

Quantile normalization is widely used for the normalization of microarrays and other biomolecular data.

The vsn algorithm was originally developed for gene expression microarrays, but is now also widely used for the normalization of protein array data. It corrects for the dependency between intensity and variance, which is observed in many cases and can degrade the analysis results. The vsn algorithm transforms the data such that the variance remains nearly constant over the whole intensity range.

Centered log ratio is one of the most widely used transformations for compositional data, which arise in many fields including biology, chemistry, geology, archaeology and economics. A compositional data vector is a special type of observation in which the elements of the vector are non-negative and sum to a constant. The components usually show the relative weight or importance of a set of parts in a total, for example the relative abundance of observed species. Compositional data have intrinsic properties, such as the "constant-sum constraint", which can distort the results of statistical analysis. Applying standard statistical procedures to compositional data therefore requires an appropriate transformation that removes the non-independence of the data points.
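To make the centered log ratio concrete, the following minimal sketch applies its standard textbook definition to one compositional vector: each component is log transformed and the mean of the log values is subtracted. It assumes strictly positive values; how GMine treats zeros is not described here, and the function name and example values are hypothetical.

```python
import math

def centered_log_ratio(composition):
    """Return log(x_i) minus the mean of all log(x_j) for each component."""
    logs = [math.log(value) for value in composition]
    mean_log = sum(logs) / len(logs)
    return [log_value - mean_log for log_value in logs]

# Hypothetical relative abundances of three species in one sample.
relative_abundance = [0.5, 0.3, 0.2]
clr_values = centered_log_ratio(relative_abundance)
print(clr_values)
```

The transformed values sum to zero, and multiplying all components by the same constant does not change the result, so the transformation does not depend on the constant to which the composition sums.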

The Scale option standardizes the data. For each feature (row of the data matrix), the values are scaled to mean = 0 and variance = 1 using the R scale() function.
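The arithmetic behind this scaling can be sketched in a few lines. The example below is an illustration in Python rather than the R code used by GMine: each row is centered on its mean and divided by its sample standard deviation (denominator n - 1), which is what R's scale() does with its default settings. The function name and example values are hypothetical, and rows with constant values are not handled.

```python
import statistics

def scale_rows(matrix):
    """Scale each row (feature) to mean 0 and sample variance 1."""
    scaled = []
    for row in matrix:
        mean = statistics.mean(row)
        sd = statistics.stdev(row)  # sample standard deviation, denominator n - 1
        scaled.append([(value - mean) / sd for value in row])
    return scaled

# Hypothetical data matrix: two features (rows) measured in three samples.
data = [
    [1.0, 2.0, 3.0],
    [10.0, 30.0, 20.0],
]
scaled_data = scale_rows(data)
print(scaled_data)
```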