Genomics Data Miner





The Genomics Data Miner (GMine) server is freely available at: http://cgenome.net/gmine/

Genomics Data Miner is easy-to-use online software that allows non-expert users to mine, cluster and compare multidimensional biomolecular datasets. Various visualization techniques are provided, generating high-quality figures that can readily be used in scientific articles. Analyses are run via a broad range of algorithms for clustering, classification and statistical testing.


GMine is suitable for the analysis of genomics, metagenomics, transcriptomics and proteomics datasets with several hundred to a few thousand features (e.g. protein arrays or NanoString expression data). The software does not support the analysis of datasets with tens of thousands of features, such as genome-wide expression arrays. However, subsets of genome-wide expression data can be analysed in GMine, e.g. the expression of all immune-related genes.


StartUsingGMine.png


If you find this software useful, please help us by sharing GMine via social media.

Screenshots


DS1.png DS2.png NetworkAnalysis.png DS4.png DS3.png DS5.png DS6.png GMineFALogo.png





Disclaimer

This program is provided in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. The software may be used at your own risk. If you decide to use GMine in published work, it is YOUR responsibility to ensure the correctness and consistency of the data.

Introduction

GMine is a powerful, yet easy to use, tool for the higher-level analysis of biomolecular data. The software has been developed with a focus on protein microarrays, but can be used for any n x m data matrix (with n x m < 2M), where the columns correspond to samples (patients, controls, time points etc), the rows represent features which are the measured outcomes (e.g. genes, proteins, genomic regions) and the values of the data matrix represent intensities, frequencies or counts (e.g. signal intensities, expression values, methylation levels or number of observations).

In the field of infectious diseases, protein microarrays are widely used for measuring host immune response to proteins of infectious agents, including malaria parasites, schistosomes or viruses. Protein microarrays allow measurement of antibody response (e.g. IgE or IgG) against the entire proteome of an infectious agent. This yields large and complex datasets with signal intensities measured for thousands of proteins and hundreds of samples. Antibody response can be influenced by multiple factors, including gender, age, geographic location, or infection history. The analysis of this type of data therefore usually requires advanced statistical methods that can account for the various contributing factors and infer complex associations between antibody response, clinical variables and other co-variates.

GMine provides a broad range of data-mining techniques that allow users to perform quantitative visualizations (e.g. boxplots, bubble plots and heatmaps), parametric and non-parametric statistical testing, univariate and multivariate analysis, supervised learning, correlation networks, clustering and multivariate regression. The software enables lab-based researchers, who may be unfamiliar with advanced statistical software packages, to use these complex types of analysis routinely in their work. The software generates publication-quality images in PDF, SVG or PNG format.

Tutorial

Input Data

As input, GMine requires an n x m data matrix and an annotation file providing metadata for each sample. When generating your own data matrix, follow the format of the example data matrix file given in the demo project.

Data Matrix

The data matrix contains measurements (e.g. signal intensities) recorded for multiple features (e.g. antibodies, genes, proteins). The values are numeric and continuous.

Meta annotation file

The annotation file provides meta information for each sample, including the sample group (e.g. case/control). Additionally, multiple optional factors (also called environmental parameters) can be provided, which are explanatory variables. Factors can be independent variables (frequently manipulated by the experimenter, e.g. case/control) or confounding factors (e.g. age, gender, BMI). Factors can be discrete (e.g. case/control, gender, geography, treatment response) or numeric (e.g. age, days after treatment, BMI, blood glucose level). Complex associations between multiple factors and measurements (features) can be inferred using multivariate statistical methods. For example, the question "how do age, BMI and 'days-of-treatment' relate to or predict the measured outcome variable (feature) anti-xyz antibody level" can be answered.
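The relationship between the two input files can be illustrated with a small R sketch. All names and values below (feature names, sample IDs, column names) are hypothetical and only show the general shape of the data. The exact column layout GMine expects is described on the data upload wiki page, not here.

```r
# Hypothetical toy input: 3 features (rows) x 6 samples (columns).
set.seed(1)
data_matrix <- matrix(
  runif(18, min = 0, max = 40000),
  nrow = 3,
  dimnames = list(c("antigenA", "antigenB", "antigenC"), paste0("S", 1:6))
)

# One annotation row per sample: a sample group plus optional factors.
# Column names are illustrative assumptions, not GMine's required headers.
annotation <- data.frame(
  sample = colnames(data_matrix),
  group  = c("case", "case", "case", "control", "control", "control"),
  age    = c(3, 7, 21, 4, 9, 25),          # numeric factor
  gender = c("f", "m", "f", "m", "f", "m"), # discrete factor
  stringsAsFactors = TRUE
)

# Every column of the matrix needs exactly one matching annotation row.
stopifnot(identical(colnames(data_matrix), as.character(annotation$sample)))

# Number of values in the matrix (n x m).
n_values <- nrow(data_matrix) * ncol(data_matrix)
```

Checking that sample IDs match between the two files before uploading is a cheap way to catch ordering mistakes.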

Start demo project

Before taking this tutorial please upload an example dataset. The tutorial provides links to results pages of GMine. These links won't be functional if no data is uploaded.

To start the demo project:

  • Go to the Data Upload Page
  • Press Start Demo Project to automatically upload an example data and meta annotation file


The data of the demo project was generated in a prospective study investigating host immune response to Plasmodium falciparum (Crompton et al. PNAS 2010). Crompton et al. used a protein microarray consisting of 2,320 probes (representing ~23% of the P. falciparum proteome) to profile host immune response (IgG) against malaria parasite proteins. The cohort comprises 220 individuals in the age ranges 2-10 years and 18-25 years from Kabila, Mali. Samples were collected before (May) and after (December) the 6-month malaria season. The dataset was kindly provided by the authors.

Reference: Crompton et al. A prospective analysis of the antibody (Ab) response to Plasmodium falciparum before and after a malaria season by protein microarray. Proc. Natl Acad. Sci. USA. 2010;107:6958-6963

Demo project data matrix

Columns represent samples and rows represent malaria parasite proteins (antigens or features). The matrix stores signal intensities, representing host immune response (IgG antibody response) against malaria proteins. A total of 155 subjects, 310 samples (2 samples per subject) and 249 malaria proteins were included in the data matrix of the demo project. Note that feature names (in this case the antigens) should not contain spaces or special characters like '+' and '-'. See the data upload Wiki page for details.
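Feature names can be checked before upload with a few lines of R. This is only a sketch: the example names are made up, and the set of "safe" characters (letters, digits, underscore and dot) is an assumption, since the text above only rules out spaces and special characters such as '+' and '-'.

```r
# Hypothetical feature names; the last two break the naming rule.
feature_names <- c("PF10_0138", "MAL13P1.56", "PFB0915w", "anti IgG", "MSP1+2")

# Assumption: only letters, digits, underscore and dot are treated as safe.
is_valid <- grepl("^[A-Za-z0-9_.]+$", feature_names)

# Replace every other character with an underscore.
clean_names <- gsub("[^A-Za-z0-9_.]", "_", feature_names)
```

After cleaning, it is worth confirming that the names are still unique, because two different names can collapse to the same cleaned name.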

Demo project data matrix

DSDeta.png

Demo project meta data file

The meta data file was created in Excel. Primary group (4th column) represents the age group of included subjects (age group 1: 2-4 years, age group 2: 5-10 years, age group 3: 18-25 years). Two samples were collected from each subject, one before the malaria season (May) and one after the malaria season (December). This variable was included as secondary group (5th column). The following factors were included for multivariate analysis:

  • parasitemic: 0/1; indicates whether subjects had microscopically detectable parasites at the time of sampling (1) or were parasite free (0)
  • hemoglobin type: AA, AC, CC, NA (not available); hemoglobin type of subjects
  • gender: male/female
  • malaria episode: number of clinical malaria episodes between May and December (range 0-5)
  • days until first episode: days until the first episode after the first sample collection in May (range 29-242). For example, a value of 242 indicates that a subject did not experience a clinical malaria episode during the first 8-month period after the first sample was taken in May.

See the data upload Wiki page (link below) and follow the guidelines for setting up the required columns of a meta data file.


Demo project meta data file

DSMeta.png



Download the example data file representing signal intensities of 250 proteins in 224 individuals (310 samples in total) and the meta data file providing meta information for each sample.

Upload own data files

The data upload wiki page provides detailed information on how to upload and normalise your own data files.

Output

GMine generates high-quality figures in PNG, PDF or SVG format, which can readily be used in publications. Resolution, width, height and colour palette can be specified. Generated SVG images can be modified in vector graphics editors, such as Inkscape. This allows changing the font size or colours and adding labels or other features. GMine also presents results in sortable tables, which can be downloaded in comma-separated format.

Main Navigation Menu

GMine is accessible via a free web-server, which provides a user-friendly interface to advanced statistical methods and data-mining algorithms. The top menu provides links to the various data analysis pages, a help page and a tutorial.


DSMenue.png

Quantitative visualisation of measurements

Data is visualized quantitatively using heatmaps, bubble plots, scatter plots, stripcharts, bar charts or boxplots. The Sample Page uses bubble plots, heatmaps and bar charts to visualise measurements by square size, colour code and bar height, respectively. The data is visualized per sample. The order in which samples are displayed (e.g. the order in which they appear in the data file, or by primary group as defined in the meta annotation file) can be selected with the 'Order' option. When visualizing data in heatmaps, GMine allows trimming of outliers, selection of a wide range of color palettes and adjustment of the color range center.

Generate Boxplot

  • Open the Sample Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set chart type to Boxplot
  • Choose a colour palette
  • Set Filter to 0 to include all features (malaria proteins) in the box plot
  • Press Draw Chart
  • To obtain high-quality figures for publication, set figure resolution, width and height.

The following two figures are obtained for the top 50 malaria proteins of the demo project (first figure) and all proteins of the demo project (second figure). The x-axis represents samples, the y-axis represents host immune response to malaria proteins, and the boxes represent the distribution of host immune response. Samples are coloured by age group. The figures suggest that immune response to malaria proteins increases with age from age group 1 (2-4 years of age) to age group 2 (5-10 years of age) to age group 3 (18-25 years of age).


DSBoxplot.png

DSBoxplotFull.png

Generate Barchart

  • Open the Sample Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set chart type to Barchart
  • Select the order of samples in the generated figure, e.g. by primary group (groupP), subject (Pair) or secondary group (groupS)
  • Choose a colour palette
  • Set Filter to 20 to only show the top 20 features with the highest mean across all samples
  • Press Draw Chart
  • To obtain high-quality figures for publication, set figure resolution, width and height. Increase the width and/or height if labels are displayed only partially or overlap, or if the legend and chart overlap
  • Colors are randomly assigned to each feature. Press Draw Chart again to re-assign colours.

The following bar chart is obtained for the demo project. Bar charts are useful for presenting datasets with a low number of features, but provide only limited value for large datasets.

DSBarChart.png

Generate Heatmap

  • Set chart type to Heatmap+
  • Select BlueGoldRed colour palette
  • Set Filter to 0 to include all features
  • Select the ReOrderSamples checkbox. If ReOrderSamples is selected, samples will be ordered by hierarchical clustering. Otherwise they will be ordered as selected in the drop down menu Order.
  • Unselect the Scale data checkbox. If scale data is selected, values represented by heatmap rows will be scaled in range 0-1.
  • Set image resolution to 200, width to 530 and height to 200. These settings can be used to obtain high-quality figures for publication. Width and/or height can be increased if labels are displayed only partially or overlap, or if the legend and chart overlap
  • Press Draw Chart

Rows of the generated heatmap represent features, columns represent samples. Colour bars above the heatmap represent values of the factors specified in the meta annotation file.

DSHeatMap1.png

Next:
  • Set Trim values to 40,000. All values above 40,000 will be trimmed to 40,000. This can be used to trim outliers.
  • Press Draw Chart
Next:
  • Set Color centre to 6,000. Color centre changes the distribution of assigned colours and allows shifting the colour palette. This can be used to increase the resolution for low data values. The intermediate colour of the chosen colour palette will be assigned to the entered value. For example, if BlueGoldRed is chosen as colour palette, and Colour centre is set to 6,000, then gold is assigned to values of around 6,000; blue is assigned to values<6,000 and red is assigned to values >6,000.
  • Press Draw Chart

The following figure is obtained for the example dataset: DSHeatmap.png

Next:
  • Choose BlueGreenYellow colour palette
  • Press Draw Chart

The following figure is obtained for the example dataset: DSHeatmap2.png


Next:
  • Redo the analysis using different colour palettes and different cut-offs for Trim and Colour centre.
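The effect of Trim values and Color centre can be sketched in a few lines of base R. This is an illustration of the idea only, under the assumption that half of the colour steps are spread below the centre and half above it. The intensities are made up, and GMine's actual implementation may differ.

```r
# Hypothetical signal intensities.
values <- c(120, 3500, 6000, 15000, 52000, 90000)

# Trim: cap everything above the threshold at the threshold.
trim_at <- 40000
trimmed <- pmin(values, trim_at)

# Colour centre: half of the colour bins cover [min, centre], the other half
# cover [centre, max], so low values get a finer colour resolution.
centre <- 6000
n_half <- 50
breaks <- c(seq(min(trimmed), centre, length.out = n_half + 1),
            seq(centre, max(trimmed), length.out = n_half + 1)[-1])
palette <- colorRampPalette(c("blue", "gold", "red"))(length(breaks) - 1)

# Assign each value to a colour bin and look up its colour.
bin <- cut(trimmed, breaks = breaks, include.lowest = TRUE, labels = FALSE)
colours <- palette[bin]
```

With these settings the two outliers share the top colour, and values below 6,000 are spread over as many colour steps as all values above it.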

Details

Click Details for description of heatmap, heatmap+, box plot and tables

Compare measurements across biological conditions

Switch to the Group Page. This page offers commonly used statistical tests for comparing two or more groups. It plots the features that differ significantly between groups and provides descriptive box plots, bar charts etc. by group (unlike the Sample Page, which plots by sample).

Statistical tests - Measurements across sample groups are compared by parametric and non-parametric statistical tests, including ANOVA, nested ANOVA, t-test, paired t-test, Bayesian t-test, Wilcoxon rank test and Mann-Whitney U test (select the test that is appropriate for your data). Significantly different features are shown as box plot, bar chart or strip chart. Measurements can for example be gene expression values, signal intensities or methylation levels.

Features associated with different biological conditions can further be identified using the linear discriminant analysis (LDA) effect size method (LEfSe). LEfSe has been developed with a focus on metagenomic analysis and determines the features most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and effect relevance.

Tutorial

  • Open the Group Page. Make sure that the Demo Project has been loaded, as described above.
  • Select sample group (either primary or secondary group; these groups are defined in the meta information file uploaded by the user). The selected groups will be used for the comparison. Secondary group must be specified when using a nested ANOVA test (see details link below)
  • Press SelectMode
  • Set the plot type. AnovaPlot compares measurements (signal intensities for the example dataset) by Anova and visualizes features with significantly different measurements as bar chart, with standard deviation as error bars. RankTest compares data values by non-parametric rank test (Kruskal-Wallis). Significantly different features are visualized as bar chart. Error bars depict standard deviation. Significance of differences is depicted as: * (p<0.05), ** (p<0.01) and *** (p<0.001)
  • Set a significance threshold. Only features that are significantly different with a p-value below this threshold are shown
  • Press DrawChart

Click Details for details on ANOVA, nested ANOVA, Ranktests etc. The statistical test used should be appropriate for the data being tested. You will need to have some basic knowledge about what type of test is appropriate for your data. This will depend on your study design, the number of groups you are comparing, what kind of outcome variables (features) you have measured etc.
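What AnovaPlot and RankTest compute per feature can be approximated with base R. The sketch below runs a one-way ANOVA and a Kruskal-Wallis test on every row of a made-up data matrix and maps the p-values to the star notation described above. The data, group labels and helper names are hypothetical, and this is not GMine's own code.

```r
set.seed(42)
# Hypothetical data: 4 features x 30 samples, three groups of 10 samples.
group <- factor(rep(c("g1", "g2", "g3"), each = 10))
x <- matrix(rnorm(4 * 30, mean = 1000, sd = 100), nrow = 4,
            dimnames = list(paste0("feature", 1:4), NULL))
# Make feature1 differ clearly between the groups.
x["feature1", ] <- x["feature1", ] + rep(c(0, 300, 600), each = 10)

# One-way ANOVA p-value for a single feature.
anova_p <- function(v) summary(aov(v ~ group))[[1]][["Pr(>F)"]][1]
# Kruskal-Wallis rank test p-value for a single feature.
rank_p <- function(v) kruskal.test(v, group)$p.value

p_anova <- apply(x, 1, anova_p)
p_rank  <- apply(x, 1, rank_p)

# Map p-values to stars: * (p<0.05), ** (p<0.01), *** (p<0.001).
p_to_stars <- function(p) {
  as.character(cut(p, breaks = c(-Inf, 0.001, 0.01, 0.05, Inf),
                   labels = c("***", "**", "*", ""), right = FALSE))
}

# Features passing a significance threshold of 0.05.
significant <- names(p_anova)[p_anova < 0.05]
```

The choice between the two tests follows the same logic as in the tutorial: ANOVA assumes roughly normal residuals, whereas the rank test does not.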

Visualize measurements of individual features

Switch to the Feature Page. Measurements of individual features across sample groups are visualised by box plots or stripcharts. Significance of differences can be tested using parametric or non-parametric statistical tests (select a test that is appropriate for your data).

The Feature tab is similar to the Sample tab but allows selection of an individual feature (e.g. a gene or protein) for which a detailed plot is drawn. Unlike the sample and group plots, which show the distribution of values of all features in relation to the sample groups, this tab lets you select any ONE feature and visualize the distribution of its data in relation to the groups defined in the meta annotation file.


Tutorial

  • Open the Feature Page. Make sure that the Demo Project has been loaded, as described above.
  • Select the plot type (stripchart or box plot) via the Type drop down menu
  • Select a feature of interest via the Feature drop down menu
  • Press Draw Chart

Outliers can be removed from the plot by setting the Remove outliers option. If a value >0 is specified, all measurements above this value are excluded from the plot (they are however included in the calculation of the median/mean and quartiles).

Statistical comparison of biological conditions

The Stats Page allows an in-depth statistical comparison of measurements across sample groups. Sample groups are compared by parametric and non-parametric statistical tests, including ANOVA, nested ANOVA, t-test, paired t-test, Bayesian t-test, Wilcoxon rank test and Kruskal-Wallis test. P-values are adjusted for multiple testing by FDR or Bonferroni correction.


Tutorial

  • Open the Stats Page. Make sure that the Demo Project has been loaded first, as described above.
  • Select a statistical test, e.g. Anova or rank test. Rank test compares measurements by Wilcoxon rank test for two sample groups and by Kruskal-Wallis test for more than two sample groups
  • Set Filter to 50. Only the top 50 features will be displayed. To display all features set Filter to 0.
  • Press Select Mode
  • Select sample group (either primary or secondary group; these groups are defined in the meta information file uploaded by the user)
  • Press Do stats
  • Click on the output table header to sort the table by p-value, mean, median etc. For example, if Filter is set to 50 and you click on the p-value header, the table is re-sorted to display the top 50 features with the lowest p-values.

Results of the statistical analysis are presented as a table. Shown are p-values, Bonferroni-corrected p-values, the false discovery rate (Benjamini-Hochberg) [1] and the mean in each group.

The distribution of p-values is presented as a histogram and a quantile-quantile (QQ) plot. In the lower left figure, the uniform (expected) p-value distribution is indicated by the red line 'Expected'. QQ plots characterize the extent to which the observed distribution of the test statistics follows the expected (null) distribution. This allows the detection of evidence for systematic bias.

PValueDistribution.png
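The two corrections and the QQ plot coordinates can be reproduced with base R. The sketch below uses simulated p-values drawn from the null distribution, so no real effects are present. The plotting calls are left as comments, and the variable names are illustrative.

```r
set.seed(7)
# Hypothetical raw p-values for 200 features (drawn from the null here).
p_raw <- runif(200)

# Bonferroni and Benjamini-Hochberg (FDR) adjustment.
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_fdr  <- p.adjust(p_raw, method = "BH")

# QQ plot coordinates: observed versus expected -log10(p) under the null.
observed <- -log10(sort(p_raw))
expected <- -log10(ppoints(length(p_raw)))

# plot(expected, observed); abline(0, 1, col = "red")
# hist(p_raw, breaks = 20)   # roughly flat when the null holds
```

Points that rise above the diagonal across the whole range, rather than only in the upper tail, are the pattern that usually hints at systematic bias.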

Random forest classification

Random forest classification can be applied to examine complex associations between multiple features (e.g. immune response to malaria parasite proteins) and a study variable (e.g. protected from malaria infection).

The following tutorial explains how to run a random forest analysis to identify proteins predictive of age group.

  • Open the Stats Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set statistical test to Random forest
  • Press Select Mode
  • Set Filter to 20; to reduce computing/waiting time, only the top 20 proteins will be included
  • Set Group by to Primary group. In the demo project, primary group represents age group.
  • Press Do stats
  • Click on the table header to sort table by Score (Mean Decrease Accuracy)

Malaria proteins most predictive of age group are now shown in the top of the table.

Random forest is implemented in R:

  library(randomForest)
  rf <- randomForest(as.factor(groups) ~ x, importance = TRUE, proximity = FALSE, ntree = 10000, mtry = 20)
  im <- as.data.frame(importance(rf))
  score <- im$MeanDecreaseAccuracy

where groups represent the selected sample group and x represents the data matrix.

Details

Details

Multivariate analysis

The Multivariate Page facilitates multivariate data visualisation and multivariate statistical testing. Multivariate statistics are powerful techniques that can identify complex associations between measurements (data matrix) and multiple factors. In GMine, complex associations can be examined by the multivariate methods principal component analysis (PCA), redundancy analysis (RDA), canonical correspondence analysis (CCA), detrended correspondence analysis (DCA), non-metric multidimensional scaling (NMDS), hierarchical clustering, heatmaps, correlation networks and multivariate regression. Correlation networks visualize the positive and negative associations between features, between factors, and between features and factors.


To explore complex associations between measurements and multiple factors using multivariate statistics, set Type to Heatmap+, RDA+ or CCA+. If RDA+ or CCA+ is chosen, all factors defined in the meta annotation file are included. For each factor a p-value is computed, indicating whether the factor is significantly associated with the measurements (i.e. whether the factor significantly explains variation in the values of individual features (RDA+) or variation in sample distances (CCA+)).

Correlation heatmap

Pearson's correlations between factors and features (measurements) are shown as a heatmap.

  • Set Type to Heatmap+
  • Press Select Mode
  • Press Draw Chart

In the figure below, positive correlations are shown in red, negative correlations in blue. DSCorrelationHeatmap.png
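The matrix behind such a correlation heatmap can be computed with cor() in base R. The sketch below uses simulated features and two hypothetical numeric factors. How GMine treats discrete factors is not covered by this example.

```r
set.seed(3)
n_samples <- 40
# Hypothetical numeric factors, one row per sample.
factors <- data.frame(age = runif(n_samples, 2, 25),
                      episodes = rpois(n_samples, 1.5))
# Hypothetical features (rows) x samples (columns); featureA tracks age.
features <- rbind(
  featureA = 500 * factors$age + rnorm(n_samples, sd = 500),
  featureB = rnorm(n_samples, mean = 3000, sd = 800)
)

# cor() correlates columns, so the feature matrix is transposed first.
r <- cor(t(features), as.matrix(factors), method = "pearson")

# heatmap(r, scale = "none",
#         col = colorRampPalette(c("blue", "white", "red"))(50))
```

The result has one row per feature and one column per factor, which matches the layout of a feature-by-factor heatmap.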

Principal component analysis (PCA)

PCA visualises the data as a 2D plot, which helps to identify sample clusters and potentially problematic samples (outliers) that may need to be excluded from downstream analysis.

  • Open the Multivariate Page. Make sure that the demo project has been loaded first, as described above.
  • Set Type to PCA
  • Press Select mode
  • Set colour to blueYellowRed
  • Set Group/Colour by to Primary Group. Samples in the PCA plot will then be coloured according to their primary group.
  • Set Hull to Filled Spider. In the PCA plot, samples of one group will be connected by lines as a spider plot.
  • Press Draw Chart

The following PCA plot is obtained for the demo project. Samples seem to cluster by primary group (age group). To test if this grouping is significant, run a CCA (next section).

DSPCA.png
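A comparable PCA can be sketched with prcomp() from base R. The data and group labels below are simulated, and centring and scaling each feature is an assumption about preprocessing rather than a documented GMine setting.

```r
set.seed(11)
# Hypothetical data: 20 features x 30 samples in two groups.
group <- factor(rep(c("young", "adult"), each = 15))
x <- matrix(rnorm(20 * 30), nrow = 20)
# Shift the first 10 features in the "adult" samples.
x[1:10, group == "adult"] <- x[1:10, group == "adult"] + 3

# prcomp() expects samples in rows, so the matrix is transposed.
pca <- prcomp(t(x), center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]
explained <- pca$sdev^2 / sum(pca$sdev^2)

# plot(scores, col = as.integer(group), pch = 19,
#      xlab = sprintf("PC1 (%.0f%%)", 100 * explained[1]),
#      ylab = sprintf("PC2 (%.0f%%)", 100 * explained[2]))
```

Samples that sit far from every group in the score plot are the candidates for outlier inspection.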

Canonical correspondence analysis (CCA)

The PCA plot presented above indicates that samples cluster by primary group (age group). The statistical significance of the observed clustering can be tested by CCA.

  • Set Type to CCA
  • Press Select mode
  • Set Group/Colour by to Primary Group. A CCA will be run for the selected group.
  • Set distance metric to Euclidian.
  • Press Draw Chart

CCA is a multivariate method that is used to explore complex associations between measured variables and multiple explanatory variables. CCA tests if variations in the data matrix can be explained by the selected sample group (the sample group selected under the drop down menu Group/Colour by).

Two figures are generated. The first shows a 2D ordination plot, indicating how well samples can be separated according to the sample groups. The second plot provides a p-value, indicating whether the sample group significantly explains variations in the sample distances, or in other words whether samples cluster significantly by sample group.

The following result is obtained for the demo project.

According to these results, age group (the primary group) significantly explains variations observed in antibody response (p=0.001).


DSCCA2.png

DSCCAP2.png

Principal coordinates analysis (PcOA)

PcOA visualises the data as a 2D plot, which helps to identify sample clusters and potentially problematic samples (outliers) that may need to be excluded from downstream analysis.

  • Open the Multivariate Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Type to PcOA
  • Press Select mode
  • Set distance metric to Euclidian. Measurement profiles can be compared by a wide range of distance metrics, including Euclidian, Manhattan, inverse Pearson's correlation and Bray-Curtis index.
  • Press Draw Chart

DSPcOA.png
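Principal coordinates analysis corresponds to classical multidimensional scaling, which base R provides as cmdscale(). The sketch below computes the four distance metrics named above on simulated data. Reading "inverse Pearson's correlation" as 1 - r is an assumption, and the Bray-Curtis helper is a hypothetical function written for this example.

```r
set.seed(5)
# Hypothetical non-negative data: 15 features x 12 samples.
x <- matrix(rexp(15 * 12, rate = 1 / 1000), nrow = 15,
            dimnames = list(NULL, paste0("S", 1:12)))
samples <- t(x)  # dist() compares rows, so samples go in rows

d_euclid    <- dist(samples, method = "euclidean")
d_manhattan <- dist(samples, method = "manhattan")
# Assumption: "inverse Pearson's correlation" is taken as 1 - r.
d_pearson   <- as.dist(1 - cor(x))

# Bray-Curtis is not in base R; this helper implements the usual formula
# sum(|a - b|) / sum(a + b) for every pair of samples.
bray_curtis <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) for (j in seq_len(n)) {
    d[i, j] <- sum(abs(m[i, ] - m[j, ])) / sum(m[i, ] + m[j, ])
  }
  as.dist(d)
}
d_bray <- bray_curtis(samples)

# Classical multidimensional scaling on the chosen distance matrix.
pcoa <- cmdscale(d_euclid, k = 2, eig = TRUE)
coords <- pcoa$points
```

Swapping d_euclid for one of the other distance objects changes the ordination without touching the rest of the code.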

Redundancy analysis (RDA+)

RDA+ includes all factors defined in the meta annotation file.

  • Set Type to RDA+
  • Press Select mode
  • Set colour to blueYellowRed
  • Set Filter to 50; only the top 50 antigens will be included to speed up processing/waiting time
  • Press Draw Chart

RDA is a multivariate method that is used to explore complex associations between measured variables and multiple factors. All factors defined in the meta annotation file are included in the multivariate analysis. Two figures are generated. The first shows a 2D ordination plot, indicating how well samples can be separated according to the defined factors. Samples will be coloured by the variable selected under Group/Colour by. The p-values reported in the second figure indicate whether each factor is significantly associated with variation in the data matrix (i.e. whether the factor significantly explains variation in sample distances).

The following result is obtained for the example dataset. Parasitemic (positive for malaria parasites), Malaria.Episodes.May.December (the number of malaria episodes between May and December) and Days.Until.First.Episode are significantly associated with variation in immune response (p<0.05). Gender and hemoglobin type are not associated with host immune response to malaria proteins.

DSRDA.png

DSRDAP.png

Locally linear embedding (LLE)

LLE is a method for data reduction that is capable of generating highly nonlinear embeddings [2]. LLE has been implemented in R using the lle package.

T-Distributed Stochastic Neighbor Embedding (t-SNE)

t-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton [3]. In GMine, t-SNE is used to present data in a two dimensional plot. Similar objects are modeled by nearby points and dissimilar objects are modeled by distant points. t-SNE has been implemented using the tsne R package.

Support vector machine (SVM) classification

The discriminatory power of the uploaded data to distinguish between two sample groups (e.g. cases vs controls) can be examined using a Support Vector Machine (SVM) evaluated by leave-one-out cross-validation. In other words, SVM leave-one-out cross-validation (SVM LOOC) can be used to assess if measurements of the data matrix are predictive of sample groups. The classification performance is described by overall accuracy, sensitivity and specificity.

For example, using the demo project SVM LOOC can be employed to examine if host immune response is predictive of sample time point (e.g. before and after malaria season) or if subjects are protected from malaria infection.

The chosen sample group must have exactly two different values, e.g. case/control, protected/unprotected, male/female.

SVM LOOC works iteratively. In each step, one sample is excluded and an SVM is trained to discriminate between the two classes. The trained SVM is then applied to predict the class of the excluded sample. This is repeated for each sample. Finally, the predicted class is compared with the known class of each sample to calculate the classification accuracy (percentage of correctly predicted class labels), sensitivity (TP/(TP+FN)) and specificity (TN/(TN+FP)), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.
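The leave-one-out loop and the three performance measures can be sketched in base R. To keep the example free of extra packages, a simple nearest-centroid rule stands in for the classifier. This is explicitly NOT an SVM, and a real SVM (for example from an R package such as e1071) would take its place inside the same loop. Data and labels are simulated.

```r
set.seed(21)
# Hypothetical data: 40 samples (rows) x 5 features, two classes.
labels <- factor(rep(c("control", "case"), each = 20),
                 levels = c("control", "case"))
x <- matrix(rnorm(40 * 5), nrow = 40)
x[labels == "case", ] <- x[labels == "case", ] + 1.5

# Stand-in classifier (NOT an SVM): pick the class with the nearest centroid.
nearest_centroid <- function(train_x, train_y, new_x) {
  centroids <- sapply(levels(train_y), function(cl)
    colMeans(train_x[train_y == cl, , drop = FALSE]))
  dists <- colSums((centroids - new_x)^2)
  names(which.min(dists))
}

# Leave-one-out: train on all samples except i, then predict sample i.
predicted <- sapply(seq_len(nrow(x)), function(i)
  nearest_centroid(x[-i, , drop = FALSE], labels[-i], x[i, ]))

# Confusion counts with "case" as the positive class.
tp <- sum(predicted == "case"    & labels == "case")
tn <- sum(predicted == "control" & labels == "control")
fp <- sum(predicted == "case"    & labels == "control")
fn <- sum(predicted == "control" & labels == "case")

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
```

When both classes have the same number of samples, as here, accuracy is simply the average of sensitivity and specificity.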

Tutorial

  • Set Type to SVM LOOC
  • Press Select mode
  • Set Group/Colour by to Secondary group, to run an SVM LOOC for the secondary group variable (sampling time point before (May) and after (December) the malaria season).
  • Set Filter to 20. SVM LOOC is a time-consuming analysis; limiting the number of included features (malaria proteins) to 20 speeds up the computation.
  • Press Draw Chart

The top 20 antigens predict sampling time point with 69% accuracy. 81% of the samples collected in December were correctly classified (sensitivity = 0.81), as were 57% of the samples collected in May (specificity = 0.57).


DSSVM.png

Multivariable linear regression

The Regression Page facilitates multivariable linear regression. Multivariable linear regression is a powerful technique that can identify complex associations between measurements (data matrix) and multiple factors. Multiple co-variates (variables controlled by the experimenter or confounding variables) can be included in the analysis.

Identify feature-factor associations by multivariate regression

  • Open the Regression Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Regress by to Features vs Factors and press Set Mode
  • Set Filter to 30 to restrict the analysis to the top 30 features with the highest mean across samples
  • Press Run Analysis

The displayed table shows associations between multiple factors (as defined in the meta information file) and features (as provided by the uploaded data matrix). For each feature a regression model is fit, including the feature as dependent variable and all factors as explanatory variables:

feature = fa1 + fa2 + fa3 …,

where fa1, fa2, ... are the factors (all factors defined in the meta annotation file). P-values are shown for each factor-feature combination, indicating the significance of associations. Click on the header to sort the displayed table, e.g. by p-value.
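The per-feature model described above can be sketched with lm() in base R. The factors, feature names and effect sizes below are simulated, and the layout of the resulting table is only an approximation of what GMine displays.

```r
set.seed(8)
n <- 60
# Hypothetical factors, one row per sample.
meta <- data.frame(parasitemic = rbinom(n, 1, 0.5),
                   age = runif(n, 2, 25),
                   gender = factor(sample(c("f", "m"), n, replace = TRUE)))
# Hypothetical features x samples; antigen1 depends on parasitemic.
x <- rbind(antigen1 = 2000 * meta$parasitemic + rnorm(n, 5000, 500),
           antigen2 = rnorm(n, 5000, 500),
           antigen3 = rnorm(n, 5000, 500))

# Fit feature ~ fa1 + fa2 + ... and keep the p-value of every factor term.
fit_one <- function(v) {
  fit <- lm(v ~ ., data = meta)
  coef(summary(fit))[-1, "Pr(>|t|)"]  # drop the intercept row
}
p_table <- t(apply(x, 1, fit_one))    # rows: features, columns: factors

# FDR correction across features, separately for each factor.
p_fdr <- apply(p_table, 2, p.adjust, method = "BH")
```

Sorting p_table by one of its columns mirrors clicking on the corresponding header in the results table.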

The following results table is obtained for the demo project. The table was sorted by the p-values obtained for the factor "parasitemic" (by clicking on Parasitemic.p). Parasitemic (0/1) indicates whether subjects were negative or positive for the malaria parasite. A number of antigens are significantly associated with this factor (p<0.05), indicated in blue. Thirteen antigens are still significantly associated with the parasitemic factor after correction for multiple testing by FDR (column Parasitemic.p.fdr). Significant associations for other factors can be viewed by re-sorting the table.

DSRegressionTable.png

Explore associations between single feature and all factors by multivariable regression

Host immune responses to different malaria proteins are highly correlated. It is therefore not easy to assess whether immune response to a specific protein is protective against malaria. Complex associations of this type can be explored by multivariate regression.

  • Open the Regression Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Regress by to Malaria Episodes May.December to explore associations between the number of malaria episodes between May and December and host immune response to malaria proteins
  • Press Set Mode
  • Press Run Analysis
  • Click on P in the generated table to order data by p-value

For each malaria protein (feature) the table shows the correlation R between host immune response to that protein and the number of malaria episodes between May and December. A p-value is given, indicating whether the observed correlation is significant. Additionally, the table presents the mean signal intensity (immune response) across all samples and the number of positive samples (immune response >0).

The four proteins with the highest absolute correlation coefficient, abs(R), are selected and the association of these proteins with the number of malaria episodes is examined by multivariable linear regression. This model incorporates malaria episodes as dependent variable and the top four malaria proteins as explanatory variables: number of episodes ~ protein 1 + protein 2 + protein 3 + protein 4

The results are presented as a figure. For each included protein the p-value computed by multivariable regression is reported, indicating whether the protein is significantly associated with the number of malaria episodes. Additionally, a scatter plot is shown, plotting the signal intensity of that protein (host immune response) versus the number of malaria episodes between May and December.

DSRegressionEpisodes.png

Details Multivariable Linear Regression


Feature selection

The Feature Selection Page provides standard feature selection methods to identify relevant features associated with an outcome of interest. Feature selection methods are based on the premise that data frequently contains many features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information[4]. Feature selection algorithms identify new feature subsets that best predict an outcome of interest (in this case a factor defined in the meta annotation file).

Available methods:

  • Step-wise regression: step-wise regression implemented using the step() function (R stats package)
  • LASSO regularised regression: LASSO performs both feature selection and regularisation to prevent overfitting. The method is implemented via the cv.glmnet() function from the R glmnet package.
  • Random Forest: Feature selection by random forest, implemented via the randomForest() function from the R randomForest package

Tutorial Feature Selection by Step-Wise Regression

The following tutorial identifies the subset of relevant malaria proteins predicting the number of patient malaria episodes between May and December.

  • Open the Feature Selection (FS) Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Method to Step-wise regression and press Select Mode
  • Set Group by to Malaria Episodes May.Dec
  • Set Direction to Forward & Backward
  • Press Run Analysis

A linear regression model is selected in an iterative approach (stepwise regression, here in both forward and backward direction). The figure provides details of the final, selected regression model: the number of features that were included in the analysis, the number of selected features, and the AIC and Area Under the Curve (AUC) of the final model.

The AIC (Akaike information criterion) is a measure of the relative quality of the regression model and is used to compare and select models. Lower AIC values indicate relatively better models. AIC is a relative estimate of the information lost when a given model is used to represent the data. It deals with the trade-off between the goodness of fit of the model and the complexity of the model [5].
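The trade-off between fit and complexity can be made concrete with a small sketch. It assumes the least-squares form AIC = n * ln(RSS / n) + 2k, where RSS is the residual sum of squares, n the number of samples and k the number of estimated parameters; this form omits an additive constant, which does not matter when comparing models on the same data. All numbers are invented.

```python
import math


def aic_least_squares(rss, n, k):
    """AIC of a least-squares model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k


n_samples = 50
# Model A: 3 parameters, slightly worse fit.
aic_a = aic_least_squares(rss=120.0, n=n_samples, k=3)
# Model B: 6 parameters, slightly better fit.
aic_b = aic_least_squares(rss=118.0, n=n_samples, k=6)

preferred = "A" if aic_a < aic_b else "B"
print(aic_a, aic_b, preferred)
```

Here the three extra parameters of model B reduce the residual error only marginally, so the penalty term outweighs the gain and the smaller model A has the lower AIC.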

The selected features are shown in the centre of the figure. The bar for each feature indicates the AIC the model would have if that feature were dropped. A large relative increase in AIC indicates a large relative importance of the feature. The AIC of the model including all selected features is shown as the top bar and as the dashed red line.

The box plot in the lower left of the figure examines the quality of the selected model, i.e. how well the model represents the original data. For the outcome of interest, the plot shows the original value versus the value predicted by the model. When following the above tutorial, the plot shows the number of malaria episodes vs. the number of malaria episodes predicted by the final regression model (the final regression model includes all selected features).

In GMine, model selection by AIC in a stepwise algorithm is implemented using the R step() function.

Tutorial Feature Selection by Random Forest

  • Open the Feature Selection Page. Make sure that the Demo Project has been loaded first, as described above.
  • Select Random Forest method and press Select Mode
  • Press Draw Chart

The features selected by random forest are shown as a bar plot. Bars represent the importance of each feature (estimated by permutation, as returned by the importance() function from the R randomForest package).

Network analysis

Correlations between features can be presented as a network. Features are shown as nodes, positive correlations as yellow edges and negative correlations as blue edges. Network analysis identifies co-occurring and mutually exclusive features as well as clusters of correlating features.

Tutorial

  • Open the Network Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Type to Network+ to do a network analysis of all features and all factors defined in the uploaded meta information file.
  • Set Color to select a black or white background colour
  • Set Correlation coefficient to either Pearson's or Spearman's correlation coefficient
  • Press Draw Chart


Only correlations larger than Edge Min Similarity or smaller than -1 * Edge Min Similarity are presented as edges.
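The edge rule can be sketched as follows. The function name, the feature names and the correlation matrix are hypothetical; only the threshold rule and the edge colours are taken from the description above.

```python
def network_edges(corr, names, edge_min_similarity):
    """List edges whose correlation exceeds the threshold in absolute value."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = corr[i][j]
            # Keep r > threshold or r < -threshold.
            if r > edge_min_similarity or r < -edge_min_similarity:
                colour = "yellow" if r > 0 else "blue"
                edges.append((names[i], names[j], colour))
    return edges


# Hypothetical correlation matrix for three features.
names = ["A", "B", "C"]
corr = [
    [1.0, 0.8, -0.7],
    [0.8, 1.0, 0.2],
    [-0.7, 0.2, 1.0],
]

edges = network_edges(corr, names, edge_min_similarity=0.5)
print(edges)
```

Raising Edge Min Similarity removes weaker edges and leaves a sparser network in which only the strongest correlations remain.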

The following network is obtained for the demo project. Immune response to malaria proteins (nodes) is highly positively correlated (yellow edges) and most malaria proteins form one dense cluster.

DSNetwork.png


Weighted correlation network analysis (WGCNA)

Weighted correlation network analysis (WGCNA) can be used for finding clusters (modules) of highly correlated features (e.g. genes), for relating modules to external sample traits (using eigengene network methodology), and for calculating module membership measures. Relating modules instead of nodes to a sample trait can alleviate the multiple testing problem. Correlation networks facilitate network based screening methods that can be used to identify candidate biomarkers or therapeutic targets. A detailed description of the algorithm can be found here.


Tutorial

  • Open the WGCNA Page. Make sure that the Demo Project has been loaded first, as described above.
  • Select sample trait using the Relate modules to drop down menu
  • Press Run


Four figures are generated:

  • Figure 1: Modules of correlating features are identified by hierarchical clustering. The dendrogram is based on the TOM (topological overlap matrix) dissimilarity. Highly correlating features are clustered into modules (presented as a coloured bar; each colour represents one module).
  • Figure 2: To identify biologically interesting modules, each module (represented by an eigengene) is associated with the selected trait. The figure presents a p-value for each module, indicating the significance of the association between the module and the trait (Student asymptotic p-value for the given correlations).
  • Figure 3: The significance of associations between individual features and the selected trait is presented as a box plot. The significance is shown separately for each module.
  • Figure 4: Hierarchical clustering of the eigengenes representing each module. The figure shows the similarity of modules.

Additionally, a table is generated presenting the features assigned to each module. The table also shows the significance of each feature in relation to the selected trait (high values indicate high significance). The connectivity measures how strongly a feature is correlated with all other features in the network. This information allows the identification of highly connected genes (hubs), which are potential key drivers of the modules.

WGCNA.png
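The connectivity column can be illustrated with a minimal sketch. It assumes the usual unsigned WGCNA adjacency, a_ij = |cor_ij|^beta, and defines the connectivity of a feature as the sum of its adjacencies to all other features. The soft-thresholding power beta = 6 is an arbitrary illustrative choice, and the correlation matrix is invented; the settings used by GMine are not stated in this manual.

```python
def connectivity(corr, beta=6):
    """Sum of soft-thresholded adjacencies |r|**beta to all other features."""
    n = len(corr)
    return [
        sum(abs(corr[i][j]) ** beta for j in range(n) if j != i)
        for i in range(n)
    ]


# Hypothetical correlations between four features; the first three
# are strongly correlated, the fourth is nearly independent.
corr = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.85, 0.2],
    [0.8, 0.85, 1.0, 0.05],
    [0.1, 0.2, 0.05, 1.0],
]

k = connectivity(corr)
hub = max(range(len(k)), key=lambda i: k[i])
print(k, hub)
```

Raising correlations to a power suppresses weak correlations much more than strong ones, so the feature with the highest connectivity is the one most tightly linked to the rest of its module.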

Biomarker Discovery

GMine provides simple yet powerful methods for biomarker discovery. Predictive biomarkers associated with two sample groups (e.g. cases and controls; responders and non-responders) are identified by t-test, Wilcoxon rank test, nested ANOVA, logistic regression or the random forest classifier. The discriminatory power of biomarker candidates is described by the area under the ROC curve (AUC), odds ratio, delta (difference in means in units of standard deviation) or fold change.

Biomarker discovery is only possible for variables/factors with exactly two different values (e.g. cases and controls).

Tutorial

  • Open the Biomarker Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Group by to Secondary Group
  • Set Filter to 30 to only include the top 30 features with highest mean value across samples
  • Press Draw Chart


For each feature a p-value, adjusted p-values (FDR and Bonferroni), Area Under the ROC Curve (AUC with 95% upper and lower confidence intervals), odds ratio (with 95% upper and lower confidence intervals), delta (difference in mean divided by standard deviation), and fold change are calculated. Odds ratios are visualized as forest plots. Together these values indicate if each feature is a potential predictive biomarker for classifying samples into the two classes.
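Three of these measures can be sketched for a single feature as follows. The sketch makes several assumptions that the manual does not confirm: the AUC is computed as the fraction of case/control pairs in which the case has the higher value (ties count half), delta uses the pooled standard deviation of both groups, and fold change is the ratio of the group means. The data are invented.

```python
import statistics


def auc(cases, controls):
    """Probability that a random case exceeds a random control (ties = 0.5)."""
    wins = sum((a > b) + 0.5 * (a == b) for a in cases for b in controls)
    return wins / (len(cases) * len(controls))


def delta(cases, controls):
    """Difference in means in units of the pooled standard deviation."""
    n1, n2 = len(cases), len(controls)
    pooled_var = (
        (n1 - 1) * statistics.variance(cases)
        + (n2 - 1) * statistics.variance(controls)
    ) / (n1 + n2 - 2)
    return (statistics.mean(cases) - statistics.mean(controls)) / pooled_var ** 0.5


def fold_change(cases, controls):
    """Ratio of the group means."""
    return statistics.mean(cases) / statistics.mean(controls)


# Hypothetical signal intensities of one feature in two groups.
cases = [4.0, 5.0, 6.0, 7.0]
controls = [2.0, 3.0, 4.0, 5.0]

feature_auc = auc(cases, controls)
feature_delta = delta(cases, controls)
feature_fc = fold_change(cases, controls)
print(feature_auc, feature_delta, feature_fc)
```

An AUC of 0.5 corresponds to no discrimination between the groups, while values close to 0 or 1 indicate a feature that separates them well.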

If Test is set to "LogisticRegression", p-values are calculated by logistic regression, incorporating the selected group as the dependent variable and the feature together with all factors as explanatory variables (group ~ feature + factor 1 + factor 2 + factor 3 + ...). All factors defined in the meta annotation file are included. The AUC and odds ratio are also adjusted for all factors: odds ratios are calculated by logistic regression with all factors incorporated as explanatory variables.

The following plot shows the forest plot obtained for the demo project.


DSBiomarker.png

Paired analysis

The Pairwise Page facilitates paired statistical analysis. Paired analysis makes use of paired study designs, where several samples were taken from the same individual (e.g. before and after treatment). Comparisons are done by paired t-test or paired Wilcoxon rank test.

Tutorial

  • Open the Pairwise Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Type to Scatter Plot
  • Press Select Mode
  • Set secondary group G2 to 2.After
  • Select test for paired analysis (paired Wilcoxon rank test or paired t-test)
  • Press Draw Chart

A scatter plot is shown. The x- and y-axes represent the values of the two selected secondary groups. Each dot represents one feature of the data matrix. In the demo project, features are malaria proteins, the secondary group represents the time point of sample collection (before and after the malaria season) and data values are measured signal intensities, representing host immune response to malaria proteins. A p-value is provided for each primary group, indicating whether measurements differ significantly between the two compared time points (secondary group). The following plot is obtained for the demo project. Dots are significantly shifted towards the upper left, indicating that signal intensities (host immune response to malaria proteins) are generally higher after the malaria season (y-axis). This shift is significant for each of the three age groups, as indicated by the p-values at the top of the plot.

DSPaired.png
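The idea behind the paired t-test can be sketched by computing its test statistic from the per-pair differences. The sketch stops at the t statistic (turning it into a p-value requires the t distribution with n - 1 degrees of freedom), and the before/after values are invented.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """t statistic of the paired t-test: mean difference / its standard error."""
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / standard_error


# Hypothetical signal intensities of four features before and
# after the malaria season (same order in both lists).
before = [1.0, 2.0, 3.0, 4.0]
after = [2.0, 4.0, 5.0, 5.0]

t_value = paired_t_statistic(before, after)
print(t_value)
```

Because only the within-pair differences enter the statistic, variation between pairs does not mask a consistent shift, which is the advantage of a paired design.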


Details

General Options

The following table describes general GMine options that are present on most analysis pages.

Field             Description
Figure Format     The file format of generated figures (PNG, PDF or SVG).
Order             The order of samples in the generated figures. Samples can be ordered by their primary group, secondary group, pair, or label.
Filter            Selects how many of the top features (highest mean across samples) are included. Set to 0 to include all features.
Color             Color palette for plots.
Secondary Group   Used to filter samples by their secondary group. Select "All" to include all samples.
Distance          Distance metric for computing pair-wise distances of samples.
Resolution        Resolution of the generated plot in dpi. Allowed range: 20-1000.
Width             Width of generated plot in mm.
Height            Height of generated plot in mm.
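Width and height (in mm) together with the resolution (in dpi) determine the pixel dimensions of a raster figure. The following sketch shows the standard conversion (25.4 mm per inch); it is a general calculation and an assumption about how the settings interact, not a description of GMine's rendering code.

```python
MM_PER_INCH = 25.4


def pixels(length_mm, dpi):
    """Convert a length in millimetres to pixels at the given resolution."""
    return round(length_mm / MM_PER_INCH * dpi)


# Hypothetical settings: a 200 mm x 150 mm figure at 300 dpi.
width_px = pixels(200, 300)
height_px = pixels(150, 300)
print(width_px, height_px)
```

Doubling the resolution doubles both pixel dimensions, so labels rendered at a fixed physical size need more room in mm rather than more dpi.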

Implementation

The GMine web frontend is implemented in Java using the JavaServer Faces architecture. Interactive views are facilitated by the JavaScript library D3.js. The backend is implemented in Perl and the R statistical programming language. No installation, configuration, registration or login is required. Data is kept private and cannot be viewed by other users. Uploaded data and calculated results are deleted after the user's session terminates.


FAQ

Error message: An internal error occurred, likely you either didn't upload a data and meta data file or your session has expired

Before using GMine, you need to upload a data and meta data file. Select Home in the top menu and upload a data and meta data file or press "Start Demo Project". Make sure that JavaScript is enabled in your browser settings and that cookies are allowed. If you receive this error message after pressing "Start Demo Project", your browser does not allow cookies.

You also receive this error message if your session has expired. Your session expires if you haven't used GMine for more than 60 minutes.

Figure labels are only partially displayed

Increase the figure width and height.

CCA changes if "Color by" is changed, but CCA+ does not

CCA provides a p-value indicating whether the variance observed in the community composition can be explained by the sample groups. The sample grouping is selected via "Color by". CCA+ tests whether variance in the community composition can be attributed to environmental variables. CCA+ does not include the sample groups in the statistical analysis; samples are merely colored by the grouping selected under "Color by".

Figure labels overlap

Increase the figure width and height.

Figure legend overlaps with chart

Increase the figure width and height.

Figures are completely white

Solution: Reduce resolution of figure or increase figure width and height.

Error message: Internal ERROR: null dataMatrix

Likely you forgot to upload a data and/or meta data file. Select Home in the top menu and upload a data and meta data file.

Error message: j_id_id15:resolution: Validation Error: Value is not of the correct type

You have entered an incorrect value in one of the text fields. Valid value ranges are:

  • Resolution: 20-1000
  • Width: 20-10000
  • Height: 20-10000
  • Min proportion: 0-100 (real numbers are supported, e.g. 0.2)
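Values can be checked against these ranges before they are entered. The following sketch is a hypothetical helper, not part of GMine; it treats every field as a number, although the manual only states explicitly that Min proportion accepts real numbers.

```python
# Valid ranges as listed in this FAQ entry (inclusive bounds assumed).
VALID_RANGES = {
    "Resolution": (20, 1000),
    "Width": (20, 10000),
    "Height": (20, 10000),
    "Min proportion": (0, 100),
}


def check_field(name, raw_value):
    """Return True if raw_value is a number within the range for name."""
    try:
        value = float(raw_value)
    except ValueError:
        # Non-numeric input, e.g. text typed into a numeric field.
        return False
    low, high = VALID_RANGES[name]
    return low <= value <= high


results = {
    "resolution_ok": check_field("Resolution", "300"),
    "resolution_too_high": check_field("Resolution", "5000"),
    "proportion_ok": check_field("Min proportion", "0.2"),
    "width_not_numeric": check_field("Width", "wide"),
}
print(results)
```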