Genomics Data Miner

Revision as of 02:59, 21 March 2016 by Lutz (Talk) (Details)


URL DataSmart server: http://cgenome.net/datasmart/


Disclaimer

This program is provided in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. The software may be used at your own risk. If you decide to use DataSmart in published work, it is YOUR responsibility to ensure the correctness and consistency of the data.

Screenshots

DS1.png DS2.png DS3.png DS4.png DS5.png DS6.png

DSLogo.png

Introduction

DataSmart is a powerful, yet easy to use, tool for the higher-level analysis of biomolecular data. The software has been developed with a focus on protein microarrays, but can be used for any n x m data matrix (with n x m < 2M), where the columns correspond to samples, the rows represent features (e.g. genes, proteins, genomic regions) and the values of the data matrix represent intensities, frequencies or counts (e.g. signal intensities, expression values, methylation levels or number of observations).

In the field of infectious diseases, protein microarrays are widely used for measuring the host immune response to proteins of infectious agents, including malaria parasites, schistosomes and viruses. Protein microarrays allow measurement of the antibody response (e.g. IgE or IgG) against the entire proteome of an infectious agent, yielding large and complex datasets with signal intensities for thousands of proteins and hundreds of samples. Antibody response can be attributed to multiple factors, including gender, age, geographic location and infection history. The analysis of this type of data therefore usually requires advanced statistical methods that can account for the various contributing factors and infer complex associations between antibody response, clinical variables and other covariates.

DataSmart provides a broad range of data-mining techniques, including quantitative visualizations (e.g. boxplots, bubble plots and heatmaps), parametric and non-parametric statistical testing, univariate and multivariate analysis, supervised learning, correlation networks, clustering and multivariate regression. The software enables lab-based researchers, who may be unfamiliar with advanced statistical software packages, to use these complex types of analysis routinely in their work. It generates publication-quality images in PDF, SVG or PNG format.

Tutorial

Input Data

As input, DataSmart requires an n x m data matrix and an annotation file providing metadata for each sample.

Data Matrix

The data matrix holds measurements (e.g. signal intensities) recorded for multiple features (e.g. antibodies, genes, proteins) across all samples. The values are numeric and continuous.

Meta annotation file

The annotation file provides meta information for each sample, including the sample group (e.g. case/control). Additionally, multiple optional factors (also called environmental parameters) can be provided, which serve as explanatory variables. Factors can be independent variables (frequently manipulated by the experimenter, e.g. case/control) or confounding factors (e.g. age, gender, BMI). Factors can be discrete (e.g. case/control, gender, geography, treatment response) or numeric (e.g. age, days after treatment, BMI, blood glucose level). Complex associations between multiple factors and measurements (features) can be inferred using multivariate statistical methods.

Start demo project

Before taking this tutorial please upload an example dataset. The tutorial provides links to results pages of DataSmart. These links won't be functional if no data is uploaded.

To start the demo project:

  • Go to the Data Upload Page
  • Press "Start Demo Project" to automatically upload an example data matrix and meta annotation file


The data of the demo project was generated in a prospective study investigating the host immune response to Plasmodium falciparum (Crompton et al. PNAS 2010). Crompton et al. used a protein microarray consisting of 2,320 probes (representing ~23% of the P. falciparum proteome) to profile the host immune response (IgG) against malaria parasite proteins. The cohort comprises 220 individuals in the age ranges 2–10 years and 18–25 years from Kambila, Mali. Samples were collected before (May) and after (December) the 6-month malaria season. The dataset was kindly provided by the authors.

Reference: Crompton et al. A prospective analysis of the antibody (Ab) response to Plasmodium falciparum before and after a malaria season by protein microarray. Proc. Natl Acad. Sci. USA. 2010;107:6958–6963.

Demo project data matrix

Columns represent samples and rows malaria parasite proteins (antigens or features). The matrix stores signal intensities, representing host immune response (IgG antibody response) against malaria proteins. A total of 155 subjects, 310 samples (2 samples per subject) and 249 malaria proteins were included in the data matrix of the demo project.

Demo project data matrix

DSDeta.png

Demo project meta data file

The meta data file was created in Excel. The primary group (4th column) represents the age group of the included subjects (age group 1: 2-4 years, age group 2: 5-10 years, age group 3: 18-25 years). Two samples were collected from each subject, one before the malaria season (May) and one after the malaria season (December). This variable was included as the secondary group (5th column). The following factors were included for multivariate analysis:

  • parasitemic: 0/1; 1 if the subject had microscopically detectable parasites at the time of sampling, 0 if the subject was parasite-free
  • hemoglobin type: AA, AC, CC or NA (not available); hemoglobin type of the subject
  • gender: male/female
  • malaria episode: number of clinical malaria episodes between May and December (range 0-5)
  • days until first episode: days until the first clinical episode after the first sample collection in May (range 29-242). For example, a value of 242 indicates that a subject did not experience a clinical malaria episode during the 8-month period after the first sample was taken in May.

       


Demo project meta data file

DSMeta.png



Download the example data file representing signal intensities of 250 proteins in XX individuals and the meta data file providing meta information for each sample.

Upload own data files

The data upload wiki page provides detailed information on how to upload your own data and meta information files.


Output

DataSmart generates high-quality figures in PNG, PDF or SVG format, which can readily be used in publications. Resolution, width and height can be specified, as well as the colour palette. Generated SVG images can be modified in vector graphics editors, such as Inkscape, Adobe Illustrator, Adobe Flash Professional or CorelDRAW. This allows changing the image font size or colours and adding labels or features. DataSmart also presents results in sortable tables, which can be downloaded in comma-separated format.

Main Navigation Menu

DataSmart is accessible via a free web-server, which provides a user-friendly interface to advanced statistical methods and data-mining algorithms. The top menu provides links to the various data analysis pages, a help page and a tutorial.


DSMenue.png

Visualise measurements

Data is visualized quantitatively using heatmaps, bubble plots, scatter plots, stripcharts, barcharts or boxplots. The SamplePlots Page uses bubble plots, heatmaps and bar charts to visualise measurements by square size, colour code and bar height, respectively. When visualizing data in heatmaps, DataSmart allows trimming of outliers, selection of a wide range of color palettes and adjustment of the color range center.

Generate Boxplot

  • Open the SamplePlots Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set chart type to Boxplot
  • Choose a colour palette
  • Set Filter to 0 to include all features (malaria proteins) in the box plot
  • Press Draw Chart
  • To obtain high-quality figures for publication, set figure resolution, width and height.

The following two figures are obtained for the top 50 malaria proteins of the demo project (first figure) and all proteins of the demo project (second figure). The x-axis represents samples, the y-axis represents the host immune response to malaria proteins, and the boxes represent the distribution of the host immune response. Samples are coloured by age group. The figures suggest that the immune response to malaria proteins increases with age from age group 1 (2-4 years of age) to age group 2 (5-10 years of age) to age group 3 (18-25 years of age).


DSBoxplot.png

DSBoxplotFull.png

Generate Barchart

  • Open the SamplePlots Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set chart type to Barchart
  • Select the order of samples in the generated figure, e.g. by primary group (groupP), subject (Pair) or secondary group (groupS)
  • Choose a colour palette
  • Set Filter to 20 to only show the top 20 features with the highest mean across all samples
  • Press Draw Chart
  • To obtain high-quality figures for publication, set figure resolution, width and height. Increase the width and/or height if labels are displayed only partially or overlap, or if the legend and chart overlap
  • Colors are randomly assigned to each feature. Press Draw Chart again to re-assign colours.

The following bar chart is obtained for the demo project. Bar charts are useful for presenting datasets with a low number of features, but provide only limited value for large datasets.

DSBarChart.png
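The Filter setting used in the steps above keeps only the features with the highest mean across all samples. A minimal Python sketch of that selection is given below. It is only an illustration: the function name, the dictionary layout and the toy values are assumptions, and DataSmart's own implementation is not shown in this tutorial.

```python
from statistics import mean


def top_features_by_mean(matrix, n_top):
    """Return the names of the n_top features with the highest mean.

    matrix maps a feature name to its list of measurements (one per sample).
    n_top == 0 keeps every feature, mirroring "Set Filter to 0".
    """
    ranked = sorted(matrix, key=lambda name: mean(matrix[name]), reverse=True)
    return ranked if n_top == 0 else ranked[:n_top]


# Toy data: three hypothetical features measured in four samples.
demo = {
    "protein_A": [100, 120, 90, 110],
    "protein_B": [900, 1100, 950, 1050],
    "protein_C": [10, 20, 15, 5],
}
print(top_features_by_mean(demo, 2))  # ['protein_B', 'protein_A']
```

Ranking by mean favours features with strong overall signal, so a weakly expressed but informative feature can be dropped by a low Filter value.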

Generate Heatmap

  • Set chart type to Heatmap+
  • Select BlueGoldRed colour palette
  • Set Filter to 0 to include all features
  • Select the ReOrderSamples checkbox. If ReOrderSamples is selected, samples will be ordered by hierarchical clustering. Otherwise they will be ordered as selected in the drop down menu Order.
  • Unselect the Scale data checkbox. If Scale data is selected, the values in each heatmap row are scaled to the range 0-1.
  • Set image resolution to 200, width to 530 and height to 200. These settings can be used to obtain high-quality figures for publication. Width and/or height can be increased if labels are displayed only partially or overlap, or if the legend and chart overlap
  • Press Draw Chart

Rows of the generated heatmap represent features, and columns represent samples. The colour bars above the heatmap represent the values of the factors specified in the meta annotation file.

DSHeatMap1.png
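The Scale data option mentioned in the steps above rescales each heatmap row to the range 0-1. One common way to do this is min-max scaling, sketched below in Python. That DataSmart uses exactly this formula is an assumption, and the function name is hypothetical.

```python
def scale_row(values):
    """Min-max scale one heatmap row (one feature) to the range 0-1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant row has no spread; map it to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


print(scale_row([2000, 6000, 10000]))  # [0.0, 0.5, 1.0]
```

Row scaling makes features with very different signal ranges comparable by colour, at the cost of hiding absolute intensities.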

Next:

  • Set Trim values to 40,000. All values above 40,000 will be trimmed to 40,000. This can be used to trim outliers.
  • Press Draw Chart

Next:

  • Set Color centre to 6,000. Color centre changes the distribution of assigned colours and allows shifting the colour palette. This can be used to increase the resolution for low data values. The intermediate colour of the chosen colour palette will be assigned to the entered value. For example, if BlueGoldRed is chosen as colour palette, and Colour centre is set to 6,000, then gold is assigned to values of around 6,000; blue is assigned to values<6,000 and red is assigned to values >6,000.
  • Press Draw Chart

The following figure is obtained for the example dataset: DSHeatmap.png
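The effect of Trim values and Color centre can be pictured with a short Python sketch. It caps values at the trim threshold and then maps each value to a position on a three-colour palette so that the chosen centre lands on the intermediate colour. The linear interpolation on either side of the centre, the function names and the toy intensities are assumptions made for illustration; the tutorial does not state how DataSmart computes the mapping.

```python
def trim(values, cap):
    """Cap every value at `cap`, as the Trim values setting is described to do."""
    return [min(v, cap) for v in values]


def palette_position(value, low, centre, high):
    """Map a value to a position in [0, 1] on a three-colour palette.

    0.0 is the first colour, 0.5 the intermediate colour (assigned to
    `centre`) and 1.0 the last colour. Each half is interpolated linearly,
    which is an assumption of this sketch. Requires low <= centre <= high.
    """
    value = max(low, min(value, high))  # clamp into the plotted range
    if value <= centre:
        span = centre - low
        return 0.5 if span == 0 else 0.5 * (value - low) / span
    return 0.5 + 0.5 * (value - centre) / (high - centre)


signals = [500, 6000, 23000, 75000]   # hypothetical signal intensities
trimmed = trim(signals, 40000)        # 75000 is capped at 40000
positions = [palette_position(v, 0, 6000, max(trimmed)) for v in trimmed]
print(trimmed)
print(positions)
```

With a centre of 6,000, half of the palette is spent on values between 0 and 6,000, which is why a low centre increases the colour resolution for low data values.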


Next:

  • Choose BlueGreenYellow colour palette
  • Press Draw Chart

The following figure is obtained for the example dataset: DSHeatmap2.png


Next:

  • Redo the analysis using different colour palettes and different cut-offs for Trim and Colour centre.


Compare measurements across sample groups

Switch to the GroupPlots Page. Measurements across sample groups are compared by parametric and non-parametric statistical tests, including ANOVA, nested ANOVA, t-test, paired t-test, Bayesian t-test, Wilcoxon rank test and Mann–Whitney U test. Significantly different features are shown as box plots, bar charts or strip charts. Measurements can, for example, be gene expression values, signal intensities or methylation levels.


Tutorial

  • Open the GroupPlots Page. Make sure that the Demo Project has been loaded, as described above.
  • Select sample group (either primary or secondary group; these groups are defined in the meta information file uploaded by the user). The selected groups will be used for the comparison.
  • Press SelectMode
  • Set the plot type. AnovaPlot compares measurements (signal intensities for the example dataset) by ANOVA and visualises features with significantly different measurements as a bar chart, with standard deviation as error bars. RankTest compares data values by a non-parametric rank test (Kruskal-Wallis). Significantly different features are visualised as a bar chart. Error bars depict standard deviation. Significance of differences is depicted as: * (p<0.05), ** (p<0.01) and *** (p<0.001)
  • Set a significance threshold. Only features that are significantly different with a p-value below this threshold are shown
  • Press DrawChart
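The star notation used in the plots follows the thresholds listed in the steps above. As a small illustration, the mapping from a p-value to its label could be written as follows in Python (the function name is hypothetical):

```python
def significance_stars(p_value):
    """Translate a p-value into the star label used in the plots."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # not significant at the 0.05 level


for p in (0.2, 0.03, 0.004, 0.0002):
    print(p, significance_stars(p))
```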


Statistical comparison of sample groups

The Stats Page allows an in-depth statistical comparison of measurements across sample groups. Sample groups are compared by parametric and non-parametric statistical tests, including ANOVA, nested ANOVA, t-test, paired t-test, Bayesian t-test, Wilcoxon rank test and Kruskal-Wallis test. P-values are adjusted for multiple testing by FDR or Bonferroni correction.


Tutorial

  • Open the Stats Page. Make sure that the Demo Project has been loaded first, as described above.
  • Select a statistical test, e.g. Anova or rank test. Rank test will compare measurements by Wilcoxon rank test for two sample groups and by Kruskal-Wallis test for more than two sample groups
  • Set Filter to 50. Only the top 50 features with the highest mean across all samples are included.
  • Press Select Mode
  • Select sample group (either primary or secondary group; these groups are defined in the meta information file uploaded by the user)
  • Press Do stats
  • Click on the table header to sort table, e.g. by p-value.

Results of the statistical analysis are presented as a table. Shown are p-values, Bonferroni-corrected p-values, the false discovery rate (Benjamini-Hochberg) [1] and the mean in each group.
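To make the two corrections concrete, the following Python sketch applies the textbook Bonferroni and Benjamini-Hochberg adjustments to a list of p-values. It is an illustration rather than DataSmart's own code; the function names and the example p-values are made up.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is the
    # minimum of its own p * m / rank and those of all larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni controls the chance of any false positive and is therefore stricter; Benjamini-Hochberg controls the expected proportion of false positives among the reported features.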

The distribution of p-values is presented as a histogram and a quantile-quantile (QQ) plot. In the lower left figure, the uniform (expected) p-value distribution is indicated by the red line 'Expected'. QQ plots characterize the extent to which the observed distribution of the test statistics follows the expected (null) distribution. This allows the detection of evidence for systematic bias.

PValueDistribution.png
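The idea behind a p-value QQ plot can be sketched in a few lines of Python: the sorted observed p-values are paired with the values expected under a uniform null distribution. The -log10 scale and the i/(n+1) plotting positions used here are common conventions and are assumptions of this sketch, not a description of how DataSmart draws its plot.

```python
import math


def qq_points(p_values):
    """Pair expected and observed -log10 p-values for a QQ plot.

    All p-values must be greater than 0, otherwise the logarithm is undefined.
    """
    n = len(p_values)
    observed = sorted(p_values)
    # Expected i-th smallest of n uniform draws: i / (n + 1).
    expected = [i / (n + 1) for i in range(1, n + 1)]
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]


# p-values that match the uniform expectation exactly fall on the diagonal.
for x, y in qq_points([0.5, 0.25, 0.75]):
    print(round(x, 3), round(y, 3))
```

Points that rise above the diagonal across the whole range, rather than only at the smallest p-values, are the pattern that usually hints at systematic bias.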

Random forest classification

Random forest classification can be applied to examine complex associations between multiple features (e.g. immune response to malaria parasite proteins) and a study variable (e.g. protected from malaria infection).

The following tutorial explains how to run a random forest analysis to identify proteins predictive of age group.

  • Open the Stats Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set statistical test to "Random forest"
  • Press Select Mode
  • Set Filter to 20; to reduce computing/waiting time, only the top 20 proteins will be included
  • Set "Group by" to "Primary group". In the demo project, primary group represents age group.
  • Press Do stats
  • Click on the table header to sort table by "Score (Mean Decrease Accuracy) "

Malaria proteins most predictive of age group are now shown at the top of the table.

Random forest is implemented in R:

    library(randomForest)
    rf <- randomForest(as.factor(groups) ~ x, importance=TRUE, proximity=FALSE, ntree=10000, mtry=20)
    im <- as.data.frame(importance(rf))
    score <- im$MeanDecreaseAccuracy

where groups represents the selected sample group and x represents the data matrix.


Multivariate analysis

The Multivariate Page facilitates multivariate data visualisation and multivariate statistical testing. Multivariate statistics are powerful techniques that can identify complex associations between measurements (data matrix) and multiple factors. In DataSmart, complex associations can be examined by the multivariate methods principal component analysis (PCA), redundancy analysis (RDA), canonical correspondence analysis (CCA), detrended correspondence analysis (DCA), non-metric multidimensional scaling (NMDS), hierarchical clustering, heatmaps, correlation networks and multivariate regression. Correlation networks visualize the positive and negative associations between features, between factors, and between features and factors.


To explore complex associations between measurements and multiple factors using multivariate statistics, set Type to Heatmap+, RDA+ or CCA+. If RDA+ or CCA+ is chosen, all factors defined in the meta annotation file are included. For each factor a p-value is computed, indicating whether the factor significantly explains variation in the values of individual features (RDA+) or variation in sample distances (CCA+).

Correlation heatmap

Pearson correlations between factors and features (measurements) are shown as a heatmap.

  • Set Type to Heatmap+
  • Press Select Mode
  • Press Draw Chart

In the figure below, positive correlations are shown in red, negative correlations in blue. DSCorrelationHeatmap.png
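Each cell of such a heatmap is one Pearson correlation coefficient between a factor and a feature. A minimal Python sketch of the coefficient is shown below; the function name and the toy values are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


age = [2, 5, 18, 25]          # hypothetical numeric factor
signal = [4, 10, 36, 50]      # hypothetical feature, rising with age
print(pearson(age, signal))            # close to +1 (shown in red)
print(pearson(age, signal[::-1]))      # negative (shown in blue)
```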

Principal component analysis (PCA)

PCA visualises the data as a 2D plot, which helps to identify sample clusters and potentially problematic samples (outliers) that may need to be excluded from downstream analysis.

  • Open the Multivariate Page. Make sure that the demo project has been loaded first, as described above.
  • Set Type to PCA
  • Press Select mode
  • Set colour to blueYellowRed
  • Set Group/Colour by to Primary Group. Samples in the PCA plot will then be coloured according to their primary group.
  • Set Hull to Filled Spider. In the PCA plot, samples of one group will be connected by lines as a spider plot.
  • Press Draw Chart

The following PCA plot is obtained for the demo project. Samples seem to cluster by primary group (age group). To test if this grouping is significant, run a CCA (next section).

DSPCA.png

Canonical correspondence analysis (CCA)

The PCA plot presented above indicates that samples cluster by primary group (age group). The statistical significance of the observed clustering can be tested by CCA.

  • Set Type to CCA
  • Press Select mode
  • Set Group/Colour by to Primary Group. A CCA will be run for the selected group.
  • Set distance metric to "Euclidian".
  • Press Draw Chart

CCA is a multivariate method that is used to explore complex associations between measured variables and multiple explanatory variables. CCA tests if variations in the data matrix can be explained by the selected sample group (the sample group selected under the drop down menu "Group/Colour by").

Two figures are generated. The first shows a 2D ordination plot, indicating how well samples can be separated according to the sample groups. The second plot provides a p-value, indicating if the sample group significantly explains variations in the sample distances, or in other words if samples cluster significantly by sample group.

The following result is obtained for the demo project.

According to these results, age group (the primary group) significantly explains variations observed in antibody response (p=0.001).


DSCCA2.png

DSCCAP2.png

Principal coordinates analysis (PcOA)

PcOA visualises the data as a 2D plot, which helps to identify sample clusters and potentially problematic samples (outliers) that may need to be excluded from downstream analysis.

  • Open the Multivariate Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Type to PcOA
  • Press Select mode
  • Set distance metric to "Euclidian". Measurement profiles can be compared by a wide range of distance metrics, including Euclidean, Manhattan, inverse Pearson's correlation and Bray-Curtis index.
  • Press Draw Chart

DSPcOA.png

Redundancy analysis (RDA+)

RDA+ includes all factors defined in the meta annotation file.

  • Set Type to RDA+
  • Press Select mode
  • Set colour to blueYellowRed
  • Set Filter to 50; only the top 50 antigens will be included to speed up processing/waiting time
  • Press Draw Chart

RDA is a multivariate method that is used to explore complex associations between measured variables and multiple factors. All factors defined in the meta annotation file are included in the multivariate analysis. Two figures are generated. The first shows a 2D ordination plot, indicating how well samples can be separated according to the defined factors. Samples will be coloured by the variable selected under "Group/Colour by". The p-values reported in the second figure indicate if each factor is significantly associated with variation in the data matrix (i.e. if the factor significantly explains variation in sample distances).

The following result is obtained for the example dataset. Parasitemic (positive for malaria parasites), Malaria.Episodes.May.December (the number of malaria episodes between May and December) and Days.Until.First.Episode are significantly associated with variation in immune response (p<0.05). Gender and hemoglobin type are not associated with the host immune response to malaria proteins.

DSRDA.png

DSRDAP.png

Support vector machine classification (SVM)

The discriminatory power of the uploaded data to distinguish between two sample groups (e.g. cases vs control) can be examined using a Support Vector Machine evaluated by leave one out cross validation. Or in other words, SVM leave one out cross-validation (SVM LOOC) can be used to assess if measurements of the data matrix are predictive of sample groups. The classification performance is described by overall accuracy, sensitivity and specificity.

For example, using the demo project SVM LOOC can be employed to examine if host immune response is predictive of sample time point (e.g. before and after malaria season) or if subjects are protected from malaria infection.

The chosen sample group must have exactly two different values, e.g. case/control, protected/unprotected, male/female.

SVM LOOC works iteratively. In each step, one sample is excluded and an SVM is trained on the remaining samples to discriminate between the two classes. The trained SVM is then applied to predict the class of the excluded sample. This is repeated for each sample. Finally, the predicted class is compared with the known class of each sample to calculate the classification accuracy (percentage of correctly predicted class labels), sensitivity (TP/(TP+FN)) and specificity (TN/(TN+FP)).
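The leave-one-out loop and the resulting performance measures can be sketched in Python as follows. To keep the example self-contained, a simple nearest-centroid classifier stands in for the SVM, so this is not the classifier DataSmart uses; only the cross-validation logic and the standard definitions sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP) are illustrated. All names and values are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier (NOT an SVM): pick the class with the closest mean profile."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], sample))


def leave_one_out(samples, labels, positive):
    """Hold out each sample once, predict it, and summarise the confusion matrix."""
    tp = tn = fp = fn = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted = nearest_centroid_predict(train_x, train_y, samples[i])
        actual = labels[i]
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(samples),
        "sensitivity": tp / (tp + fn),   # fraction of positives found
        "specificity": tn / (tn + fp),   # fraction of negatives found
    }


# Six hypothetical samples with two features each; the last one is atypical.
samples = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [2, 2]]
labels = ["May", "May", "May", "December", "December", "December"]
print(leave_one_out(samples, labels, positive="December"))
```

Because every sample is predicted by a model that never saw it, the reported accuracy is a less optimistic estimate than one computed on the training data itself.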

Tutorial

  • Set Type to SVM LOOC
  • Press Select mode
  • Set Group/Colour by to "Secondary group" to run an SVM LOOC for the secondary group variable (sampling time point before (May) and after (December) the malaria season).
  • Set Filter to 20. SVM LOOC is a time-consuming analysis. To test this function, limit the number of included features (malaria proteins) to 20 to speed up the computation.
  • Press Draw Chart

The top 20 antigens are able to predict the sampling time point with 69% accuracy. 81% of samples collected in December were correctly classified into this class (sensitivity = 0.81). 57% of samples collected in May were correctly classified into this class (specificity = 0.57).


DSSVM.png

Multivariable linear regression

The Regression Page facilitates multivariable linear regression. Multivariable linear regression is a powerful technique that can identify complex associations between measurements (data matrix) and multiple factors. Multiple covariates (variables controlled by the experimenter or confounding variables) can be included in the analysis.

Identify feature-factor associations by multivariate regression

  • Open the Regression Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set "Regress by" to "Features vs Factors" and press "Set Mode"
  • Set Filter to 30 to restrict the analysis to the top 30 features with the highest mean across samples
  • Press Run Analysis

The displayed table shows associations between multiple factors (as defined in the meta information file) and features (as provided by the uploaded data matrix). For each feature a regression model is fit, including the feature as dependent variable and all factors as explanatory variables:

feature = fa1 + fa2 + fa3 …,

where fa1, fa2, ... are the factors (all factors defined in the meta annotation file). P-values are shown for each factor-feature combination, indicating the significance of associations. Click on the header to sort the displayed table, e.g. by p-value.

The following results table is obtained for the demo project. The table was sorted by p-values obtained for factor "parasitemic" (by clicking on Parasitemic.p). Parasitemic (0/1) represents if subjects were negative or positive for the malaria parasite. A number of antigens are significantly associated with this factor (p<0.05), indicated in blue. Thirteen antigens are still significantly associated with the parasitemic factor after correction for multiple testing by FDR (column Parasitemic.p.fdr). Significant associations for other factors can be viewed by re-sorting the table.

DSRegressionTable.png

Explore associations between single feature and all factors by multivariable regression

The host immune responses to different malaria proteins are highly correlated. It is therefore not easy to assess whether the immune response to a specific protein is protective against malaria. Complex associations of this type can be explored by multivariable regression.

  • Open the Regression Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set "Regress by" to "Malaria Episodes May.December" to explore associations between the number of malaria episodes between May and December and the host immune response to malaria proteins
  • Press "Set Mode"
  • Press Run Analysis
  • Click on "P" in the generated table to order data by p-value

For each malaria protein (feature) the table shows the correlation R between the host immune response to that protein and the number of malaria episodes between May and December. A p-value is given, indicating if the observed correlation is significant. Additionally, the table presents the mean signal intensity (immune response) across all samples and the number of positive samples (immune response > 0).

The top four proteins with the highest abs(R) (absolute value of the correlation coefficient) are selected and the association of these proteins with the number of malaria episodes is examined by multivariable linear regression. This model incorporates malaria episodes as dependent variable and the top four malaria proteins as explanatory variables: number of episodes ~ protein 1 + protein 2 + protein 3 + protein 4

The results are presented as a figure. For each included protein the p-value computed by multivariable regression is reported, indicating if the protein is significantly associated with the number of malaria episodes. Additionally, a scatter plot is shown, plotting the signal intensity of that protein (host immune response) versus the number of malaria episodes between May and December.

DSRegressionEpisodes.png


Feature selection

The feature selection page provides standard feature selection methods to identify relevant features associated with an explanatory variable or confounding factor. Feature selection methods are based on the premise that data frequently contains many features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information[2]. Feature selection algorithms identify new feature subsets that best predict an outcome of interest (in this case a factor defined in the meta annotation file).


Identify feature-factor associations by multivariable step-wise regression

The following tutorial identifies the subset of relevant malaria proteins predicting the number of patient malaria episodes between May and December.

  • Open the Feature Selection (FS) Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Method to "Step-wise regression" and press "Select Mode"
  • Set Group by to "Malaria Episodes May.Dec"
  • Set Direction to "Forward & Backward"
  • Press Run Analysis

The displayed figure shows the number of features included in the analysis, the number of selected relevant features, the AIC of the final model and the Area Under the Curve (AUC) of the final model (the final model selected by step-wise regression). Additionally, the selected features are presented. The plot on the lower left shows the number of malaria episodes vs. the number of malaria episodes predicted by the final regression model (the final regression model including all selected features).

Available methods:

  • Step-wise regression: implemented using the step() R function from the stats package
  • LASSO regularised regression: LASSO performs both feature selection and regularisation to prevent overfitting. The method is implemented via the cv.glmnet() R function from the glmnet package.
  • Random Forest: Feature selection by random forest, implemented via the randomForest() R function from the randomForest package

Network analysis

Correlations between features can be presented as a network. Positive correlations are shown as yellow edges, negative correlations as blue edges and features as nodes. Network analysis identifies co-occurring and mutually exclusive features and clusters of correlating features.

Tutorial

  • Open the Network Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Type to Network+ to do a network analysis of all features and all factors defined in the uploaded meta information file.
  • Set color to select a black or white background colour
  • Correlation coefficient can be set to either Pearson's correlation or Spearman coefficient
  • Press Draw Chart


Only correlations larger than Edge Min Similarity or smaller than -1 * Edge Min Similarity are presented as edges.
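The edge rule above can be written out as a short Python sketch. It takes precomputed pairwise correlations and keeps only those beyond the Edge Min Similarity threshold, using the edge colours described for the network plot. The function name, the data layout and the example values are assumptions for illustration.

```python
def network_edges(correlations, edge_min_similarity):
    """Select the correlations that are drawn as edges.

    correlations maps a (node_a, node_b) pair to its correlation coefficient.
    """
    edges = []
    for (node_a, node_b), r in correlations.items():
        if r > edge_min_similarity:
            edges.append((node_a, node_b, "yellow"))   # positive correlation
        elif r < -edge_min_similarity:
            edges.append((node_a, node_b, "blue"))     # negative correlation
    return edges


# Hypothetical pairwise correlations between three features.
pairs = {("p1", "p2"): 0.9, ("p1", "p3"): -0.8, ("p2", "p3"): 0.2}
print(network_edges(pairs, 0.7))
# [('p1', 'p2', 'yellow'), ('p1', 'p3', 'blue')]
```

Raising the threshold thins out the network and leaves only the strongest associations, which can help when most features form one dense cluster.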

The following network is obtained for the demo project. Immune response to malaria proteins (nodes) is highly positively correlated (yellow edges) and most malaria proteins form one dense cluster.

DSNetwork.png

Biomarker Discovery

DataSmart provides simple yet powerful methods for biomarker discovery. Predictive biomarkers associated with two sample groups (e.g. cases and controls; responders and non-responders) are identified by t-test, Wilcoxon rank test, nested ANOVA, logistic regression or the random forest classifier. The discriminatory power of biomarker candidates is described by the area under the ROC curve (AUC), odds ratio, delta (difference in means in units of standard deviation) or fold change.

Biomarker discovery is only possible for variables/factors with exactly two different values (e.g. cases and control).

Tutorial

  • Open the Biomarker Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set "Group by" to Secondary Group
  • Set Filter to 30 to only include the top 30 features with highest mean value across samples
  • Press Draw Chart


For each feature a p-value, adjusted p-values (FDR and Bonferroni), Area Under the ROC Curve (AUC with 95% upper and lower confidence intervals), odds ratio (with 95% upper and lower confidence intervals), delta (difference in mean divided by standard deviation), and fold change are calculated. Odds ratios are visualized as forest plots. Together these values indicate if each feature is a potential predictive biomarker for classifying samples into the two classes.
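Three of these measures are simple enough to write out directly. The Python sketch below computes fold change, delta and an unadjusted AUC for one feature in two groups. Several details are assumptions made for illustration: fold change is taken as the ratio of group means, delta uses the pooled standard deviation, and the AUC is computed as the probability that a value from the first group exceeds a value from the second, with ties counted as one half. The tutorial does not state which variants DataSmart uses.

```python
from statistics import mean, stdev


def fold_change(group_a, group_b):
    """Ratio of the group means (assumed definition)."""
    return mean(group_a) / mean(group_b)


def delta(group_a, group_b):
    """Difference in means divided by the pooled standard deviation (assumed)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_variance ** 0.5


def auc(group_a, group_b):
    """Probability that a value from group_a exceeds one from group_b."""
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )
    return wins / (len(group_a) * len(group_b))


after = [4, 6, 8]    # hypothetical signal intensities, group 1
before = [2, 3, 4]   # hypothetical signal intensities, group 2
print(fold_change(after, before), delta(after, before), auc(after, before))
```

An AUC of 0.5 means the feature does not separate the groups at all, while values near 1 (or near 0) indicate strong separation.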

If Test is set to "LogisticRegression", p-values are calculated by logistic regression, with the selected group as the dependent variable and the feature plus all factors as explanatory variables (group ~ feature + factor 1 + factor 2 + factor 3 + ...). All factors defined in the meta annotation file are included. AUC and odds ratio are also adjusted for all factors: the odds ratios are derived from the same logistic regression model, in which all factors are incorporated as explanatory variables.
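As a rough illustration, the model formula and the step from a fitted coefficient to an odds ratio can be sketched in Python. This is not DataSmart's code: the names are hypothetical, the coefficient and standard error would come from a fitted model (for example R's glm with a binomial family), and the Wald-type 95% interval is an assumption about how the confidence limits are obtained.

```python
from math import exp


def model_formula(group, feature, factors):
    """Build an R-style formula string: group ~ feature + factor 1 + ..."""
    return f"{group} ~ " + " + ".join([feature, *factors])


def odds_ratio_with_ci(beta, std_error, z=1.96):
    """Odds ratio and Wald 95% interval from a logistic coefficient (assumed)."""
    return exp(beta), exp(beta - z * std_error), exp(beta + z * std_error)


# Hypothetical feature and factor names from a meta annotation file.
formula = model_formula("group", "protein_A", ["age", "gender"])

# Hypothetical fitted coefficient and standard error for the feature term.
odds_ratio, lower, upper = odds_ratio_with_ci(beta=0.7, std_error=0.2)
```

Because the factors are part of the model, the feature's odds ratio describes its association with the group after accounting for those factors.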

The following plot shows the forest plot obtained for the demo project.


DSBiomarker.png

Paired analysis

The Pairwise Page facilitates paired statistical analysis. Paired analysis makes use of paired study designs, where several samples were taken from the same individual (e.g. before and after treatment). Comparisons are done by paired t-test or paired Wilcoxon rank test.

Tutorial

  • Open the Pairwise Page. Make sure that the Demo Project has been loaded first, as described above.
  • Set Type to Scatter Plot
  • Press Select Mode
  • Set secondary group G2 to 2.After
  • Select test for paired analysis (paired Wilcoxon rank test or paired t-test)
  • Press Draw Chart

A scatter plot is shown. The x- and y-axes represent the values of the two selected secondary groups. Each dot represents one feature of the data matrix. In the demo project, features are malaria proteins, the secondary group represents the time point of sample collection (before and after the malaria season) and data values are measured signal intensities that reflect host immune response to malaria proteins. A p-value is provided for each primary group, indicating whether measurements differ significantly between the two compared time points (secondary group). The following plot is obtained for the demo project. Dots are significantly shifted towards the upper left, indicating that signal intensities (host immune response to malaria proteins) are generally higher after the malaria season (y-axis). This shift is significant for each of the three age groups, as indicated by the p-values at the top of the plot.

DSPaired.png
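The idea behind the paired t-test can be sketched in Python: the test works on the per-pair differences rather than on the two groups separately. This is an illustration only. How DataSmart forms the pairs internally is not described here, the values are hypothetical, and the p-value itself would be read from a t distribution with the returned degrees of freedom (in Python, scipy.stats.ttest_rel and scipy.stats.wilcoxon provide complete paired tests).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """t statistic and degrees of freedom for paired measurements."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical paired measurements (before and after the malaria season).
before = [1.0, 2.0, 3.0, 4.0]
after = [2.0, 4.0, 4.0, 6.0]
t_value, dof = paired_t_statistic(before, after)
```

A positive t statistic corresponds to values that are higher in the second group, matching a shift of dots towards the upper left of the scatter plot.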


Details

General Options

The following table describes general DataSmart options that are present on most analysis pages.

Field            Description
Figure Format    The file format of generated figures (PNG, PDF or SVG).
Order            The order of samples in the generated figures. Samples can be ordered by their primary group, secondary group, pair, or label.
Filter           Selects how many of the top features (highest mean across samples) are included. Set to 0 to include all features.
Color            Color palette for plots.
Secondary Group  Used to filter samples by their secondary group. Select "All" to include all samples.
Distance         Distance metric for computing pair-wise distances of samples.
Resolution       Resolution of the generated plot in dpi. Allowed range: 20-1000.
Width            Width of generated plot in mm.
Height           Height of generated plot in mm.
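The behaviour of the Filter option can be sketched as follows. This is a minimal Python illustration of "keep the top N features by mean across samples", not DataSmart's implementation; the function name and data are hypothetical.

```python
from statistics import mean


def filter_top_features(data, top_n):
    """Keep the top_n features with the highest mean; 0 keeps everything."""
    if top_n == 0:
        return dict(data)
    ranked = sorted(data, key=lambda name: mean(data[name]), reverse=True)
    return {name: data[name] for name in ranked[:top_n]}


# Hypothetical data matrix: feature name -> values across samples.
data = {
    "protein_A": [1.0, 2.0, 3.0],
    "protein_B": [10.0, 12.0, 11.0],
    "protein_C": [5.0, 5.0, 5.0],
}
top_two = filter_top_features(data, 2)
everything = filter_top_features(data, 0)
```

With a filter of 2, the sketch keeps protein_B and protein_C, the two features with the highest mean values.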

Implementation

The DataSmart web frontend is implemented in Java using the JavaServer Faces architecture. Interactive views are provided by the JavaScript library D3.js. The backend is implemented in Perl and the R statistical programming language. No installation, configuration, registration or login is required. Data is kept private and cannot be viewed by other users. Uploaded data and calculated results are deleted after the user's session terminates.


FAQ

Error message: An internal error occurred, likely you either didn't upload a data and meta data file or your session has expired

Before using DataSmart, you need to upload a data and meta data file. Select Home in the top menu and upload a data and meta data file or press "Start Demo Project". Make sure that JavaScript is enabled in your browser settings and that cookies are allowed. If you receive this error message after pressing "Start Demo Project", your browser does not allow cookies.

You also receive this error message if your session has expired. Your session expires if you haven't used DataSmart for more than 60 minutes.

Figure labels are only partially displayed

Increase the figure width and height.

CCA changes if "Color by" is changed, but CCA+ does not

CCA provides a p-value indicating whether the variance observed in the community composition can be explained by the sample groups. The sample grouping is selected with "Color by", so the CCA result changes with it. CCA+ tests whether variance in the community composition can be attributed to environmental variables. CCA+ does not include the sample groups in the statistical analysis, so its result stays the same; the selection under "Color by" only determines how samples are colored.

Figure labels overlap

Increase the figure width and height.

Figure legend overlaps with chart

Increase the figure width and height.

Figures are completely white

Reduce the figure resolution or increase the figure width and height.

Error message: Internal ERROR: null dataMatrix

You most likely forgot to upload a data and/or meta data file. Select Home in the top menu and upload a data and meta data file.

Error message: j_id_id15:resolution: Validation Error: Value is not of the correct type

You have entered an incorrect value in one of the text fields. Valid value ranges are:

  • Resolution: 20-1000
  • Width: 20-10000
  • Height: 20-10000
  • Min proportion: 0-100, real numbers are supported, e.g. 0.2
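The range check can be mimicked before submitting a form, for example when preparing settings in a script. The Python sketch below only mirrors the ranges listed above and is not the validation code used by DataSmart; the function name is hypothetical, and parsing every field as a real number is an assumption.

```python
# Valid ranges as listed above (inclusive bounds assumed).
LIMITS = {
    "Resolution": (20, 1000),
    "Width": (20, 10000),
    "Height": (20, 10000),
    "Min proportion": (0, 100),
}


def is_valid(field, raw):
    """Return True if raw parses as a number inside the field's range."""
    low, high = LIMITS[field]
    try:
        value = float(raw)
    except ValueError:
        return False  # not a number, e.g. "abc" or an empty string
    return low <= value <= high


# Hypothetical user input for four text fields.
checks = {
    "Resolution": is_valid("Resolution", "300"),
    "Width": is_valid("Width", "10"),
    "Height": is_valid("Height", "abc"),
    "Min proportion": is_valid("Min proportion", "0.2"),
}
```

In this example the Width value is below the minimum and the Height value is not a number, so both would be rejected.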