1. Introduction
CVCDAP stands for Cancer Virtual Cohort Discovery Analysis Platform. It can help perform rapid and effective cohort-level re-discovery for your research questions using the following advanced features.
- It implements an advanced query interface allowing
flexible selection of patients sharing common characteristics
across multiple studies and/or cancer types, by a free
combination of histological, pathological and molecular
criteria, as a virtual cohort for investigation.
- It provides dozens of built-in customizable tools for seamless
molecular and clinical analysis of a single cohort or comparisons
of two cohorts with relevance.
- Users can create their own Project by uploading somatic mutations and clinical data, which are seamlessly integrated with existing public projects in CVCDAP for cohort query and analysis.
2. Define Virtual Cohorts
2.1 Query cohort
CVCDAP allows users to create a virtual cohort by selecting samples with a free combination of histological, pathological and molecular criteria. The selected samples can be a subset of one study, or result from the combination of multiple studies.
- Tissue filter: Allows multiple selections of Primary Sites, Disease Types, or Project IDs.
- Clinical filter: Allows multiple selections of clinical variables including Stage, Grade, Gender, Race and Age.
- Molecular filter: Allows multiple selections of molecular features including mutated genes, copy number alteration, mRNA/protein expression deviation level, and tumor mutational burden (TMB) level.
2.2 Upload Cohort
An uploaded list of patient/sample IDs can be queried in CVCDAP, and the matched samples will be used to define a new cohort.
2.3 Cohorts Operation
Two defined cohorts can be combined on this page to create a new cohort for analysis. Available operations include intersection (patients common to both cohorts), union (all patients in either cohort) and subtraction (patients only in the first cohort).
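The three operations correspond to ordinary set operations on patient IDs. The following is a minimal sketch, not CVCDAP's own code: the function name `combine_cohorts` and the patient IDs are hypothetical and only illustrate the semantics.

```python
def combine_cohorts(first, second, operation):
    """Return a new cohort (a set of patient IDs) built from two cohorts."""
    if operation == "intersection":   # patients present in both cohorts
        return first & second
    if operation == "union":          # all patients of the two cohorts
        return first | second
    if operation == "subtraction":    # patients only in the first cohort
        return first - second
    raise ValueError(f"unknown operation: {operation!r}")


# Hypothetical patient IDs, for illustration only.
cohort_1 = {"P001", "P002", "P003", "P004"}
cohort_2 = {"P003", "P004", "P005"}

for op in ("intersection", "union", "subtraction"):
    print(op, sorted(combine_cohorts(cohort_1, cohort_2, op)))
```

Note that subtraction is order-dependent: swapping the two cohorts yields the patients found only in the second cohort.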
3. Single Cohort Analysis
3.1 Genomic analyses toolbox
Plots a sample-level overview for top mutated genes identified in a given cohort. Each column represents a sample and each row is a gene. The top barplot shows the frequency of mutations for each sample, while the right barplot represents the frequency of mutations in each gene. By default, samples will be ordered by the most mutated genes.
Parameters:
- Cohort: Select a cohort for analysis.
- topN: The number of top mutated genes to be included in the oncoplot. Default is 20. If the parameter "selectedGenes" is provided, this parameter is ignored.
- selectedGenes: A set of genes to be included in the oncoplot. At least two genes (HGNC symbols) are needed. Optional.
- Font size (Title): Font size for title. Default is 15.
- Font size (Legend): Font size for legend. Default is 12.
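To make the interplay between topN and selectedGenes concrete, here is a small sketch of how genes could be ranked by the number of mutated samples. It is an illustration under assumptions: the input format (pairs of sample ID and gene symbol), the function name `top_mutated_genes` and the tie-breaking by gene name are hypothetical, not the platform's implementation.

```python
from collections import defaultdict


def top_mutated_genes(mutations, top_n=20, selected_genes=None):
    """Rank genes by the number of distinct samples carrying a mutation.

    mutations: iterable of (sample_id, gene_symbol) pairs (assumed format).
    """
    samples_by_gene = defaultdict(set)
    for sample_id, gene in mutations:
        samples_by_gene[gene].add(sample_id)

    if selected_genes:
        # An explicit gene list takes precedence over top_n.
        return [(g, len(samples_by_gene.get(g, ()))) for g in selected_genes]

    # Sort by descending sample count, then by gene name for stable output.
    ranked = sorted(samples_by_gene.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(gene, len(samples)) for gene, samples in ranked[:top_n]]


# Hypothetical mutation calls; a sample is counted once per gene.
calls = [("S1", "TP53"), ("S2", "TP53"), ("S3", "TP53"), ("S3", "TP53"),
         ("S1", "KRAS"), ("S2", "KRAS"), ("S3", "EGFR")]

print(top_mutated_genes(calls, top_n=2))                      # [('TP53', 3), ('KRAS', 2)]
print(top_mutated_genes(calls, selected_genes=["EGFR", "KRAS"]))  # [('EGFR', 1), ('KRAS', 2)]
```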
This tool allows users to plot prevalence of mutations in given gene(s) for each project included in the analysis cohort.
Parameters:
- Cohort: Select a cohort for analysis.
- Genes: Provide one or two genes to check mutation status. Required.
- Width: Width of the output plot. Default is 7.
- Height: Height of the output plot. Default is 7.
This tool plots and compares the TMB distribution between the mutant (MT) group of patients, who carry somatic mutations in the given gene(s), and the wild-type (WT) group.
Parameters:
- Cohort: Select a cohort for analysis.
- Genes: Provide one or two gene symbols. Patients with somatic mutations in the given gene(s) form the mutant (MT) group. Required.
- Project: If checked, the distribution will be shown for each individual project included in the cohort to be analyzed.
- Test method: The available methods are “wilcox test” and “t test”; the default is “wilcox test”. If the parameter "Project" is checked, this parameter is ignored.
- Width: Width of the output plot. Default is 3.
- Height: Height of the output plot. Default is 4.
This tool illustrates contribution of known mutational
processes to somatic mutations identified in a cohort, which helps to understand associated mutational features and aetiology.
Parameters:
- Cohort: Select a cohort for analysis.
- ReferenceSigs: Select a reference signature set to be used in the analysis. "signatures (Nature 2013)" and "signatures (COSMIC)" are available. Required.
- TriCountsMethod: Normalization applied to the trinucleotide counts. Set to one of:
'default': no further normalization.
'exome': normalized by the number of times each trinucleotide context is observed in the exome.
'genome': normalized by the number of times each trinucleotide context is observed in the genome.
'exome2genome': multiplied by the ratio of that trinucleotide's occurrence in the genome to its occurrence in the exome.
'genome2exome': multiplied by the ratio of that trinucleotide's occurrence in the exome to its occurrence in the genome.
A data frame containing user-defined scaling factors: count data for each trinucleotide context is multiplied by the corresponding value given in the data frame.
This tool allows users to determine if query genes are
mutually exclusively mutated or co-occurring by pair-wise Fisher’s
Exact test.
Parameters:
- Cohort: Select a cohort for analysis.
- topN: The number of top mutated genes to be included in the plot. Default is 25. If the parameter "Genes" is provided, this parameter is ignored.
- Genes: A set of genes among which interactions will be tested. At least five genes are required. If not provided, the test will be performed among the top 25 mutated genes.
- Upper P-value: Upper p-value threshold. Default is 0.05.
- Lower P-value: Lower p-value threshold. Default is 0.01.
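For readers unfamiliar with the pair-wise test, the sketch below shows how a 2x2 contingency table for two genes can be built and tested with a two-sided Fisher's exact test. It is a pure-Python illustration, not the platform's code: the function names, the sample IDs and the rule of reading the direction from the odds ratio are assumptions made for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-9))
    return min(p_value, 1.0)


def gene_pair_interaction(mutated_a, mutated_b, all_samples):
    """Return (p-value, tendency) for two sets of mutated sample IDs."""
    a = len(mutated_a & mutated_b)                 # both genes mutated
    b = len(mutated_a - mutated_b)                 # only gene A mutated
    c = len(mutated_b - mutated_a)                 # only gene B mutated
    d = len(all_samples - mutated_a - mutated_b)   # neither gene mutated
    if a * d > b * c:
        tendency = "co-occurrence"
    elif a * d < b * c:
        tendency = "mutual exclusivity"
    else:
        tendency = "none"
    return fisher_exact_two_sided(a, b, c, d), tendency


# Hypothetical cohort of 10 samples with two genes mutated in disjoint halves.
samples = {f"S{i}" for i in range(1, 11)}
gene_x = {f"S{i}" for i in range(1, 6)}
gene_y = {f"S{i}" for i in range(6, 11)}

p_value, tendency = gene_pair_interaction(gene_x, gene_y, samples)
print(round(p_value, 4), tendency)  # 0.0079 mutual exclusivity
```

With the default thresholds, this hypothetical pair would fall below the lower p-value threshold of 0.01.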
This tool identifies potential driver genes in a given cohort based on positional clustering by OncodriveCLUST. Results are visualized as a scatter plot in which the size of a point represents the number of clusters found in a given gene. Genes satisfying the FDR threshold are coloured in red, with others in blue.
Parameters:
- Cohort: Select a cohort for analysis.
- MinMut: The minimal number of somatic mutations identified in a gene across the given cohort for the gene to be included in the analysis. Default is 5.
- FDR cutoff: The FDR threshold for a gene to be considered statistically significant after multiple-testing correction.
This tool illustrates mutation frequency along protein
structure by a lollipop plot, which could help identify mutation hotspots visually.
Parameters:
- Cohort: Select a cohort for analysis.
- Gene(s): Provide genes (HGNC symbols) to be plotted.
Required.
This tool highlights hyper-mutated genomic regions by a rainfall plot showing inter-variant distance on a linear genomic scale. A rainfall plot is a scatter plot with the location of events on the x-axis and the distance between consecutive events on the y-axis. It is mostly used to illustrate the distribution of somatic mutations along a reference genome, typically to identify events occurring at high frequency over very short distances.
Parameters:
- Cohort: Select a cohort for analysis.
- Tumor Sample Barcode: Specify a sample ID (Tumor_Sample_Barcode) to be analyzed. If not specified, the sample with most mutations in the cohort will be used.
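The quantity plotted on the y-axis can be sketched as follows for the mutations of one sample on one chromosome. This is an illustrative sketch: the function name and positions are hypothetical, and taking the log10 of the distance is an assumption (a common convention for rainfall plots), not something stated above.

```python
from math import log10


def intermutation_distances(positions):
    """Return (position, log10 distance to the previous mutation) pairs.

    positions: genomic coordinates of one sample's mutations on one chromosome.
    Mutations at identical positions are skipped, since log10(0) is undefined.
    """
    ordered = sorted(positions)
    return [(current, log10(current - previous))
            for previous, current in zip(ordered, ordered[1:])
            if current > previous]


# Hypothetical coordinates: three mutations close together, one far away.
points = intermutation_distances([101110, 1000, 1010, 1110])
for position, distance in points:
    print(position, round(distance, 2))
```

Clusters of closely spaced mutations appear as points with low y-values, which is how hyper-mutated regions stand out in the plot.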
This tool shows variant allele frequencies (VAF) of identified somatic mutations in a boxplot, which can help estimate the clonal status of given genes. In the output figure, each point represents a mutation. VAF is a measure of diploid zygosity. Without considering tumor purity and copy number alteration, the VAF of a heterozygous locus will be near 50% and the VAF of a homozygous locus will be near 100%.
Parameters:
- Cohort: Select a cohort for analysis.
- topN: The number of top mutated genes to be included in the analysis. Default is 10. If the parameter "Gene(s)" is provided, this parameter is ignored.
- Gene(s): Provide genes (HGNC symbols) to be analyzed.
Optional.
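As a reminder of how VAF is derived, the sketch below computes it from reference and alternate read counts and summarizes it per gene. The input layout (gene, reference read count, alternate read count), the function names and the numbers are assumptions for illustration; they do not describe how CVCDAP stores or processes mutation calls.

```python
from collections import defaultdict
from statistics import median


def variant_allele_frequency(ref_count, alt_count):
    """VAF = alternate reads / (reference reads + alternate reads)."""
    depth = ref_count + alt_count
    return alt_count / depth if depth else None


def median_vaf_by_gene(mutations):
    """mutations: iterable of (gene, ref_count, alt_count) tuples (assumed)."""
    vafs = defaultdict(list)
    for gene, ref_count, alt_count in mutations:
        vaf = variant_allele_frequency(ref_count, alt_count)
        if vaf is not None:           # skip sites without read support
            vafs[gene].append(vaf)
    return {gene: median(values) for gene, values in vafs.items()}


# Hypothetical read counts.
calls_with_depth = [("TP53", 10, 90), ("TP53", 20, 80),
                    ("KRAS", 50, 50), ("KRAS", 60, 40), ("KRAS", 55, 45)]

summary = median_vaf_by_gene(calls_with_depth)
for gene, value in summary.items():
    print(gene, round(value, 2))
```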
This tool shows overall distribution of six types of mutational conversions including transitions (Ti) and transversions (Tv), as well as
fraction of conversions in each sample.
Parameters:
- Cohort: Select a cohort for analysis.
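The six conversion classes are conventionally reported with a pyrimidine (C or T) as the reference base, so that, for example, G>A is counted as C>T. The following sketch classifies single-nucleotide variants that way and counts transitions and transversions; the function names and the input format (reference base, alternate base) are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
TRANSITIONS = {"C>T", "T>C"}  # the other four classes are transversions


def conversion_class(ref, alt):
    """Map a single-nucleotide change to one of the six pyrimidine-based classes."""
    if ref in ("A", "G"):
        # Report the change on the complementary strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def titv_summary(snvs):
    """Return (class counts, number of transitions, number of transversions)."""
    counts = Counter(conversion_class(ref, alt) for ref, alt in snvs)
    ti = sum(n for cls, n in counts.items() if cls in TRANSITIONS)
    tv = sum(counts.values()) - ti
    return counts, ti, tv


# Hypothetical SNVs given as (reference base, alternate base).
snvs = [("C", "T"), ("G", "A"), ("A", "G"), ("C", "A"), ("G", "T"), ("T", "G")]
counts, ti, tv = titv_summary(snvs)
print(dict(counts), ti, tv)
```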
This tool plots a concise sample-level mutational overview for a given set of
genes.
Parameters:
- Cohort: Select a cohort for analysis.
- topN: The number of top mutated genes to be included in the oncoplot. Default is 5. If the parameter "Genes" is provided, this parameter is ignored.
- Genes: Provide a set of genes to be included in the oncoplot. At least two genes (HGNC symbols) are needed. Optional.
- Font size (Title): Font size for title. Default is 15.
- Font size (Legend): Font size for legend. Default is 12.
This tool plots potential druggable mutated genes
in categories, using gene druggability information from Drug-Gene
Interaction database.
Parameters:
- Cohort: Select a cohort for analysis.
- topN: The number of top mutated genes to be included in the analysis. Default is 20. If the parameter "Genes" is provided, this parameter is ignored.
- Genes: Provide genes (HGNC symbols) to be analyzed.
Optional.
This tool illustrates gene mutation frequency in a cohort by a Word Cloud.
Size of each gene is proportional to the number of samples harbouring somatic mutations in this gene in the given cohort.
Parameters:
- Cohort: Select a cohort for analysis.
- minMutSamples: The minimal number of samples in which a gene must be mutated in the given cohort to be plotted. Default is 3.
- topN: The number of top mutated genes that will be
plotted. Optional.
3.2 mRNA analysis toolbox
This tool performs principal component analysis (PCA), a dimensionality reduction and visualization method, and generates a 2D plot positioning each sample with respect to the first two principal components. It can help to find patterns without reference to prior knowledge and to spot outlier samples.
Parameters:
- Cohort: Select a cohort for analysis.
- Genes: Provide a set of genes for PCA analysis. If not
provided, all genes will be used for analysis.
- Label group: Each point (sample) will be colored according to the group the sample belongs to.
- Ellipse: Indicate if an ellipse will be plotted for each group of samples. Default is TRUE.
This tool performs T-distributed Stochastic Neighbor
Embedding (t-SNE) for visualization, another popular dimensionality reduction and visualization method.
Parameters:
- Cohort: Select a cohort for analysis.
- Genes: Provide a set of genes for t-SNE analysis. If not
provided, all genes will be used.
- Label group: Each point (sample) will be colored according to the group the sample belongs to.
- Size (Point): Size of the points.
This tool performs both K-means clustering and hierarchical clustering analysis for a given cohort using gene expression profiles, with cluster groups visualized using the t-SNE method.
Parameters:
- Cohort: Select a cohort for analysis.
- Genes: Provide a set of genes for clustering analysis. If
not provided, all genes will be used.
- k cluster: The pre-defined number of clusters to be identified.
3.3 Protein analysis toolbox
This tool performs principal component analysis (PCA) for a given list of proteins, and generates a 2D plot positioning each sample with respect to the first two principal components. It can help to find patterns without reference to prior knowledge and to spot outlier samples.
Parameters:
- Cohort: Select a cohort for analysis.
- Genes: Provide a set of proteins for PCA analysis. If not
provided, all proteins will be used.
- Label group: Each point (sample) will be colored according to the group the sample belongs to.
- Ellipse: Indicate if an ellipse will be plotted for each group of samples. Default is TRUE.
This tool performs T-distributed Stochastic Neighbor
Embedding (t-SNE) for visualization, another popular dimensionality reduction and visualization method.
Parameters:
- Cohort: Select a cohort for analysis.
- Genes: Provide a set of proteins for t-SNE analysis. If
not provided, all proteins will be used.
- Label group: Each point (sample) will be colored according to the group the sample belongs to.
- Size (Point): Size of the points. Default is 3.
This tool performs both K-means clustering and hierarchical clustering analysis for a given cohort based on relative protein abundances, with cluster groups visualized using the t-SNE method.
Parameters:
- Cohort: Select a cohort for analysis.
- Genes: Provide a set of proteins for clustering analysis.
If not provided, all proteins will be used.
- k cluster: The pre-defined number of clusters for k-means analysis.
3.4 Clinical Analysis Toolbox
A given cohort will be stratified into two groups of
patients by a provided molecular feature or clinical parameter, and
Kaplan-Meier curves will be plotted and survival difference (OS,
DSS, PFI, DFI) will be evaluated by Log-Rank test.
Parameters:
- Cohort: Select a cohort for analysis.
- Grouped by: Patients will be stratified into different
groups based on the selected clinical or molecular feature.
- Gene: Provide a gene symbol.
- Cutoff: Samples with an expression level higher than the given cutoff form the high-expression group, while the others form the low-expression group.
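The Kaplan-Meier estimate behind each curve can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the function names, the record layout (`time`, `event`, `expr`) and the data are hypothetical, and the Log-Rank test itself is not shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one event.

    times:  follow-up time of each patient
    events: 1 if the endpoint event was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        observed = removed = 0
        # Collect every patient whose follow-up ends at this time point.
        while i < len(data) and data[i][0] == current_time:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((current_time, survival))
        at_risk -= removed
    return curve


def split_by_cutoff(patients, cutoff):
    """Stratify patients into high- and low-expression groups."""
    high = [p for p in patients if p["expr"] > cutoff]
    low = [p for p in patients if p["expr"] <= cutoff]
    return high, low


# Hypothetical patients: follow-up time in months, event flag, expression level.
patients = [
    {"time": 5, "event": 1, "expr": 8.1}, {"time": 8, "event": 0, "expr": 7.9},
    {"time": 12, "event": 1, "expr": 9.4}, {"time": 12, "event": 1, "expr": 2.0},
    {"time": 20, "event": 0, "expr": 1.5},
]
high, low = split_by_cutoff(patients, cutoff=5.0)
for label, group in (("high", high), ("low", low)):
    curve = kaplan_meier([p["time"] for p in group], [p["event"] for p in group])
    print(label, [(t, round(s, 3)) for t, s in curve])
```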
This tool performs multivariate survival analysis to evaluate the effect of multiple variables on survival simultaneously while controlling for confounding variables. The results are visualized as a forest plot.
Parameters:
- Cohort: Select a cohort for analysis.
- Endpoints: Select a type of survival outcome to be investigated.
OS (overall survival) is the period from the date of diagnosis until the date of death from any cause.
DSS (disease-specific survival) is the period from the date of initial diagnosis until the date of death from the disease.
PFI (progression-free interval) is the period from the date of diagnosis until the date of the first occurrence of a new tumor event.
DFI (disease-free interval) is the period from the date of diagnosis until the date of the first new tumor progression event subsequent to the determination of a patient’s disease-free status after their initial diagnosis and treatment.
- Clinical variable: Select one or multiple clinical variables as confounding factor(s). Optional.
- TMB type: Indicate whether only nonsilent mutations or all
mutations will be used in calculating tumor mutational burden (TMB).
- Gene: Input a gene symbol.
- Cutoff: Samples with an expression level higher than the given cutoff form the high-expression group, while the others form the low-expression group.
4. Two Cohort Analysis
4.1 Genomic Analysis Toolbox
This tool identifies differentially mutated genes between
two cohorts with results visualized as forest plot.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 1 label: Display name in the output. Optional.
- Cohort 2: Select the second cohort for analysis.
- Cohort 2 label: Display name in the output. Optional.
- minMutSamples: The minimal number of samples harbouring mutations in a specific gene in either cohort. Default is 5.
- P-value: P-value threshold. Default is 0.05. Note: if the parameter "FDR" is provided, this parameter is ignored.
- FDR: FDR threshold. If provided, adjusted P-values will be used. Optional.
- Color 1: Select a color for the output of Cohort 1.
Optional.
- Color 2: Select a color for the output of Cohort 2.
Optional.
- Font size (Gene Name): Font size for gene symbols in the
plot. Default is 1.2
This tool generates oncoplots for two cohorts and displays them side by side for better comparison of the mutational landscapes of the two cohorts.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 1 label: Display name in the output. Optional.
- Cohort 2: Select the second cohort for analysis.
- Cohort 2 label: Display name in the output. Optional.
- Genes: Genes (HGNC symbols) to be plotted. Default is the top 5 mutated genes from each cohort.
- Font size (Gene Name): Font size for gene names. Default
is 10.
- Font size (Legend): Font size for legend. Default is 10.
- Font size (Title) : Font size for title. Default is 10.
- Height: Height of the graphics region in inches. Default is 7.
This tool illustrates mutation frequency along
protein structure of a given gene by a customized lollipop plot for better
comparison between two cohorts.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 1 label: Display name in the output. Optional.
- Cohort 2: Select the second Cohort for analysis.
- Cohort 2 label: Display name in the output. Optional.
- Gene: Input a gene (HGNC symbols) to be plotted.
Required.
4.2 mRNA Analysis Toolbox
This tool identifies Differentially Expressed Genes (DEGs) between two cohorts using limma, and the results are output as a table.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 2: Select the second cohort for analysis.
- FDR: The cutoff of adjusted p-values. Default is 0.05.
- log2FC: The cutoff of log2 (fold change). Default is 1,
i.e. 2 fold change.
Explanation of output table (limma result):
- logFC: log2 fold change
- AveExpr: average log2 expression level
- t: moderated t-statistic
- FDR: p-value adjusted by the Benjamini and Hochberg method to control the false discovery rate
- B: B statistic, indicating log-odds that the gene is differentially expressed
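The FDR column can be understood through the Benjamini and Hochberg adjustment, sketched below in plain Python. This only illustrates the adjustment step with made-up p-values; it is not limma, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
print([round(value, 3) for value in fdr])  # [0.02, 0.04, 0.04, 0.02]
```

With an FDR cutoff of 0.05, all four hypothetical genes would pass; with a cutoff of 0.03, only the first and last would.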
This tool identifies Differentially Expressed Genes (DEGs) between two cohorts, and
visualizes results as a heatmap with colors proportional to gene
expression levels.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 2: Select the second cohort for analysis.
- FDR: The cutoff of adjusted p-values. Default is 0.05.
- log2FC: The cutoff of log2 (fold change). Default is 1,
i.e. 2 fold change.
- kmeans_k: The number of pre-defined clusters for
k-means. If provided, rows will be aggregated in the heatmap; otherwise, the rows will not be aggregated. Optional.
- clusterRows: Indicate if rows will be clustered.
- clusterColumns: Indicate if columns will be clustered.
- showColNames: Indicate if column names will be shown.
- Title: The title of the plot.
- Fontsize: Base font size for the plot.
This tool draws a volcano plot to enable quick visual
identification of genes with statistical significance and
magnitude of expression change.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 2: Select the second cohort for analysis.
- FDR: The cutoff of adjusted p-values. A horizontal line will be drawn at -log10(FDR). Default is 0.05.
- log2FC: The cutoff of log2 (fold change). Vertical lines
will be drawn at the negative and positive values of log2FC.
Default is 1, i.e. 2 fold change.
- highlightedGenes: Provide gene(s) that will be highlighted in the plot if satisfying both given
FDR and log2FC criteria. Optional.
- Draw Connectors: Indicate whether to fit labels onto the plot and connect them to their respective points by lines. Default is FALSE.
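The way the two cutoffs partition a volcano plot can be sketched as follows. The function name, the gene names and the values are hypothetical, and whether the platform treats the cutoffs as strict or inclusive is not stated above, so the comparisons used here are an assumption.

```python
from math import log10


def classify_for_volcano(results, fdr_cutoff=0.05, log2fc_cutoff=1.0):
    """Return (gene, x, y, status) tuples for plotting.

    results: iterable of (gene, log2 fold change, FDR) tuples (assumed format).
    x is the log2 fold change and y is -log10(FDR).
    """
    points = []
    for gene, log2fc, fdr in results:
        if fdr < fdr_cutoff and log2fc >= log2fc_cutoff:
            status = "up"
        elif fdr < fdr_cutoff and log2fc <= -log2fc_cutoff:
            status = "down"
        else:
            status = "not significant"
        points.append((gene, log2fc, -log10(fdr), status))
    return points


# Hypothetical differential expression results.
results = [("GENE_A", 2.5, 0.001), ("GENE_B", -1.8, 0.01),
           ("GENE_C", 0.4, 0.0001), ("GENE_D", 3.0, 0.2)]

volcano = classify_for_volcano(results)
for gene, x, y, status in volcano:
    print(gene, x, round(y, 2), status)

# Position of the horizontal threshold line for the default FDR cutoff.
print(round(-log10(0.05), 2))  # 1.3
```

A log2FC cutoff of 1 corresponds to a 2-fold change, so GENE_C is highly significant but falls between the two vertical lines and is not highlighted.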
This tool performs Gene Set Enrichment Analysis (GSEA) to determine whether an a priori defined set of genes, relating to molecular mechanisms and biological processes, shows statistically significant and concordant differences between two cohorts. Down-regulated (NES < 0) and up-regulated (NES > 0) pathways in cohort 1 will be visualized as two barplots side by side (NES: normalized enrichment score).
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 1 label: Display name in the output. Optional.
- Cohort 2: Select the second Cohort for analysis.
- Cohort 2 label: Display name in the output. Optional.
- Gene sets: Select a collection of gene sets from the
Molecular Signature Database (MSigDb).
- Metric: Select the metric that GSEA will use to calculate differential gene expression with respect to the two phenotypes. Default is Ratio_of_Classes.
- Median: Indicate if the median of each class will be used instead of the mean for the class separation metrics. Default is FALSE.
This tool draws boxplots with jitter to compare the expression level of a query gene between two cohorts. The statistical significance of the difference is evaluated by t-test or Wilcoxon rank sum test.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 1 label: Display name in the output. Optional.
- Cohort 2: Select the second cohort for analysis.
- Cohort 2 label: Display name in the output. Optional.
- Gene: Provide a gene symbol. Required.
4.3 Protein Analysis Toolbox
This tool identifies Differentially Expressed Proteins (DEPs) between two cohorts using limma, and the results are output as a table.
Parameters:
- Cohort 1: Select the first Cohort for analysis.
- Cohort 2: Select the second Cohort for analysis.
- FDR: The cutoff of adjusted p-values. Default is 0.05.
- log2FC: The cutoff of log2 (fold change). Default is 1,
i.e. 2 fold change.
Explanation of output table (limma result):
- logFC: log2 fold change
- AveExpr: average log2 expression level
- t: moderated t-statistic
- FDR: p-value adjusted by the Benjamini and Hochberg method to control the false discovery rate
- B: B statistic, indicating log-odds that the protein is differentially expressed
This tool identifies Differentially Expressed Proteins (DEPs) between two cohorts, and visualizes the results as a heatmap with colors proportional to relative protein abundance levels.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 2: Select the second cohort for analysis.
- FDR: The cutoff of adjusted p-values. Default is 0.05.
- log2FC: The cutoff of log2 (fold change). Default is 1,
i.e. 2 fold change.
- kmeans_k: The number of pre-defined clusters for
k-means. If provided, rows will be aggregated in the heatmap; otherwise, the rows will not be aggregated. Optional.
- clusterRows: Indicate if rows will be clustered.
- clusterColumns: Indicate if columns will be clustered.
- showColNames: Indicate if column names will be shown.
- Title: The title of the plot.
- Fontsize: Base font size for the plot.
This tool draws a volcano plot to enable quick visual
identification of proteins with statistical significance and magnitude of protein abundance change.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 2: Select the second cohort for analysis.
- FDR: The cutoff of adjusted p-values. A horizontal line will be drawn at -log10(FDR). Default is 0.05.
- log2FC: The cutoff of log2(fold change). Vertical lines
will be drawn at the negative and positive values of log2FC.
Default is 1, i.e. 2 fold change.
- highlightedGenes: Provide protein(s) that will be highlighted in the plot if satisfying both given
FDR and log2FC criteria. Optional.
- Draw Connectors: Fit labels onto plot and connect to
their respective points by lines (TRUE/FALSE). Default is FALSE.
This tool performs Gene Set Enrichment Analysis (GSEA) to determine whether an a priori defined set of proteins, relating to molecular mechanisms and biological processes, shows statistically significant and concordant differences between two cohorts. Down-regulated (NES < 0) and up-regulated (NES > 0) pathways in cohort 1 will be visualized as two barplots side by side (NES: normalized enrichment score).
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 1 label: Display name in the output. Optional.
- Cohort 2: Select the second Cohort for analysis.
- Cohort 2 label: Display name in the output. Optional.
- Gene sets: Select a collection of gene sets from the
Molecular Signature Database (MSigDb).
- Metric: Select the metric that GSEA will use to
calculate protein differential expression with respect to the
two phenotypes. Default is Ratio_of_Classes.
- Median: Indicate if the median of each class will be used instead of the mean for the class separation metrics. Default is FALSE.
This tool draws boxplots with jitter to compare the relative abundance of a query protein between two cohorts. The statistical significance of the difference is evaluated by t-test or Wilcoxon rank sum test.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 1 label: Display name in the output. Optional.
- Cohort 2: Select the second Cohort for analysis.
- Cohort 2 label: Display name in the output. Optional.
- Gene: Provide a gene symbol. Required.
4.4 Clinical Analysis Toolbox
This tool generates Kaplan-Meier curves illustrating the survival difference (OS, DSS, PFI, DFI) between two cohorts, which can be further stratified by a molecular feature or clinical parameter. The Log-Rank test is applied to evaluate statistical significance.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 1 label: Display name in the output. Optional.
- Cohort 2: Select the second cohort for analysis.
- Cohort 2 label: Display name in the output. Optional.
- Grouped by: Patients can be stratified into different groups using the selected clinical or molecular feature. If "cohort" is selected, the two cohorts will be compared directly without stratification.
- Gene: Provide a gene symbol.
- Cutoff: Samples with an expression level higher than the given cutoff form the high-expression group, while the others form the low-expression group.
This tool performs multivariate survival analysis to evaluate the effect of multiple variables on survival simultaneously while controlling for confounding variables. The results are visualized as a forest plot.
Parameters:
- Cohort 1: Select the first cohort for analysis.
- Cohort 1 label: Display name in the output. Optional.
- Cohort 2: Select the second cohort for analysis.
- Cohort 2 label: Display name in the output. Optional.
- Endpoints: Select a type of survival outcome to be investigated.
OS (overall survival) is the period from the date of diagnosis until the date of death from any cause.
DSS (disease-specific survival) is the period from the date of initial diagnosis until the date of death from the disease.
PFI (progression-free interval) is the period from the date of diagnosis until the date of the first occurrence of a new tumor event.
DFI (disease-free interval) is the period from the date of diagnosis until the date of the first new tumor progression event subsequent to the determination of a patient’s disease-free status after their initial diagnosis and treatment.
- Clinical variable: Select one or multiple clinical variables as confounding factor(s). Optional.
- TMB type: Select whether only nonsilent mutations or all
mutations will be used in calculating Tumor Mutational Burden.
- Gene: Provide a gene symbol.
- Cutoff: Samples with an expression level higher than the given cutoff form the high-expression group, while the others form the low-expression group.
5. Analysis History
6. Saved Cohorts
Please note: User-created cohorts are preserved permanently for registered users, but only for 30 days for guest users.
7. Upload Your Project
You can create your own Project by uploading somatic mutations and clinical data here, which will be seamlessly integrated with existing public projects in CVCDAP for cohort query and analysis.
Note: If you would like public datasets/studies with a large amount of multi-omics data to be made available in CVCDAP, do not hesitate to let us know and we will prioritize them for integration in a future release.
Upload Your Project
Manage Your Project
8. Browser Compatibility
Operating System | Version | Chrome | Firefox | Microsoft Edge | Safari
Linux | CentOS 7.2 | not tested | 38.3.0 | n/a | n/a
MacOS | Catalina | 80.0.3987.122 | 73.0.1 | n/a | 13.0.5
Windows | 10 | 80.0.3987.122 | 73.0.1 | 42.17134.1098.0 | n/a
The website runs best in Chrome version 80.0.3987.122.