1. Introduction
CVCDAP stands for Cancer Virtual Cohort Discovery Analysis
Platform. It is an open-access data portal that hosts a large variety of
cancer multi-omics data, including both public and self-produced
data. CVCDAP provides convenient data visualization and
computational tools for the analysis of customizable cohorts. All
data and informatics tools are freely available to the
wider community of cancer researchers.
2. Data source
We downloaded the clinical, somatic mutation, and mRNA data of TCGA from the
NCI Genomic Data Commons (GDC) Data Portal and the proteome data from
CPTAC. The datasets from the GDC Data Portal
represent the most uniform attempt to systematically provide multi-omics
data for the TCGA tumors used in pan-cancer analysis. We retained only primary tumors.
Clinical Data: Obtained from
https://api.gdc.cancer.gov/data/1b5f413e-a8d1-4d10-92eb-7c4ae739ed81.
We retained attributes such as age, gender, race, stage, and grade, mapped the project ID, disease type, and primary site,
and calculated the tumor mutation burden (TMB) as TMB_nonsyn and TMB_total.
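The exact TMB formula is not given here. As a minimal sketch, assuming TMB is reported as mutations per megabase, that TMB_nonsyn counts the variant classes commonly treated as nonsynonymous in MAF files, and that the exome size is a user-supplied constant (the 38 Mb default below is an assumption, not a CVCDAP setting), the two values could be derived as follows:

```python
# Variant classes counted as nonsynonymous. This set follows a common MAF
# convention and is an assumption, not the documented CVCDAP definition.
NONSYN_CLASSES = {
    "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
    "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
    "Splice_Site", "Translation_Start_Site",
}
EXOME_SIZE_MB = 38.0  # assumed capture size in megabases


def tumor_mutation_burden(variant_classes, exome_size_mb=EXOME_SIZE_MB):
    """Return (TMB_total, TMB_nonsyn) in mutations per megabase for one sample."""
    total = len(variant_classes)
    nonsyn = sum(1 for v in variant_classes if v in NONSYN_CLASSES)
    return total / exome_size_mb, nonsyn / exome_size_mb


# Toy sample with four mutation calls and a 2 Mb target for easy arithmetic.
calls = ["Missense_Mutation", "Silent", "Nonsense_Mutation", "Silent"]
tmb_total, tmb_nonsyn = tumor_mutation_burden(calls, exome_size_mb=2.0)
print(tmb_total, tmb_nonsyn)
```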
Mutation: Obtained from http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc,
which includes the mutation calls for each sample.
RNAseq: Obtained from http://api.gdc.cancer.gov/data/3586c0da-64d0-4b74-a449-5ff4d9136611.
We first corrected the gene symbols and imputed missing values separately for each disease type, then merged the data, converted FPKM to TPM,
and stored the log2-transformed TPM expression of each gene per sample in the database.
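The FPKM-to-TPM step can be illustrated with the standard rescaling, in which each gene's FPKM is divided by the sample's FPKM total and multiplied by one million. The sketch below is not the portal's own code; the function name and the pseudocount of 1 added before the log2 transform are assumptions.

```python
import math


def fpkm_to_log2_tpm(fpkm_by_gene, pseudocount=1.0):
    """Convert one sample's FPKM values to log2(TPM + pseudocount).

    TPM_i = FPKM_i / sum(FPKM) * 1e6. The pseudocount is an assumption
    that keeps zero-expression genes finite after the log transform.
    """
    total = sum(fpkm_by_gene.values())
    if total <= 0:
        raise ValueError("sample has no expression signal")
    return {
        gene: math.log2(value / total * 1e6 + pseudocount)
        for gene, value in fpkm_by_gene.items()
    }


# Toy sample with three genes (values are illustrative only).
sample = {"TP53": 10.0, "EGFR": 30.0, "KRAS": 60.0}
log2_tpm = fpkm_to_log2_tpm(sample)
```

Because TPM values of a sample always sum to one million, expression becomes comparable across samples before the studies are merged.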
Proteomics: The database currently includes only data produced by iTRAQ4 and TMT10.
After removing proteins with less than 50% completeness, we imputed missing values separately for each study,
then merged the studies, retained only the proteins shared by all of them, and performed quantile normalization.
The normalized protein expression of each gene per sample is stored in the database.
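Two of these steps, the completeness filter and the quantile normalization, can be sketched in a few lines. This is an illustration under stated assumptions rather than the pipeline itself: missing values are represented as None, the function names are hypothetical, and ties are broken by position instead of being averaged.

```python
def filter_by_completeness(matrix, min_fraction=0.5):
    """Keep proteins quantified in at least min_fraction of samples.

    matrix maps protein -> list of values across samples (None = missing).
    """
    kept = {}
    for protein, values in matrix.items():
        observed = sum(v is not None for v in values)
        if observed / len(values) >= min_fraction:
            kept[protein] = values
    return kept


def quantile_normalize(columns):
    """Quantile-normalize complete (already imputed) sample columns.

    Each value is replaced by the mean, across samples, of the values
    sharing its rank. Ties are broken by position (a simplification).
    """
    n_rows = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[i] for col in sorted_cols) / len(columns)
                  for i in range(n_rows)]
    normalized = []
    for col in columns:
        order = sorted(range(n_rows), key=lambda i: col[i])
        out = [0.0] * n_rows
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two samples (columns) measured over three proteins (rows).
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization every sample has the same distribution of values, which makes samples from different studies directly comparable.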
3. Define Virtual Cohorts
A virtual cohort is a custom study composed of samples
from one or more existing studies. CVCDAP allows users to define
a custom cohort of samples that fit user-specified tissue, genomic,
or clinical criteria of interest. These samples can be a subset
of the data available in an existing study, or result from the
combination of multiple existing studies. This cohort of samples
can then be queried or explored just like a traditional study.
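Conceptually, defining a virtual cohort amounts to filtering a pooled sample table by a set of criteria. The following sketch shows the idea only; the field names, sample IDs, and the define_virtual_cohort function are hypothetical and do not describe the CVCDAP interface.

```python
def define_virtual_cohort(samples, **criteria):
    """Return the IDs of samples whose attributes match every criterion.

    A criterion value may be a single value or a set of allowed values.
    """
    selected = []
    for sample in samples:
        for field, wanted in criteria.items():
            allowed = wanted if isinstance(wanted, (set, frozenset)) else {wanted}
            if sample.get(field) not in allowed:
                break
        else:
            # Reached only when no criterion failed for this sample.
            selected.append(sample["sample_id"])
    return selected


# Hypothetical sample records pooled from three existing studies.
samples = [
    {"sample_id": "S1", "study": "TCGA-LUAD", "primary_site": "Lung", "stage": "I"},
    {"sample_id": "S2", "study": "TCGA-LUSC", "primary_site": "Lung", "stage": "III"},
    {"sample_id": "S3", "study": "TCGA-BRCA", "primary_site": "Breast", "stage": "I"},
]
lung = define_virtual_cohort(samples, primary_site="Lung")
early_lung = define_virtual_cohort(samples, primary_site="Lung", stage={"I", "II"})
```

Here the lung cohort combines samples from two studies, while the early-stage lung cohort is a subset of a single study.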
4. Single Cohort Analysis
4.1 Genomic analyses toolbox
This tool allows users to plot a word cloud of mutated genes.
Parameters:
- Cohort: Select a cohort
- minMut: Minimum number of samples in which a gene is
required to be mutated. Default 3.
- top: Plot only the top n mutated genes.
Default NULL.
This tool allows users to view an overview of the genetic alterations
per sample in each gene as a heatmap.
Parameters:
- Cohort: Select a cohort
- top: Number of top genes to draw. Default 20.
- genes: Draw the oncoplot only for these genes. At least two
genes are required. Default NULL.
- titleFontSize: Font size for the title. Default 15.
- legendFontSize: Font size for the legend. Default 12.
This tool allows users to visualize mutations of any set of
genes.
Parameters:
- Cohort: Select a cohort
- top: Number of top genes to draw. Default 5.
- genes: Draw the oncoplot only for these genes. At least two
genes are required. Default NULL.
- titleFontSize: Font size for the title. Default 15.
- legendFontSize: Font size for the legend. Default 12.
This tool allows users to draw three boxplots, which show the
proportion of transitions and transversions, the overall distribution
of the six different conversions, and the fraction of conversions in each
sample.
Parameters:
- Cohort: Select a cohort
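The six conversion classes arise from collapsing each single-nucleotide variant onto its pyrimidine reference base (C or T), so that, for example, G>A is counted as C>T. A minimal sketch of this classification is shown below; the function names are illustrative and are not part of the tool.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
TRANSITIONS = {"C>T", "T>C"}  # the remaining four classes are transversions


def conversion_class(ref, alt):
    """Collapse an SNV onto its pyrimidine reference: C>A, C>G, C>T, T>A, T>C or T>G."""
    if ref in ("A", "G"):
        # Purine reference: report the change on the complementary strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def titv_fractions(snvs):
    """Return transition/transversion fractions and per-class counts for one sample."""
    counts = {}
    for ref, alt in snvs:
        cls = conversion_class(ref, alt)
        counts[cls] = counts.get(cls, 0) + 1
    total = sum(counts.values())
    ti = sum(n for cls, n in counts.items() if cls in TRANSITIONS)
    return {"Ti": ti / total, "Tv": (total - ti) / total, "classes": counts}


# Four toy SNVs given as (reference allele, alternate allele).
summary = titv_fractions([("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")])
```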
This tool allows users to plot variant allele frequencies (VAF) as a boxplot, which
helps to quickly estimate the clonal status of the top mutated genes (clonal genes usually
have a mean allele frequency of around 50%, assuming a pure sample).
Parameters:
- Cohort: Select a cohort
- Genes: Specify the genes for which plots will be
generated.
- Top: If Genes is NULL, plots the top n genes.
Default 5.
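For reference, the variant allele frequency of a mutation is the fraction of reads supporting the alternate allele. The sketch below assumes the read counts are available as the t_ref_count and t_alt_count fields of a MAF file; whether the tool derives VAF from exactly these fields is an assumption.

```python
from statistics import mean


def variant_allele_frequency(t_ref_count, t_alt_count):
    """VAF = alternate reads / (reference reads + alternate reads)."""
    depth = t_ref_count + t_alt_count
    if depth == 0:
        raise ValueError("no reads cover this site")
    return t_alt_count / depth


def mean_vaf_by_gene(variants):
    """variants: iterable of (gene, t_ref_count, t_alt_count) tuples."""
    per_gene = {}
    for gene, ref_reads, alt_reads in variants:
        per_gene.setdefault(gene, []).append(
            variant_allele_frequency(ref_reads, alt_reads))
    return {gene: mean(values) for gene, values in per_gene.items()}


# Toy read counts: TP53 looks clonal (mean VAF near 0.5), KRAS subclonal.
variants = [("TP53", 50, 50), ("TP53", 60, 40), ("KRAS", 90, 10)]
mean_vaf = mean_vaf_by_gene(variants)
```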
This tool allows users to draw a lollipop plot, a hybrid
between a scatter plot and a barplot, which shows mutation spots
on the protein structure.
Parameters:
- Cohort: Select a cohort
- Gene: HGNC symbol of the gene whose protein structure is to be
drawn.
This tool allows users to draw a rainfall plot, which can
be seen as a scatter plot showing the location of events on the
x-axis versus the distance between consecutive events on the
y-axis.
Parameters:
- Cohort: Select a cohort
- Tumor Sample Barcode: Specify the sample names
(Tumor_Sample_Barcodes) for which the plot is to be drawn. If
NULL, the plot is drawn for the most mutated sample.
- DetectChangePoints: If TRUE, detects genomic change
points where potential kataegis are formed. Results are written
to a tab-delimited output file.
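The y-axis values of a rainfall plot are the distances between consecutive mutations on the same chromosome. A minimal sketch of that computation, with a hypothetical function name and toy coordinates, is:

```python
from collections import defaultdict


def intermutation_distances(mutations):
    """Return (chromosome, position, distance_to_previous) tuples.

    mutations is an iterable of (chromosome, position) pairs for one sample.
    The first mutation on each chromosome has no predecessor and is skipped.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in mutations:
        by_chrom[chrom].append(pos)
    points = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for prev, cur in zip(positions, positions[1:]):
            points.append((chrom, cur, cur - prev))
    return points


# Toy mutation calls for a single sample.
calls = [("chr1", 100), ("chr1", 150), ("chr2", 10), ("chr1", 1150)]
points = intermutation_distances(calls)
```

Runs of unusually small distances indicate localized hypermutation, which is what kataegis detection looks for.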
This tool allows users to identify cancer driver genes.
Parameters:
- Cohort: Select a cohort
- MinMut: Minimum number of mutations required for a gene
to be included in the analysis. Default 5.
- FDR Cutoff: FDR cutoff for calling a gene a driver.
This tool allows users to determine the contribution of
known mutational processes.
Parameters:
- Cohort: Select a cohort
- SignaturesRef: Either a data frame or the location of a
signature text file, where rows are signatures and columns are
trinucleotide contexts. Set to either "signatures.nature2013"
or "signatures.cosmic".
- TriCountsMethod: Set to one of the following:
  'default': no further normalization.
  'exome': normalized by the number of times each
  trinucleotide context is observed in the exome.
  'genome': normalized by the number of times each
  trinucleotide context is observed in the genome.
  'exome2genome': multiplied by the ratio of that
  trinucleotide's occurrence in the genome to its occurrence
  in the exome.
  'genome2exome': multiplied by the ratio of that
  trinucleotide's occurrence in the exome to its occurrence
  in the genome.
  A data frame containing user-defined scaling factors: the
  count data for each trinucleotide context is multiplied by the
  corresponding value given in the data frame.
- DetectChangePoints: If TRUE, detects genomic change
points where potential kataegis are formed. Results are written
to a tab-delimited output file.
This tool allows users to plot potentially druggable gene
categories as a boxplot, based on drug–gene interactions
and gene druggability information compiled from the Drug Gene
Interaction database.
Parameters:
- Cohort: Select a cohort
- Top: Number of top genes to check. Default 20.
- Genes: Manually specify a gene list.
This tool allows users to determine whether query genes are
mutually exclusively altered by performing pair-wise Fisher’s exact
tests.
Parameters:
- Cohort: Select a cohort
- Top: Check for interactions among the top n
genes. Default 25.
- Genes: List of genes among which interactions should be
tested. At least 5 genes are required. If not provided, the test is
performed on the top 25 genes.
- Upper P-value: Upper p-value threshold. Default 0.05.
- Lower P-value: Lower p-value threshold. Default 0.01.
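For a single gene pair, the test operates on a 2x2 table that counts samples by the mutation status of both genes. The following self-contained sketch builds that table and computes a two-sided Fisher's exact p-value from the hypergeometric distribution. It illustrates the statistic only; the function names and sample IDs are hypothetical, and the tool's actual implementation may differ.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-7))


def co_mutation_table(mutated_a, mutated_b, all_samples):
    """Return (both, only A, only B, neither) sample counts for two genes."""
    both = len(mutated_a & mutated_b)
    only_a = len(mutated_a - mutated_b)
    only_b = len(mutated_b - mutated_a)
    neither = len(all_samples) - both - only_a - only_b
    return both, only_a, only_b, neither


# Toy cohort of six samples in which the two genes never co-occur.
cohort = {"S1", "S2", "S3", "S4", "S5", "S6"}
table = co_mutation_table({"S1", "S2", "S3"}, {"S4", "S5", "S6"}, cohort)
p_value = fisher_exact_two_sided(*table)
```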
4.2 mRNA analysis toolbox
This tool allows users to conduct Principal Component
Analysis (PCA) to better visualize the variation present in a
cohort with a given gene list.
Parameters:
- Cohort: Select a cohort
- Gene List: Input a gene list
- Label: Optional clinical variable indicating the groups
that the samples belong to. If provided the points will be
colored according to groups.
- Circle: Draw a correlation circle? Default TRUE.
This tool allows users to perform t-distributed Stochastic
Neighbor Embedding (t-SNE) for visualization in a low-dimensional
space of two dimensions in a cohort with a given gene list.
Parameters:
- Cohort: Select a cohort
- Gene List: Input a gene list
- Label: Optional clinical variable indicating the groups
that the samples belong to. If provided the points will be
colored according to groups.
- Label Point Size: Size of the points used for the
labels.
This tool allows users to perform k-means clustering or
hierarchical cluster analysis on a given cohort.
Parameters:
- Cohort: Select a cohort
- Gene List: Input a gene list.
- k: The number of clusters.
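To make the role of k concrete, here is a minimal k-means sketch (Lloyd's algorithm) over sample expression profiles. It is an illustration, not the tool's implementation: starting the centroids at the first k points is an assumption made for reproducibility, and hierarchical clustering is not shown.

```python
def kmeans(points, k, iterations=100):
    """Cluster points (tuples of equal length) into k groups.

    Centroids start at the first k points (an assumption; real tools
    usually use random or k-means++ initialization).
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Four toy samples described by the expression of two genes.
profiles = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(profiles, k=2)
```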
4.3 Protein analysis toolbox
This tool allows users to conduct Principal Component
Analysis (PCA) to better visualize the variation present in a
cohort with a given gene list.
Parameters:
- Cohort: Select a cohort
- Gene List: Input a gene list
- Label: Optional clinical variable indicating the groups
that the samples belong to. If provided the points will be
colored according to groups.
- Circle: Draw a correlation circle? Default TRUE.
This tool allows users to perform t-distributed Stochastic
Neighbor Embedding (t-SNE) for visualization in a low-dimensional
space of two dimensions in a cohort with a given gene list.
Parameters:
- Cohort: Select a cohort
- Gene List: Input a gene list
- Label: Optional clinical variable indicating the groups
that the samples belong to. If provided the points will be
colored according to groups.
- Label Point Size: Size of the points used for the
labels.
This tool allows users to perform k-means clustering or
hierarchical cluster analysis on a given cohort.
Parameters:
- Cohort: Select a cohort
- Gene List: Input a gene list.
- k: The number of clusters.
4.4 Clinical Analysis Toolbox
This tool allows users to compare the survival time (OS, DSS,
PFI, DFI) of patients divided into two subgroups by a clinical
characteristic such as stage, by the mutation status of a gene, or by
gene expression. The survival of the two subgroups is
compared using the log-rank test.
Parameters:
- Cohort: Select a cohort
- Gene: Input a gene symbol
- Cutoff: Samples with an expression level higher than this
threshold are considered the high-expression cohort, others
the low-expression cohort.
This tool allows users to evaluate simultaneously the effect of several factors on survival and visualize the result as a forest plot.
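The expression cutoff and the log-rank comparison can be sketched as follows. This is a textbook two-group log-rank test written for illustration, with hypothetical function names; it is not taken from the portal's code, and its p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def split_by_expression(expression, cutoff):
    """Split sample IDs into (high, low) groups; high means value > cutoff."""
    high = [s for s, v in expression.items() if v > cutoff]
    low = [s for s, v in expression.items() if v <= cutoff]
    return high, low


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; events are 1 (observed) or 0 (censored).

    Returns (chi_square, p_value).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = [row for row in data if row[0] >= t]
        n = len(at_risk)
        n_a = sum(1 for row in at_risk if row[2] == 0)
        d = sum(1 for row in at_risk if row[0] == t and row[1])
        d_a = sum(1 for row in at_risk
                  if row[0] == t and row[1] and row[2] == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi_square = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Toy data: group A has events early, group B late.
chi_square, p_value = logrank_test([1, 2, 3], [1, 1, 1], [10, 11, 12], [1, 1, 1])
```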
Parameters:
- Cohort: Select a cohort
- Gene: Input a gene symbol
- Cutoff: Samples with an expression level higher than this
threshold are considered the high-expression cohort, others
the low-expression cohort.
5. Two Cohort Analysis
5.1 Genomic Analysis Toolbox
This tool allows users to compare two different cohorts to
detect differentially mutated genes and visualize the results as
a forest plot.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- Name 1: Optional name for the first cohort.
- Name 2: Optional name for the second cohort.
- MinMut: Consider only genes mutated in at least this number of
samples in at least one of the cohorts.
Helpful for ignoring genes mutated in a single sample. Default 5.
- P-value: p-value threshold. Default 0.05.
- FDR Cutoff: FDR threshold. Default NULL. If provided,
adjusted p-values are used.
- Color 1: Optional color for the first cohort.
- Color 2: Optional color for the second cohort.
- Gene Name Size: Font size for gene symbols. Default 1.2.
- Height: Height of plot to be generated. Default 5.
- Width: Width of plot to be generated. Default 6.
This tool allows users to display the genomic alterations of
query genes as oncoplots and plot them side by side for better
comparison.
Parameters:
- Cohort 1: Select the first Cohort
- Cohort 2: Select the second Cohort
- Name 1: Optional name for the first cohort
- Name 2: Optional name for the second cohort
- Genes: Draw these genes. Default plots top 5 mutated
genes from two cohorts.
- Gene Name Size: Font size for gene names. Default 10
- Legend Font Size: Font size for legend. Default 10
- Title Font Size: Font size for title. Default 10
- Height: Height of the graphics region in inches. Default
7.
- Width: Width of plot to be generated. Default 6.
This tool allows users to display mutation spots on the protein
structure of a query gene and plot the two cohorts together for better
comparison.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- Name 1: Optional name for the first cohort.
- Name 2: Optional name for the second cohort.
- Gene: HGNC symbol of the gene whose protein structure is to be
drawn.
5.2 mRNA Analysis Toolbox
This tool allows users to apply custom p-value or FDR
cutoff thresholds to dynamically obtain differentially expressed
genes.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- FDR Cutoff: Cutoff value for adjusted p-values. Only
genes with lower p-values are listed.
- log2FC Cutoff: Minimum absolute log2-fold-change
required.
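How the two cutoffs combine can be sketched as follows. The multiple-testing method behind the adjusted p-values is not stated here; the Benjamini-Hochberg procedure used below is an assumption, as are the function names, the gene labels, and the default cutoff values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_de_genes(results, fdr_cutoff=0.05, log2fc_cutoff=1.0):
    """results maps gene -> (log2 fold change, raw p-value).

    A gene is kept when its adjusted p-value is below fdr_cutoff and
    its absolute log2 fold change reaches log2fc_cutoff.
    """
    genes = list(results)
    fdr = benjamini_hochberg([results[g][1] for g in genes])
    return [g for g, q in zip(genes, fdr)
            if q < fdr_cutoff and abs(results[g][0]) >= log2fc_cutoff]


# Hypothetical test results for four genes.
results = {"G1": (2.5, 0.01), "G2": (0.2, 0.04), "G3": (-1.8, 0.03), "G4": (3.0, 0.005)}
selected = select_de_genes(results)
```

In this toy input, G2 passes the FDR cutoff but is dropped because its fold change is too small.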
This tool allows users to plot a heatmap, which is used
to represent the expression levels of many genes across a
number of comparable samples.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- FDR Cutoff: Cut-off for adjusted p value. Horizontal
line will be drawn at -log10(pCutoff). DEFAULT = 0.05.
- log2FC Cutoff: Cut-off for absolute log2 fold-change.
Vertical lines will be drawn at the negative and positive values
of FCCutoff. DEFAULT = 2.0.
- kmeans_k: The number of kmeans clusters to make, if we
want to aggregate the rows before drawing heatmap. If NA then
the rows are not aggregated.
- Cluster Rows: Boolean values determining if rows should
be clustered or hclust object.
- Cluster Columns: Boolean values determining if columns
should be clustered or hclust object.
- Show Colnames: boolean specifying if column names are
shown.
- Title: The title of the plot
- Fontsize: Base fontsize for the plot
This tool allows users to draw a volcano plot. A volcano
plot is a type of scatterplot that shows statistical significance
(P value) versus magnitude of change (fold change). It enables
quick visual identification of genes with large fold changes that
are also statistically significant.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- FDR Cutoff: Cut-off for adjusted p value. Horizontal
line will be drawn at -log10(pCutoff). DEFAULT = 0.05
- log2FC Cutoff: Cut-off for absolute log2 fold-change.
Vertical lines will be drawn at the negative and positive values
of FCCutoff. DEFAULT = 2.0
- Genes: Only these genes that pass FCcutoff and pCutoff
thresholds will be labelled in the plot. DEFAULT = NULL
- Draw Connectors: Fit labels onto plot and connect to
their respective points by lines (TRUE/FALSE). DEFAULT = FALSE
- kmeans_k: The number of kmeans clusters to make, if we
want to aggregate the rows before drawing heatmap. If NA then
the rows are not aggregated.
- Cluster Rows: Boolean values determining if rows should
be clustered or hclust object.
- Cluster Columns: Boolean values determining if columns
should be clustered or hclust object.
- Show Colnames: boolean specifying if column names are
shown.
- Title: The title of the plot
- Fontsize: Base fontsize for the plot
This tool allows users to run GSEA. Gene Set Enrichment
Analysis (GSEA) is a computational method that determines whether
an a priori defined set of genes shows statistically
significant, concordant differences between two biological
states (e.g. phenotypes).
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- Name 1: Optional name for the first cohort
- Name 2: Optional name for the second cohort
- Gene: input a gene symbol.
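At the core of GSEA is a running-sum statistic computed over a list of genes ranked by their difference between the two cohorts. The sketch below implements the unweighted variant of that enrichment score for illustration; standard GSEA additionally weights hits by their ranking metric and assesses significance by permutation, neither of which is shown. The function name and gene labels are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list.

    The running sum rises by 1/n_hit at each gene in the set and falls
    by 1/n_miss otherwise; the score is its maximum deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Genes ranked from most up-regulated in cohort 1 to most down-regulated.
ranked = ["A", "B", "C", "D", "E", "F"]
top_score = enrichment_score(ranked, {"A", "B"})
bottom_score = enrichment_score(ranked, {"E", "F"})
```

A positive score means the gene set is concentrated at the top of the ranking, and a negative score means it is concentrated at the bottom.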
This tool allows users to generate box plots with jitter
for comparing expression of query gene in two cohorts.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- Name 1: Optional name for the first cohort
- Name 2: Optional name for the second cohort
- Gene: input a gene symbol.
5.3 Protein Analysis Toolbox
This tool allows users to apply custom p-value or FDR
cutoff thresholds to dynamically obtain differentially expressed
genes.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- FDR Cutoff: Cutoff value for adjusted p-values. Only
genes with lower p-values are listed.
- log2FC Cutoff: Minimum absolute log2-fold-change
required.
This tool allows users to plot a heatmap, which is used
to represent the expression levels of many genes across a
number of comparable samples.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- FDR Cutoff: Cut-off for adjusted p value. Horizontal
line will be drawn at -log10(pCutoff). DEFAULT = 0.05.
- log2FC Cutoff: Cut-off for absolute log2 fold-change.
Vertical lines will be drawn at the negative and positive values
of FCCutoff. DEFAULT = 2.0.
- kmeans_k: The number of kmeans clusters to make, if we
want to aggregate the rows before drawing heatmap. If NA then
the rows are not aggregated.
- Cluster Rows: Boolean values determining if rows should
be clustered or hclust object.
- Cluster Columns: Boolean values determining if columns
should be clustered or hclust object.
- Show Colnames: boolean specifying if column names are
shown.
- Title: The title of the plot
- Fontsize: Base fontsize for the plot
This tool allows users to draw a volcano plot. A volcano
plot is a type of scatterplot that shows statistical significance
(P value) versus magnitude of change (fold change). It enables
quick visual identification of genes with large fold changes that
are also statistically significant.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- FDR Cutoff: Cut-off for adjusted p value. Horizontal
line will be drawn at -log10(pCutoff). DEFAULT = 0.05
- log2FC Cutoff: Cut-off for absolute log2 fold-change.
Vertical lines will be drawn at the negative and positive values
of FCCutoff. DEFAULT = 2.0
- Genes: Only these genes that pass FCcutoff and pCutoff
thresholds will be labelled in the plot. DEFAULT = NULL
- Draw Connectors: Fit labels onto plot and connect to
their respective points by lines (TRUE/FALSE). DEFAULT = FALSE
- kmeans_k: The number of kmeans clusters to make, if we
want to aggregate the rows before drawing heatmap. If NA then
the rows are not aggregated.
- Cluster Rows: Boolean values determining if rows should
be clustered or hclust object.
- Cluster Columns: Boolean values determining if columns
should be clustered or hclust object.
- Show Colnames: boolean specifying if column names are
shown.
- Title: The title of the plot
- Fontsize: Base fontsize for the plot
This tool allows users to run GSEA. Gene Set Enrichment
Analysis (GSEA) is a computational method that determines whether
an a priori defined set of genes shows statistically
significant, concordant differences between two biological
states (e.g. phenotypes).
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- Name 1: Optional name for the first cohort
- Name 2: Optional name for the second cohort
- Gene: input a gene symbol.
This tool allows users to generate box plots with jitter
for comparing expression of query gene in two cohorts.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- Name 1: Optional name for the first cohort
- Name 2: Optional name for the second cohort
- Gene: input a gene symbol.
5.4 Clinical Analysis Toolbox
This tool allows users to compare survival time (OS, DSS,
PFI, DFI) of patients in two cohorts.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- Name 1: Optional name for the first cohort
- Name 2: Optional name for the second cohort
This tool allows users to evaluate simultaneously the effect of several factors on survival and visualize the result as a forest plot for two cohorts.
Parameters:
- Cohort 1: Select the first Cohort.
- Cohort 2: Select the second Cohort.
- Name 1: Optional name for the first cohort
- Name 2: Optional name for the second cohort
- Endpoint: Four clinical survival outcome endpoints: Overall Survival (OS), Progression-Free Interval (PFI), Disease-Free Interval (DFI), and Disease-Specific Survival (DSS).
OS is the period from the date of diagnosis until the date of death from any cause.
DSS is the time from the date of initial diagnosis until the date of death from the disease.
PFI is the period from the date of diagnosis until the date of the first occurrence of a new tumor event.
DFI defined here is the period from the date of diagnosis until the date of the first new tumor progression event subsequent to the determination of a patient’s disease-free status after their initial diagnosis and treatment.