1. Introduction

CVCDAP stands for Cancer Virtual Cohort Discovery Analysis Platform. It is an open-access data portal hosting a wide variety of cancer multi-omics data, including both public and self-produced data. CVCDAP provides convenient data visualization and computational tools for analyzing customizable cohorts. All data and informatics tools are freely available to the wider community of cancer researchers.

2. Data source

We downloaded clinical, somatic mutation and mRNA data for TCGA from the NCI Genomic Data Commons Data Portal, and proteome data from CPTAC. The datasets from the NCI Genomic Data Commons Data Portal represent the most uniform attempt to systematically provide multi-omics data for the TCGA tumors used in pan-cancer analysis. We retained only primary tumors.

Clinical Data: It was obtained from https://api.gdc.cancer.gov/data/1b5f413e-a8d1-4d10-92eb-7c4ae739ed81. We retained attributes such as age, gender, race, stage and grade; mapped projectID, disease type and primary site; and calculated tumor mutation burden (TMB), reported as TMB_nonsyn and TMB_total.

Mutation: It was obtained from http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc, which includes mutation calls for each sample.

RNAseq: It was obtained from http://api.gdc.cancer.gov/data/3586c0da-64d0-4b74-a449-5ff4d9136611. We first corrected gene symbols and imputed missing values separately for each disease type, then merged the data and converted FPKM to TPM. The log2 TPM expression value of each gene in each sample is stored in the database.
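
The FPKM-to-TPM conversion can be sketched as follows. This is only an illustration, not the portal's actual pipeline: the function name is hypothetical, and the pseudocount of 1 added before the log2 transform is an assumption, since the text does not say how zero values are handled.

```python
import math

def fpkm_to_log2tpm(fpkm_by_gene, pseudocount=1.0):
    """Convert one sample's FPKM values to log2(TPM + pseudocount).

    TPM_i = FPKM_i / sum(FPKM) * 1e6, so the TPM values of a sample sum to 1e6.
    The pseudocount is an assumption that keeps log2 defined for zero values.
    """
    total = sum(fpkm_by_gene.values())
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    return {
        gene: math.log2(value / total * 1e6 + pseudocount)
        for gene, value in fpkm_by_gene.items()
    }

# Toy sample with three genes (values are made up for illustration).
sample = {"TP53": 10.0, "EGFR": 30.0, "KRAS": 60.0}
log2tpm = fpkm_to_log2tpm(sample)
```

Because TPM is rescaled within each sample, values are comparable across samples in a way that raw FPKM values are not.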

Proteomics: The database currently includes only data produced by iTRAQ4 and TMT10. After removing proteins with completeness below 50%, we imputed missing values separately for each study, merged the studies while retaining only the proteins common to all of them, and performed quantile normalization. The normalized protein expression of each gene in each sample is stored in the database.
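
The completeness filter and the quantile normalization can be illustrated with the sketch below. The function names and the toy matrix are hypothetical, missing values are represented as None, the imputation step is not shown, and ties are broken by input order rather than averaged, which may differ from the implementation used by the portal.

```python
def filter_by_completeness(matrix, min_completeness=0.5):
    """Keep proteins quantified (not None) in at least min_completeness of samples."""
    kept = {}
    for protein, values in matrix.items():
        observed = sum(v is not None for v in values)
        if observed / len(values) >= min_completeness:
            kept[protein] = values
    return kept

def quantile_normalize(matrix):
    """Quantile-normalize a complete protein x sample matrix (dict of equal-length lists)."""
    proteins = list(matrix)
    n_samples = len(matrix[proteins[0]])
    columns = [[matrix[p][j] for p in proteins] for j in range(n_samples)]
    # The mean of the k-th smallest value across samples is the reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[k] for col in sorted_cols) / n_samples
                 for k in range(len(proteins))]
    normalized = {p: [0.0] * n_samples for p in proteins}
    for j, col in enumerate(columns):
        order = sorted(range(len(proteins)), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[proteins[i]][j] = reference[rank]
    return normalized

# Toy matrix: P5 is observed in only 1 of 3 samples and is therefore removed.
raw = {
    "P1": [5.0, 4.0, 3.0],
    "P2": [2.0, 1.0, 4.0],
    "P3": [3.0, 4.0, 6.0],
    "P4": [4.0, 2.0, 8.0],
    "P5": [None, None, 1.0],
}
filtered = filter_by_completeness(raw)
normalized = quantile_normalize(filtered)
```

After normalization every sample column contains the same set of values, so differences between samples reflect rank changes only.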

Data Summary

3. Define Virtual Cohorts

A virtual cohort is a custom study composed of samples from one or more existing studies. CVCDAP allows users to define a custom cohort of samples that fit user-specified tissue, genomic or clinical criteria of interest. These samples can be a subset of the data available in an existing study, or result from the combination of multiple existing studies. This cohort of samples can then be queried or explored just like a traditional study.
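
Conceptually, defining a virtual cohort is a filter over sample records drawn from several studies. The sketch below shows the idea only; the Sample fields, the function name and the toy records are hypothetical and do not describe the portal's data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    sample_id: str
    study: str
    primary_site: str
    stage: str
    mutated_genes: frozenset

def define_virtual_cohort(samples, studies=None, primary_site=None,
                          stages=None, mutated_gene=None):
    """Return the IDs of samples that satisfy every criterion that was supplied."""
    cohort = []
    for s in samples:
        if studies is not None and s.study not in studies:
            continue
        if primary_site is not None and s.primary_site != primary_site:
            continue
        if stages is not None and s.stage not in stages:
            continue
        if mutated_gene is not None and mutated_gene not in s.mutated_genes:
            continue
        cohort.append(s.sample_id)
    return cohort

# Toy records; the IDs and annotations are made up for illustration.
samples = [
    Sample("S1", "TCGA-LUAD", "Lung", "Stage I", frozenset({"KRAS", "TP53"})),
    Sample("S2", "TCGA-LUSC", "Lung", "Stage III", frozenset({"TP53"})),
    Sample("S3", "TCGA-BRCA", "Breast", "Stage I", frozenset({"PIK3CA"})),
    Sample("S4", "TCGA-LUAD", "Lung", "Stage I", frozenset({"EGFR"})),
]
lung_tp53 = define_virtual_cohort(
    samples, studies={"TCGA-LUAD", "TCGA-LUSC"},
    primary_site="Lung", mutated_gene="TP53",
)
```

Here the cohort combines two existing studies and keeps only the TP53-mutated lung samples, which mirrors the "subset" and "combination" cases described above.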

4. Single Cohort Analysis

4.1 Genomic analyses toolbox

Genecloud

This tool allows users to plot a word cloud of mutated genes.

Parameters:

  1. Cohort: Select a cohort
  2. minMut: Minimum number of samples in which a gene is required to be mutated. Default 3.
  3. top: Plot only the top n mutated genes. Default NULL.
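
The effect of minMut and top can be sketched as a count of mutated samples per gene. The function name and the input format of (sample, gene) pairs are assumptions made for illustration; the word cloud would then scale each gene name by its count.

```python
from collections import defaultdict

def count_mutated_samples(mutations, min_mut=3, top=None):
    """Count, per gene, the distinct samples carrying a mutation.

    mutations: iterable of (sample_id, gene) pairs; repeated pairs count once.
    Genes mutated in fewer than min_mut samples are dropped.
    """
    samples_by_gene = defaultdict(set)
    for sample_id, gene in mutations:
        samples_by_gene[gene].add(sample_id)
    counts = {g: len(s) for g, s in samples_by_gene.items() if len(s) >= min_mut}
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top is None else ranked[:top]

mutations = [
    ("S1", "TP53"), ("S2", "TP53"), ("S3", "TP53"), ("S1", "TP53"),
    ("S1", "KRAS"), ("S2", "KRAS"), ("S3", "KRAS"), ("S4", "KRAS"),
    ("S1", "EGFR"),
]
word_sizes = count_mutated_samples(mutations, min_mut=3)
top_gene = count_mutated_samples(mutations, min_mut=3, top=1)
```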

Onco-plot

This tool allows users to overview the genetic alterations per sample in each gene by heatmap.

Parameters:

  1. Cohort: Select a cohort
  2. top: Number of top mutated genes to draw. Default 20.
  3. genes: Draw the oncoplot only for these genes. At least two genes are required. Default NULL.
  4. titleFontSize: Font size for the title. Default 15.
  5. legendFontSize: Font size for the legend. Default 12.

Onco-strip

This tool allows users to visualize mutations in any set of genes.

Parameters:

  1. Cohort: Select a cohort
  2. top: Number of top mutated genes to draw. Default 5.
  3. genes: Draw the plot only for these genes. At least two genes are required. Default NULL.
  4. titleFontSize: Font size for the title. Default 15.
  5. legendFontSize: Font size for the legend. Default 12.

Titv Plot

This tool allows users to draw three boxplots, which show the proportion of transitions and transversions, the overall distribution of the six different conversions, and the fraction of conversions in each sample.

Parameters:

  1. Cohort: Select a cohort
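
The six conversion classes and the transition/transversion split can be illustrated as follows. This is a sketch with hypothetical function names; it takes single-nucleotide variants as (reference, alternate) base pairs and folds purine references onto the complementary strand, the usual convention for reporting six classes.

```python
# Purine-reference changes are folded onto the complementary strand,
# so that G>A is counted as C>T, A>G as T>C, and so on.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
TRANSITIONS = {"C>T", "T>C"}

def conversion_class(ref, alt):
    """Map a single-base substitution to one of six pyrimidine-centred classes."""
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

def titv_fractions(snvs):
    """snvs: list of (ref, alt). Return (class fractions, Ti fraction, Tv fraction)."""
    counts = {}
    for ref, alt in snvs:
        cls = conversion_class(ref, alt)
        counts[cls] = counts.get(cls, 0) + 1
    total = sum(counts.values())
    fractions = {cls: n / total for cls, n in counts.items()}
    ti = sum(f for cls, f in fractions.items() if cls in TRANSITIONS)
    return fractions, ti, 1.0 - ti

# One sample with four SNVs: two C>T (one given as G>A), one T>C, one C>A.
fractions, ti, tv = titv_fractions([("C", "T"), ("G", "A"), ("A", "G"), ("C", "A")])
```

Computing these fractions per sample yields the per-sample view, and pooling them across the cohort yields the overall distributions.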

VAF Plot

This tool allows users to plot variant allele frequencies (VAF) as a boxplot, which helps to quickly estimate the clonal status of the top mutated genes (clonal genes usually have a mean allele frequency of around 50%, assuming a pure sample).

Parameters:

  1. Cohort: Select a cohort
  2. Genes: Genes for which plots are to be generated.
  3. Top: If Genes is NULL, plot the top n genes. Default 5.
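
As a rough illustration of the quantity being plotted, the VAF of a call is the number of reads supporting the alternate allele divided by the total read depth. The function name and the tuple layout below are assumptions for illustration only.

```python
from statistics import mean

def gene_mean_vaf(calls):
    """calls: list of (gene, alt_read_count, total_depth).

    Returns the mean variant allele frequency per gene; calls with zero
    depth are skipped because their VAF is undefined.
    """
    vafs = {}
    for gene, alt, depth in calls:
        if depth > 0:
            vafs.setdefault(gene, []).append(alt / depth)
    return {gene: mean(values) for gene, values in vafs.items()}

# Toy calls: TP53 looks clonal (about 50%), KRAS looks subclonal (about 10%).
calls = [("TP53", 50, 100), ("TP53", 30, 60), ("KRAS", 10, 100), ("KRAS", 0, 0)]
mean_vaf = gene_mean_vaf(calls)
```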

Lollipop Plot

This tool allows users to draw a lollipop plot, a hybrid between a scatter plot and a barplot, which shows mutation spots on the protein structure.

Parameters:

  1. Cohort: Select a cohort
  2. Gene: HGNC symbol of the gene whose protein structure is to be drawn.

Rainfall Plot

This tool allows users to draw a rainfall plot, which can be seen as a scatter plot showing the location of events on the x-axis versus the distance between consecutive events on the y-axis.

Parameters:

  1. Cohort: Select a cohort
  2. Tumor Sample Barcode: Sample names (Tumor_Sample_Barcodes) to plot. If NULL, the most mutated sample is plotted.
  3. DetectChangePoints: If TRUE, detects genomic change points where potential kataegis events are formed. Results are written to a tab-delimited output file.
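
The points of a rainfall plot can be derived from mutation positions as sketched below. The function name and input layout are hypothetical, and the log10 scale for the distance is a common convention assumed here rather than something stated above; change-point detection is not shown.

```python
import math

def rainfall_points(positions_by_chrom):
    """Return (chromosome, position, log10 distance to the previous mutation)."""
    points = []
    for chrom, positions in positions_by_chrom.items():
        ordered = sorted(positions)
        for prev, curr in zip(ordered, ordered[1:]):
            if curr > prev:  # skip duplicate positions, whose distance is zero
                points.append((chrom, curr, math.log10(curr - prev)))
    return points

# Toy sample: the two mutations only 10 bp apart form a dip in the plot.
points = rainfall_points({"chr1": [100, 1100, 1110, 101110]})
```

Clusters of closely spaced mutations, such as kataegis, appear as runs of points with small y values.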

Driver Gene

This tool allows users to identify cancer driver genes.

Parameters:

  1. Cohort: Select a cohort
  2. MinMut: Minimum number of mutations required for a gene to be included in the analysis. Default 5.
  3. FDR Cutoff: FDR cutoff for calling a gene a driver.

Mutation Signature

This tool allows users to determine the contribution of known mutational processes.

Parameters:

  1. Cohort: Select a cohort
  2. SignaturesRef: Either a data frame or the location of a signature text file, in which rows are signatures and columns are trinucleotide contexts. Set to either "signatures.nature2013" or "signatures.cosmic".
  3. TriCountsMethod: Set to one of:
    'default' – no further normalization.
    'exome' – normalized by the number of times each trinucleotide context is observed in the exome.
    'genome' – normalized by the number of times each trinucleotide context is observed in the genome.
    'exome2genome' – multiplied by the ratio of that trinucleotide's occurrence in the genome to its occurrence in the exome.
    'genome2exome' – multiplied by the ratio of that trinucleotide's occurrence in the exome to its occurrence in the genome.
    data frame containing user-defined scaling factors – count data for each trinucleotide context is multiplied by the corresponding value given in the data frame.
  4. DetectChangePoints: If TRUE, detects genomic change points where potential kataegis events are formed. Results are written to a tab-delimited output file.

Drug-Gene Interaction

This tool allows users to plot potential druggable gene categories as a boxplot, based on drug–gene interactions and gene druggability information compiled from the Drug Gene Interaction database.

Parameters:

  1. Cohort: Select a cohort
  2. Top: Number of top mutated genes to check. Default 20.
  3. Genes: Manually specified gene list.

Mutual Exclusivity

This tool allows users to determine whether query genes are altered in a mutually exclusive manner by performing pairwise Fisher’s exact tests.

Parameters:

  1. Cohort: Select a cohort
  2. Top: Check for interactions among the top n genes. Default 25.
  3. Genes: List of genes among which interactions should be tested. At least 5 genes are required. If not provided, the test is performed among the top 25 genes.
  4. Upper P-value: Upper p-value threshold. Default 0.05.
  5. Lower P-value: Lower p-value threshold. Default 0.01.
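
For one pair of genes, the pairwise test can be sketched with a self-contained Fisher's exact test on the 2x2 table of mutation status. The function names are hypothetical, and the two-sided p-value uses the common "sum of tables no more likely than the observed one" definition, which is an assumption about the variant used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def pair_test(samples, mutated_a, mutated_b):
    """Test two genes for mutual exclusivity or co-occurrence across samples."""
    both = sum(s in mutated_a and s in mutated_b for s in samples)
    only_a = sum(s in mutated_a and s not in mutated_b for s in samples)
    only_b = sum(s not in mutated_a and s in mutated_b for s in samples)
    neither = len(samples) - both - only_a - only_b
    p = fisher_exact_two_sided(both, only_a, only_b, neither)
    # An odds ratio below 1 points to mutual exclusivity, otherwise co-occurrence.
    tendency = "mutually exclusive" if both * neither < only_a * only_b else "co-occurring"
    return p, tendency

samples = [f"S{i}" for i in range(10)]
gene_a = set(samples[:5])   # mutated in S0..S4
gene_b = set(samples[5:])   # mutated in S5..S9
p_value, tendency = pair_test(samples, gene_a, gene_b)
```

In this toy cohort the two genes never co-occur, so the test reports a small p-value with an exclusive tendency.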

4.2 mRNA analysis toolbox

PCA Plot

This tool allows users to conduct Principal Component Analysis (PCA) to better visualize the variation present in a cohort with a given gene list.

Parameters:

  1. Cohort: Select a cohort
  2. Gene List: Input a gene list
  3. Label: Optional clinical variable indicating the groups that the samples belong to. If provided the points will be colored according to groups.
  4. Circle: Draw a correlation circle? Default TRUE.

t-SNE Plot

This tool allows users to perform T-distributed Stochastic Neighbor Embedding (t-SNE) for visualization in a low-dimensional space of two dimensions in a cohort with a given gene list.

Parameters:

  1. Cohort: Select a cohort
  2. Gene List: Input a gene list
  3. Label: Optional clinical variable indicating the groups that the samples belong to. If provided the points will be colored according to groups.
  4. Label Point Size: Size of the points used for the labels.

Clustering

This tool allows users to perform k-means clustering or hierarchical cluster analysis on a given cohort.

Parameters:

  1. Cohort: Select a cohort
  2. Gene List: Input a gene list.
  3. k: The number of clusters.
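
To show what k-means does with the parameter k, here is a minimal pure-Python version of Lloyd's algorithm. It is a sketch only: each point stands for one sample's expression values over the gene list, and seeding the centroids with the first k points is a simplification, not the portal's initialization.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; centroids are seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Six toy samples described by two genes; two groups are clearly separated.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
labels, centroids = kmeans(points, k=2)
```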

4.3 Protein analysis toolbox

PCA Plot

This tool allows users to conduct Principal Component Analysis (PCA) to better visualize the variation present in a cohort with a given gene list.

Parameters:

  1. Cohort: Select a cohort
  2. Gene List: Input a gene list
  3. Label: Optional clinical variable indicating the groups that the samples belong to. If provided the points will be colored according to groups.
  4. Circle: Draw a correlation circle? Default TRUE.

t-SNE Plot

This tool allows users to perform T-distributed Stochastic Neighbor Embedding (t-SNE) for visualization in a low-dimensional space of two dimensions in a cohort with a given gene list.

Parameters:

  1. Cohort: Select a cohort
  2. Gene List: Input a gene list
  3. Label: Optional clinical variable indicating the groups that the samples belong to. If provided the points will be colored according to groups.
  4. Label Point Size: Size of the points used for the labels.

Clustering

This tool allows users to perform k-means clustering or hierarchical cluster analysis on a given cohort.

Parameters:

  1. Cohort: Select a cohort
  2. Gene List: Input a gene list.
  3. k: The number of clusters.

4.4 Clinical Analysis Toolbox

Single Cohort Survival Analysis

This tool allows users to compare the survival time (OS, DSS, PFI, DFI) of patients divided into two subgroups by a clinical characteristic such as stage, by the mutation status of a gene, or by gene expression. Survival in the two subgroups is compared using the log-rank test.

Parameters:

  1. Cohort: Select a cohort
  2. Gene: Input a gene symbol
  3. Cutoff: Samples with expression level higher than this threshold are considered as the high-expression cohort, others as the low expression cohort.
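
The expression-based split and the log-rank comparison can be sketched as follows. The function names and toy data are hypothetical; the test is the standard two-group log-rank statistic with a chi-square approximation on one degree of freedom, and it is not necessarily computed the same way by the portal.

```python
import math

def split_by_cutoff(expression, cutoff):
    """Samples above the cutoff form the high group, all others the low group."""
    high = [s for s, v in expression.items() if v > cutoff]
    low = [s for s, v in expression.items() if v <= cutoff]
    return high, low

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    diff, variance = 0.0, 0.0
    for t in sorted({t for t, e, _ in data if e}):
        n = sum(1 for ti, _, _ in data if ti >= t)                 # at risk, both groups
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)    # at risk, group A
        d = sum(1 for ti, e, _ in data if ti == t and e)           # events at t
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        diff += d_a - d * n_a / n                                  # observed minus expected
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = diff ** 2 / variance
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))

high, low = split_by_cutoff({"S1": 8.0, "S2": 2.5, "S3": 6.1, "S4": 1.0}, cutoff=5.0)
# Toy follow-up times with all events observed (1 = event, 0 = censored).
chi2, p_value = logrank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
```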

5. Two Cohort Analysis

5.1 Genomic Analysis Toolbox

Forest Plot

This tool allows users to compare two different cohorts to detect differentially mutated genes and visualize the results as a forest plot.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. Name 1: Optional name for the first cohort.
  4. Name 2: Optional name for the second cohort.
  5. MinMut: Consider only genes mutated in at least this number of samples in at least one of the cohorts. Helpful for ignoring genes mutated in a single sample. Default 5.
  6. P-value: p-value threshold. Default 0.05.
  7. FDR Cutoff: FDR threshold. Default NULL. If provided, adjusted p-values are used.
  8. Color 1: Optional color for the first cohort.
  9. Color 2: Optional color for the second cohort.
  10. Gene Name Size: Font size for gene symbols. Default 1.2.
  11. Height: Height of plot to be generated. Default 5.
  12. Width: Width of plot to be generated. Default 6.

Co-oncoPlot

This tool allows users to display the genomic alterations of query genes as oncoplots for two cohorts and plot them side by side for easier comparison.

Parameters:

  1. Cohort 1: Select the first Cohort
  2. Cohort 2: Select the second Cohort
  3. Name 1: Optional name for the first cohort
  4. Name 2: Optional name for the second cohort
  5. Genes: Draw these genes. Default plots top 5 mutated genes from two cohorts.
  6. Gene Name Size: Font size for gene names. Default 10
  7. Legend Font Size: Font size for legend. Default 10
  8. Title Font Size: Font size for the title. Default 10
  9. Height: Height of the graphics region in inches. Default 7.
  10. Width: Width of plot to be generated. Default 6.

Lollipop Plot2

This tool allows users to display mutation spots on the protein structure of a query gene for two cohorts and plot them together for easier comparison.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. Name 1: Optional name for the first cohort.
  4. Name 2: Optional name for the second cohort.
  5. Gene: HGNC symbol of the gene whose protein structure is to be drawn.

5.2 mRNA Analysis Toolbox

DEG

This tool allows users to apply custom p-value or FDR cutoff thresholds to dynamically obtain differentially expressed genes.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. FDR Cutoff: Cutoff value for adjusted p-values. Only genes with lower adjusted p-values are listed.
  4. log2FC Cutoff: Minimum absolute log2-fold-change required.
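
How the two cutoffs act on a result table can be sketched as below. The statistical test that produces the raw p-values is not specified above, so they are taken as given; the Benjamini-Hochberg adjustment, the function names and the default values in the sketch are assumptions for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_deg(results, fdr_cutoff=0.05, log2fc_cutoff=1.0):
    """results: list of (gene, log2 fold change, raw p-value).

    Keeps genes whose adjusted p-value is below fdr_cutoff and whose
    absolute log2 fold change is at least log2fc_cutoff.
    """
    fdr = benjamini_hochberg([p for _, _, p in results])
    return [
        (gene, lfc, q)
        for (gene, lfc, _), q in zip(results, fdr)
        if q < fdr_cutoff and abs(lfc) >= log2fc_cutoff
    ]

# Toy results; gene names and values are made up for illustration.
results = [("A", 2.5, 0.01), ("B", -3.0, 0.04), ("C", 0.2, 0.03), ("D", 1.5, 0.5)]
adjusted = benjamini_hochberg([p for _, _, p in results])
deg = select_deg(results, fdr_cutoff=0.05, log2fc_cutoff=1.0)
```

Gene B has a large fold change but does not survive the multiple-testing adjustment, which is why only gene A remains.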

Heatmap

This tool allows users to plot a heatmap, which represents the expression levels of many genes across a number of comparable samples.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. FDR Cutoff: Cut-off for adjusted p value. Horizontal line will be drawn at -log10(pCutoff). DEFAULT = 0.05.
  4. log2FC Cutoff: Cut-off for absolute log2 fold-change. Vertical lines will be drawn at the negative and positive values of FCCutoff. DEFAULT = 2.0.
  5. kmeans_k: The number of kmeans clusters to make, if we want to aggregate the rows before drawing heatmap. If NA then the rows are not aggregated.
  6. Cluster Rows: Boolean value determining whether rows should be clustered, or an hclust object.
  7. Cluster Columns: Boolean value determining whether columns should be clustered, or an hclust object.
  8. Show Colnames: Boolean specifying whether column names are shown.
  9. Title: The title of the plot
  10. Fontsize: Base fontsize for the plot

Volcano Plot

This tool allows users to draw a volcano plot. A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. FDR Cutoff: Cut-off for adjusted p value. Horizontal line will be drawn at -log10(pCutoff). DEFAULT = 0.05
  4. log2FC Cutoff: Cut-off for absolute log2 fold-change. Vertical lines will be drawn at the negative and positive values of FCCutoff. DEFAULT = 2.0
  5. Genes: Of these genes, only those that pass the FCcutoff and pCutoff thresholds are labelled in the plot. DEFAULT = NULL
  6. Draw Connectors: Fit labels onto plot and connect to their respective points by lines (TRUE/FALSE). DEFAULT = FALSE
  7. kmeans_k: The number of kmeans clusters to make, if we want to aggregate the rows before drawing heatmap. If NA then the rows are not aggregated.
  8. Cluster Rows: Boolean value determining whether rows should be clustered, or an hclust object.
  9. Cluster Columns: Boolean value determining whether columns should be clustered, or an hclust object.
  10. Show Colnames: Boolean specifying whether column names are shown.
  11. Title: The title of the plot
  12. Fontsize: Base fontsize for the plot

GSEA

This tool allows users to run Gene Set Enrichment Analysis (GSEA), a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. Name 1: Optional name for the first cohort
  4. Name 2: Optional name for the second cohort
  5. Gene: input a gene symbol.
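
The core of GSEA is a running-sum statistic over a ranked gene list. The sketch below uses the unweighted form, which is a simplification: standard GSEA weights each hit by its ranking metric and assesses significance by permutation, neither of which is shown. The function name and toy data are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list.

    Walking down the list, the sum rises at genes in the set and falls at
    genes outside it; the score is the maximum deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Genes ranked from most up-regulated in cohort 1 to most up-regulated in cohort 2.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
es_top = enrichment_score(ranked, {"G1", "G2"})
es_bottom = enrichment_score(ranked, {"G5", "G6"})
```

A positive score indicates that the set is concentrated at the top of the ranking, a negative score that it is concentrated at the bottom.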

Box Plot

This tool allows users to generate box plots with jitter for comparing the expression of a query gene in two cohorts.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. Name 1: Optional name for the first cohort
  4. Name 2: Optional name for the second cohort
  5. Gene: input a gene symbol.

5.3 Protein Analysis Toolbox

DEP

This tool allows users to apply custom p-value or FDR cutoff thresholds to dynamically obtain differentially expressed proteins.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. FDR Cutoff: Cutoff value for adjusted p-values. Only genes with lower adjusted p-values are listed.
  4. log2FC Cutoff: Minimum absolute log2-fold-change required.

Heatmap

This tool allows users to plot a heatmap, which represents the expression levels of many genes across a number of comparable samples.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. FDR Cutoff: Cut-off for adjusted p value. Horizontal line will be drawn at -log10(pCutoff). DEFAULT = 0.05.
  4. log2FC Cutoff: Cut-off for absolute log2 fold-change. Vertical lines will be drawn at the negative and positive values of FCCutoff. DEFAULT = 2.0.
  5. kmeans_k: The number of kmeans clusters to make, if we want to aggregate the rows before drawing heatmap. If NA then the rows are not aggregated.
  6. Cluster Rows: Boolean value determining whether rows should be clustered, or an hclust object.
  7. Cluster Columns: Boolean value determining whether columns should be clustered, or an hclust object.
  8. Show Colnames: Boolean specifying whether column names are shown.
  9. Title: The title of the plot
  10. Fontsize: Base fontsize for the plot

Volcano Plot

This tool allows users to draw a volcano plot. A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. FDR Cutoff: Cut-off for adjusted p value. Horizontal line will be drawn at -log10(pCutoff). DEFAULT = 0.05
  4. log2FC Cutoff: Cut-off for absolute log2 fold-change. Vertical lines will be drawn at the negative and positive values of FCCutoff. DEFAULT = 2.0
  5. Genes: Of these genes, only those that pass the FCcutoff and pCutoff thresholds are labelled in the plot. DEFAULT = NULL
  6. Draw Connectors: Fit labels onto plot and connect to their respective points by lines (TRUE/FALSE). DEFAULT = FALSE
  7. kmeans_k: The number of kmeans clusters to make, if we want to aggregate the rows before drawing heatmap. If NA then the rows are not aggregated.
  8. Cluster Rows: Boolean value determining whether rows should be clustered, or an hclust object.
  9. Cluster Columns: Boolean value determining whether columns should be clustered, or an hclust object.
  10. Show Colnames: Boolean specifying whether column names are shown.
  11. Title: The title of the plot
  12. Fontsize: Base fontsize for the plot

GSEA

This tool allows users to run Gene Set Enrichment Analysis (GSEA), a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. Name 1: Optional name for the first cohort
  4. Name 2: Optional name for the second cohort
  5. Gene: input a gene symbol.

Box Plot

This tool allows users to generate box plots with jitter for comparing the expression of a query gene in two cohorts.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. Name 1: Optional name for the first cohort
  4. Name 2: Optional name for the second cohort
  5. Gene: input a gene symbol.

5.4 Clinical Analysis Toolbox

Two Cohort Survival Analysis

This tool allows users to compare survival time (OS, DSS, PFI, DFI) of patients in two cohorts.

Parameters:

  1. Cohort 1: Select the first Cohort.
  2. Cohort 2: Select the second Cohort.
  3. Name 1: Optional name for the first cohort
  4. Name 2: Optional name for the second cohort