Data & Statistics
Current Release (September 26th, 2020)
Protein-Protein Interactions
PPIs Stats
Organisms | Binary Interactions* | Complexes** |
---|---|---|
Homo sapiens | 439,714 | 15,252 |
Saccharomyces cerevisiae | 128,319 | 6,302 |
Caenorhabditis elegans | 22,305 | 105 |
Drosophila melanogaster | 57,578 | 810 |
Mus musculus | 57,669 | 1,304 |
Rattus norvegicus | 5,796 | 307 |
Arabidopsis thaliana | 56,282 | 431 |
* Number of interactions: the sum of self-interactions and binary interactions in which all participating proteins have a UniProt accession number.
** Number of complexes: the number of complexes in which all participating proteins have a UniProt accession number.
Data Sources
Original Database | Version |
---|---|
IntAct | version 4.2.15 |
BioGRID | version 3.5.185 |
MINT | May 21, 2020 |
DIP | version 20170205 |
HPRD | release 9 |
We matched identical interaction records across the source databases to build a non-redundant dataset.
Because some of the original records have limited annotation, we also used BioMart and UniProt to annotate every protein with the same high-quality information.
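The merging step can be illustrated with a minimal sketch. It assumes that two records describe the same interaction when they share the same unordered pair of UniProt accessions; the actual matching criteria used by PINA are not stated here and may differ. The function name `merge_interactions` and the input tuples are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def merge_interactions(records):
    """Collapse (source_db, uniprot_a, uniprot_b) records into unique pairs.

    Assumption: records are "the same" when the unordered pair of UniProt
    accessions matches. This is an illustrative rule, not PINA's documented one.
    """
    merged = defaultdict(set)
    for source, a, b in records:
        if not a or not b:
            # Skip records where a participant lacks a UniProt accession.
            continue
        # Sort the pair so that A-B and B-A map to the same key;
        # a self-interaction yields the key (A, A).
        key = tuple(sorted((a, b)))
        merged[key].add(source)
    return dict(merged)


# Illustrative input only; not taken from the PINA dataset.
records = [
    ("IntAct", "P04637", "Q00987"),
    ("BioGRID", "Q00987", "P04637"),  # same pair, reversed order
    ("MINT", "P04637", "P04637"),     # self-interaction
    ("DIP", "P04637", None),          # missing accession, dropped
]
unique = merge_interactions(records)
```

Keeping the set of contributing databases per pair makes it possible to trace each non-redundant interaction back to its original sources.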
Cancer Data
Data Sources & Preprocessing
- Tumor type-specific cancer driver genes were taken from a recent TCGA Pan-cancer analysis of 9,423 tumor exomes.
- Targets of therapeutic compounds were downloaded from the Genomics of Drug Sensitivity in Cancer (GDSC).
- Cancer transcriptome profiles were downloaded from the Genomic Data Commons (GDC) portal of TCGA (version 20190101). The batch-corrected, upper-quartile-normalized RSEM measurements were log2-transformed for mRNA expression analysis.
- Cancer proteome data were downloaded from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) data portal (version 20200511). The relative protein abundances generated by the Common Data Analysis Pipeline (CDAP) were quantile-normalized using the normalizeQuantiles function implemented in the R package limma v3.36.1.
- Both mRNA and protein expression datasets were further filtered by removing genes with zero or NA values in more than 80% of samples.
- Clinical data (survival time, tumor site, age, ethnicity, and grade) were downloaded from both GDC and CPTAC for corresponding samples with molecular data.
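The log2 transformation and the 80% filtering rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build PINA: the pseudocount of 1 in the log2 transform is an assumption (the text does not say how zeros were handled), missing values are represented as `None`, and the function names and gene labels are hypothetical. Quantile normalization with limma is not reproduced here.

```python
import math


def log2_transform(value, pseudocount=1.0):
    """Return log2(value + pseudocount); missing values stay missing.

    The pseudocount is an assumption made so that zeros remain defined.
    """
    if value is None:
        return None
    return math.log2(value + pseudocount)


def keep_gene(values, max_missing_fraction=0.8):
    """Keep a gene unless zero/NA values exceed 80% of its samples."""
    missing = sum(1 for v in values if v is None or v == 0)
    return missing / len(values) <= max_missing_fraction


def preprocess(matrix):
    """matrix maps gene name -> list of per-sample expression values."""
    result = {}
    for gene, values in matrix.items():
        transformed = [log2_transform(v) for v in values]
        if keep_gene(transformed):
            result[gene] = transformed
    return result


# Toy data with hypothetical gene labels.
expr = {
    "GENE_A": [3.0, 7.0, 0.0, 15.0, 1.0],
    "GENE_B": [0.0, None, 0.0, 0.0, 0.0],  # 100% zero or NA: removed
    "GENE_C": [0.0, 0.0, 0.0, 0.0, 31.0],  # exactly 80%: kept
}
filtered = preprocess(expr)
```

Note that a gene with zero or NA values in exactly 80% of samples is retained, because the rule removes only genes exceeding that fraction.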
Data Policy
We adhere to all aspects of the data access and usage policies of the original studies. PINA users should strictly adhere to the NIH Genomic Data Sharing (GDS) Policy when using the RNA-seq and clinical data integrated into PINA, and to the CPTAC Data Use Agreement when using the proteomic data integrated into PINA.
Data Stats
mRNA expression datasets:
Dataset name | Cancer name | No. of patients* | No. of genes |
---|---|---|---|
TCGA-ACC | Adrenocortical carcinoma | 79 | 18,136 |
TCGA-BLCA | Bladder Urothelial Carcinoma | 408 | 18,558 |
TCGA-BRCA | Breast invasive carcinoma | 1,095 | 18,563 |
TCGA-CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma | 304 | 18,491 |
TCGA-CHOL | Cholangiocarcinoma | 36 | 18,377 |
TCGA-COAD | Colon adenocarcinoma | 451 | 18,039 |
TCGA-DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 48 | 18,158 |
TCGA-ESCA | Esophageal carcinoma | 184 | 18,963 |
TCGA-GBM | Glioblastoma multiforme | 154 | 18,473 |
TCGA-HNSC | Head and Neck squamous cell carcinoma | 520 | 18,671 |
TCGA-KICH | Kidney Chromophobe | 66 | 18,152 |
TCGA-KIRC | Kidney renal clear cell carcinoma | 533 | 18,588 |
TCGA-KIRP | Kidney renal papillary cell carcinoma | 290 | 18,331 |
TCGA-LAML | Acute Myeloid Leukemia | 173 | 16,731 |
TCGA-LIHC | Liver hepatocellular carcinoma | 371 | 18,419 |
TCGA-LUAD | Lung adenocarcinoma | 515 | 18,620 |
TCGA-LGG | Brain Lower Grade Glioma | 516 | 18,586 |
TCGA-LUSC | Lung squamous cell carcinoma | 501 | 18,777 |
TCGA-MESO | Mesothelioma | 87 | 18,488 |
TCGA-OV | Ovarian serous cystadenocarcinoma | 304 | 18,950 |
TCGA-PAAD | Pancreatic adenocarcinoma | 178 | 18,709 |
TCGA-PCPG | Pheochromocytoma and Paraganglioma | 179 | 18,318 |
TCGA-PRAD | Prostate adenocarcinoma | 497 | 18,710 |
TCGA-READ | Rectum adenocarcinoma | 160 | 18,040 |
TCGA-SARC | Sarcoma | 259 | 18,582 |
TCGA-SKCM | Skin Cutaneous Melanoma | 103 | 18,422 |
TCGA-STAD | Stomach adenocarcinoma | 415 | 18,972 |
TCGA-TGCT | Testicular Germ Cell Tumors | 150 | 19,270 |
TCGA-THCA | Thyroid carcinoma | 505 | 18,307 |
TCGA-THYM | Thymoma | 120 | 18,561 |
TCGA-UCEC | Uterine Corpus Endometrial Carcinoma | 532 | 17,629 |
TCGA-UCS | Uterine Carcinosarcoma | 57 | 18,918 |
TCGA-UVM | Uveal Melanoma | 80 | 17,679 |
Protein expression datasets:
Dataset name | Cancer name | No. of patients* | No. of proteins |
---|---|---|---|
CPTAC-CCRCC | Clear Cell Renal Cell Carcinoma | 110 | 9,445 |
CPTAC-COAD | Colon Adenocarcinoma | 97 | 7,057 |
CPTAC-EC | Endometrial Carcinoma | 100 | 10,418 |
CPTAC-GC | Gastric Cancer | 80 | 8,732 |
CPTAC-HCC | Hepatocellular Carcinoma | 159 | 9,682 |
CPTAC-LUAD | Lung Adenocarcinoma | 111 | 10,546 |
TCGA-BRCA | Breast Invasive Carcinoma | 105 | 9,747 |
TCGA-OV | Ovarian Serous Cystadenocarcinoma | 174 | 7,703 |
* Only primary tumors were included.