Data Types Collected by TCGA

The Cancer Genome Atlas (TCGA) collected many types of data for each of over 20,000 tumor and normal samples. Each step in the Genome Characterization Pipeline generated numerous data points, such as:

  • clinical information (e.g., smoking status)
  • molecular analyte metadata (e.g., sample portion weight)
  • molecular characterization data (e.g., gene expression values)

Below is supporting information and documentation for the different steps of molecular characterization.

Case Enrollment Documentation

Documents on case enrollment, follow-up, and other forms related to the intake of samples and clinical data are available from the Biospecimen Core Resource.

Sample Processing Documentation

TCGA used a compendium of standard operating procedures for processing tissues and other biological samples into molecular analytes for molecular characterization. These protocols are available from NCI's Biospecimen Research Database.

Summary of Data Types Collected

The data collected for a specific case in TCGA may have differed according to sample quality and quantity, cancer type, or technology available at the time of analysis. Below is a general summary of the types of clinical, molecular characterization, and other types of data that may have been generated for the different cancer types studied. 

Raw data (e.g., BAMs), germline and non-validated mutations, and genotypes are under controlled access (indicated in red). Derived data is available open access (exceptions are noted in the table below).
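The access rule above can be sketched as a small lookup. This is only an illustration: the category labels and the function name `default_access` are hypothetical and are not GDC field values, and the exceptions noted in the table are not modeled.

```python
# Hypothetical category labels chosen for illustration only.
# They mirror the controlled-access groups named in the text.
CONTROLLED_CATEGORIES = {
    "raw_data",                # e.g., BAM files
    "germline_mutation",
    "non_validated_mutation",
    "genotype",
}


def default_access(category: str) -> str:
    """Return the default access tier for a data category.

    Anything not listed as controlled is treated as derived data,
    which is open access by default. Per-row exceptions from the
    table are intentionally not handled here.
    """
    return "controlled" if category in CONTROLLED_CATEGORIES else "open"


if __name__ == "__main__":
    for name in ("raw_data", "genotype", "normalized_expression"):
        print(name, "->", default_access(name))
```

A real workflow would rely on the access level reported for each file by the GDC rather than on a hand-written mapping like this one.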

All data is available at the Genomic Data Commons (GDC), including TCGA publication supplemental and associated data files. Questions about locating or accessing data should be directed to the GDC support team. Notes for users of the archived TCGA Data Portal and Data Access Matrix are also available.
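As a minimal sketch of locating files at the GDC programmatically, the snippet below builds a query for the GDC REST API `files` endpoint without sending it. The project ID `TCGA-GBM`, the data category `Transcriptome Profiling`, and the helper names `build_files_query` and `query_url` are illustrative assumptions; check the GDC API documentation for the current field names and values before relying on them.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_category, access="open", size=10):
    """Build query parameters that select files by project, category, and access.

    The example values passed in below are assumptions for illustration.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id",
                                     "value": [project_id]}},
            {"op": "in", "content": {"field": "data_category",
                                     "value": [data_category]}},
            {"op": "in", "content": {"field": "access",
                                     "value": [access]}},
        ],
    }
    return {
        "filters": json.dumps(filters),  # the API expects the filter as a JSON string
        "fields": "file_id,file_name,data_format,access",
        "format": "JSON",
        "size": str(size),
    }


def query_url(params):
    """Return the full GET URL for the given query parameters."""
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    params = build_files_query("TCGA-GBM", "Transcriptome Profiling")
    print(query_url(params))
```

The URL can then be fetched with any HTTP client. Controlled-access files additionally require an authorized GDC token, which this sketch does not cover.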

Experimental protocols for each platform can be found in individual publications.

Summary of Data Types Collected
| Type | Subtype | Cancer Types Applicable | Description | Format | Notes |
|---|---|---|---|---|---|
| Clinical | Clinical data | All | Available clinical information (may include demographic information, treatment information, survival data, etc.) | XML (per patient), tab-delimited TXT (grouped "biotab" per cancer type) | Clinical data forms used by the TSS. Additional information in the Clinical Data Elements (CDE) Browser |
| Clinical | Biospecimen data | All | Information on how samples were processed by the Biospecimen Core Resource Center | XML (per patient), tab-delimited TXT (grouped "biotab" per cancer type) | Protocols used by the BCR for processing of samples. Additional information in the CDE Browser |
| Clinical | Pathology Reports | All | Pathology reports (for select cases) | PDF | |
| Copy Number | SNP microarray | All | | CEL, IDAT, tab-delimited TXT (raw values per SNP, copy number, and loss of heterozygosity), tab-delimited TXT (normalized values and purity/ploidy) | Probe information contained in array design files for each platform |
| Copy Number | Copy number microarray | GBM, OV, LUSC | | tab-delimited TXT (raw signals per probe), tab-delimited TSV (normalized values per aggregated region), MAT | Probe information contained in array design files for each platform |
| Copy Number | Low-Pass DNA Sequencing | Some tumor types | Low pass, whole genome sequencing of tumor and normal matched samples and analysis of differences in read counts between tumor and normal | BAM, VCF, tab-delimited TSV (normal v tumor calls) | |
| DNA | Whole exome | All | Whole exome sequencing of tumor and normal matched samples | BAM, VCF, MAF (mutation calls) | Germline mutation calls and unvalidated non-coding somatic variants are controlled-access |
| DNA | Whole genome | All | Whole genome sequencing for tumor and normal matched samples (for select cases) | BAM, VCF, MAF (mutation calls) | Germline mutation calls and unvalidated non-coding somatic variants are controlled-access |
| DNA | SNP microarray | All | | CEL, IDAT, tab-delimited TXT (raw values per SNP), tab-delimited TXT (genotypes per SNP) | Germline mutation calls and unvalidated non-coding somatic variants are controlled-access |
| DNA | Sequence traces | GBM, OV | Raw output from capillary sequencing technology | SCF, TR | May be available at NCBI Trace Archive |
| Imaging | Diagnostic image | All | Tissue images used to diagnose participant | SVS | |
| Imaging | Tissue image | All | Images of tissue samples from each participant that were used for TCGA analyses | SVS | |
| Imaging | Radiological image | Some | Pre-surgical radiological imaging (e.g., MRI, CT, PET) (for select cases) | DCM | Available at The Cancer Imaging Archive |
| Methylation | Bisulfite sequencing | Some tumor types | Whole genome sequencing performed after bisulfite treatment of tumor samples | BAM, VCF (methylation and mutation calls), BED (methylation calls per CpG site) | |
| Methylation | Bead array | All | | tab-delimited TXT (raw signal values, beta values, beta values mapped to genome), IDAT | Probe information contained in array design files for each platform |
| Microsatellite Instability | | COAD, READ, UCEC | Markers indicating presence or absence of an MSI shift, allele homozygosity/heterozygosity, and loss of heterozygosity observed in tumor samples | FSA, TXT (summary of trace file) | MSI classifications within clinical biotab files |
| miRNA | miRNA Sequencing | All except GBM | miRNA sequencing of tumor samples | BAM, tab-delimited TXT (normalized expression values per miRNA or isoform) | |
| miRNA | Array-based | GBM, OV | | TXT (raw signals per probe, normalized expression values per probe, gene, or exons) | Probe information contained in array design files for each platform |
| mRNA Expression | mRNA sequencing | All | mRNA sequencing of tumor samples using a poly(A) enrichment RNA preparation | BAM, TXT (normalized expression values per gene, isoform, exon, or splice junction) | May be labeled as RNASeqV1 and RNASeqv2 |
| mRNA Expression | Total RNA Sequencing | Some tumor types | mRNA sequencing of tumor samples using ribosomal depletion RNA preparation | BAM, TXT (normalized expression values per gene, isoform, exon, or splice junction) | May be labeled as TotalRNASeqV2 |
| mRNA Expression | Microarray | BRCA, COAD, GBM, KIRC, KIRP, LAML, LGG, LUAD, LUSC, OV, READ, UCEC | | CEL (raw signals per probe), TXT (raw signals per probe, normalized expression values per probe, gene, or exons) | Probe information contained in array design files for each platform |
| Protein Expression | Reverse-Phase Protein Array | All | High resolution images of protein array slides (up to 1000 participant tumor samples per slide) and raw signals per slide | TIFF, tab-delimited TXT (signal values, dilution curves, normalized expression values) | |
  • Posted: March 6, 2019

If you would like to reproduce some or all of this content, see Reuse of NCI Information for guidance about copyright and permissions. In the case of permitted digital reproduction, please credit the National Cancer Institute as the source and link to the original NCI product using the original product's title; e.g., “Data Types Collected by TCGA was originally published by the National Cancer Institute.”
