
Data Types Collected by TCGA

The Cancer Genome Atlas (TCGA) collected many types of data for each of over 20,000 tumor and normal samples. Each step in the Genome Characterization Pipeline generated numerous data points, such as:

  • clinical information (e.g., smoking status)
  • molecular analyte metadata (e.g., sample portion weight)
  • molecular characterization data (e.g., gene expression values)

Below is supporting information and documentation for the different steps of molecular characterization.

Case Enrollment Documentation

Documents on case enrollment and follow-up, along with other forms related to the intake of samples and clinical data, are available from the Biospecimen Core Resource.

Sample Processing Documentation

TCGA used a compendium of standard operating procedures for processing tissues and other biological samples into molecular analytes for molecular characterization. These protocols are available from NCI's Biospecimen Research Database.

Summary of Data Types Collected

The data collected for a specific case in TCGA may have differed according to sample quality and quantity, cancer type, or technology available at the time of analysis. Below is a general summary of the clinical, molecular characterization, and other data that may have been generated for the different cancer types studied.

Raw data (e.g., BAMs), germline and non-validated mutations, and genotypes are under controlled access. Derived data is available open access (exceptions are noted in the table below).

All data collected and processed by the program is available at the Genomic Data Commons (GDC), including TCGA publication supplemental and associated data files. Questions about locating or accessing data should be directed to the GDC support team. Resources for TCGA users and TCGA FAQs are available.
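As a rough illustration of locating TCGA files at the GDC, the sketch below builds query parameters for the GDC API's files endpoint, filtering by project and by access level (open or controlled). The helper name `build_gdc_file_query`, the default values, and the choice of returned fields are assumptions made for this example; consult the GDC API documentation for the authoritative list of fields and filters.

```python
import json

# Public GDC API endpoint for file metadata searches.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_file_query(project_id, access="open", size=10):
    """Build query parameters for a GDC file search (hypothetical helper).

    project_id: a GDC project identifier such as "TCGA-GBM".
    access: "open" or "controlled".
    size: maximum number of file records to return.
    """
    # GDC filters are nested JSON: an "and" of two "in" conditions.
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {"field": "access", "value": [access]},
            },
        ],
    }
    return {
        "filters": json.dumps(filters),
        "fields": "file_name,data_category,data_format,access",
        "format": "JSON",
        "size": str(size),
    }


# Example: parameters for open-access files in the TCGA glioblastoma project.
params = build_gdc_file_query("TCGA-GBM")
print(params["fields"])
```

The resulting dictionary can be sent as the query string of a GET request to `GDC_FILES_ENDPOINT` with any HTTP client. The request itself is omitted here so the example runs without network access.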

Experimental protocols for each platform can be found in individual publications.

| Type | Subtype | Cancer Types Applicable | Description | Format | Notes |
|---|---|---|---|---|---|
| Clinical | Clinical data | All | Available clinical information (may include demographic information, treatment information, survival data, etc.) | XML (per patient), tab-delimited TXT (grouped "biotab" per cancer type) | Clinical data forms used by the TSS; additional information in the Clinical Data Elements (CDE) Browser |
| Clinical | Biospecimen data | All | Information on how samples were processed by the Biospecimen Core Resource Center | XML (per patient), tab-delimited TXT (grouped "biotab" per cancer type) | Protocols used by the BCR for processing of samples; additional information in the CDE Browser |
| Clinical | Pathology Reports | All | Pathology reports (for select cases) | PDF | |
| Copy Number | SNP microarray | All | | CEL, IDAT, tab-delimited TXT (raw values per SNP, copy number, and loss of heterozygosity), tab-delimited TXT (normalized values and purity/ploidy) | Probe information contained in array design files for each platform |
| Copy Number | Copy number microarray | GBM, OV, LUSC | | tab-delimited TXT (raw signals per probe), tab-delimited TSV (normalized values per aggregated region), MAT | Probe information contained in array design files for each platform |
| Copy Number | Low-Pass DNA Sequencing | Some tumor types | Low-pass whole genome sequencing of tumor and normal matched samples and analysis of differences in read counts between tumor and normal | BAM, VCF, tab-delimited TSV (normal v tumor calls) | |
| DNA | Whole exome | All | Whole exome sequencing of tumor and normal matched samples | BAM, VCF, MAF (mutation calls) | Germline mutation calls and unvalidated non-coding somatic variants are controlled-access |
| DNA | Whole genome | All | Whole genome sequencing for tumor and normal matched samples (for select cases) | BAM, VCF, MAF (mutation calls) | Germline mutation calls and unvalidated non-coding somatic variants are controlled-access |
| DNA | SNP microarray | All | | CEL, IDAT, tab-delimited TXT (raw values per SNP), tab-delimited TXT (genotypes per SNP) | Germline mutation calls and unvalidated non-coding somatic variants are controlled-access |
| DNA | Sequence traces | GBM, OV | Raw output from capillary sequencing technology | SCF, TR | May be available at NCBI Trace Archive |
| Imaging | Diagnostic image | All | Whole slide images of tissue used to diagnose participant | SVS | Available at the GDC, open access |
| Imaging | Tissue image | All | Whole slide images of tissue samples from each participant that were used for TCGA analyses | SVS | Available at the GDC, open access |
| Imaging | Radiological image | Some | Pre-surgical radiological imaging (e.g., MRI, CT, PET) (for select cases) | DCM | Available at The Cancer Imaging Archive, open access |
| Methylation | Bisulfite sequencing | Some tumor types | Whole genome sequencing performed after bisulfite treatment of tumor samples | BAM, VCF (methylation and mutation calls), BED (methylation calls per CpG site) | |
| Methylation | Bead array | All | | tab-delimited TXT (raw signal values, beta values, beta values mapped to genome), IDAT | Probe information contained in array design files for each platform |
| Microsatellite Instability | | COAD, READ, UCEC | Markers indicating presence or absence of an MSI shift, allele homozygosity/heterozygosity, and loss of heterozygosity observed in tumor samples | FSA, TXT (summary of trace file) | MSI classifications within clinical biotab files |
| miRNA | miRNA Sequencing | All except GBM | miRNA sequencing of tumor samples | BAM, tab-delimited TXT (normalized expression values per miRNA or isoform) | |
| miRNA | Array-based | GBM, OV | | TXT (raw signals per probe, normalized expression values per probe, gene, or exons) | Probe information contained in array design files for each platform |
| mRNA Expression | mRNA sequencing | All | mRNA sequencing of tumor samples using a poly(A) enrichment RNA preparation | BAM, TXT (normalized expression values per gene, isoform, exon, or splice junction) | May be labeled as RNASeqV1 and RNASeqv2 |
| mRNA Expression | Total RNA Sequencing | Some tumor types | mRNA sequencing of tumor samples using ribosomal depletion RNA preparation | BAM, TXT (normalized expression values per gene, isoform, exon, or splice junction) | May be labeled as TotalRNASeqV2 |
| mRNA Expression | Microarray | BRCA, COAD, GBM, KIRC, KIRP, LAML, LGG, LUAD, LUSC, OV, READ, UCEC | | CEL (raw signals per probe), TXT (raw signals per probe, normalized expression values per probe, gene, or exons) | Probe information contained in array design files for each platform |
| Protein Expression | Reverse-Phase Protein Array | All | High resolution images of protein array slides (up to 1000 participant tumor samples per slide) and raw signals per slide | TIFF, tab-delimited TXT (signal values, dilution curves, normalized expression values) | |
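
Many of the formats listed above (biotab clinical files, MAF mutation calls, normalized expression values) are tab-delimited text. The following minimal sketch, using only the Python standard library, counts mutation records per gene in MAF-style text. It assumes the file has a header row containing a `Hugo_Symbol` column and that comment lines begin with `#`. The function name `count_mutations_per_gene` and the sample rows are made up for illustration and are not real TCGA data.

```python
import csv
import io
from collections import Counter


def count_mutations_per_gene(maf_text):
    """Count rows per Hugo_Symbol in MAF-style tab-delimited text."""
    # Drop blank lines and '#' comment lines (e.g., a version header).
    lines = [
        line for line in maf_text.splitlines()
        if line and not line.startswith("#")
    ]
    # The first remaining line is treated as the column header.
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    return Counter(row["Hugo_Symbol"] for row in reader)


# Illustrative rows only; gene and sample names are placeholders.
example_maf = (
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "GENE_A\tMissense_Mutation\tSAMPLE-01\n"
    "GENE_A\tNonsense_Mutation\tSAMPLE-02\n"
    "GENE_B\tSilent\tSAMPLE-01\n"
)

counts = count_mutations_per_gene(example_maf)
print(counts.most_common())  # [('GENE_A', 2), ('GENE_B', 1)]
```

The same pattern applies to other tab-delimited TXT files in the table, provided the column name used for grouping is changed to one present in that file's header.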

If you would like to reproduce some or all of this content, see Reuse of NCI Information for guidance about copyright and permissions. In the case of permitted digital reproduction, please credit the National Cancer Institute as the source and link to the original NCI product using the original product's title; e.g., “Data Types Collected by TCGA was originally published by the National Cancer Institute.”
