Happy 3rd Birthday, Genomic Data Commons: Continued Data and Tool Growth

May 28, 2019, by Louis M. Staudt, M.D., Ph.D.

NCI's Genomic Data Commons turns 3 years old in June.

The Genomic Data Commons, NCI’s molecular and clinical data sharing and analysis system, launched nearly three years ago. Though time seems to pass quickly (can it be that three years in software development go by faster than three human years?), we’ve made significant progress in adding data, improving our infrastructure, and developing new tools for all types of researchers to use.

Continuing to Develop Best-in-Practice Pipelines

Best-in-practice processing pipelines for the most common molecular characterization platforms are a core necessity for the GDC. We’ve refined our pipelines for whole-genome and RNA sequencing and for calling copy number variants. On multiple occasions, our bioinformaticians have even caught bugs in widely used software.

We’ve also updated our targeted sequencing pipeline and germline masking strategy for calling tumor-only mutations and added a new workflow for methylation array data. New pipelines we are working on include detecting gene fusions and microsatellite instability.

These pipelines are vital, but I’m keenly aware of the additional need for visualizations and analysis tools to help researchers use the processed data. To this end, we’ve added several visualizations for copy number variants in the GDC's Data Analysis, Visualization, and Exploration (DAVE) tools, including a way to view them in combination with small-scale substitutions and indels in OncoGrid.

Summary of Data Added in the Last Year

We’ve added or updated numerous data sets in the last year. Notably, we’ve released data from large-scale studies such as NCI's CPTAC. The number of cases and types of data available for NCI’s TARGET project has also steadily grown.

We’ve done some further data structure “remodeling” to accommodate the new, complex data coming in. We’ve also incorporated standardized cancer terminology codes. For example, TCGA samples are now tagged with ICD-O-3 codes for fields such as primary site, disease type, and diagnosis type.

We’re working with more groups to submit and harmonize even more data. I applaud all of the groups that have committed to data sharing. One of our long-term goals is to streamline this process so that there is a straightforward, standardized approach no matter how unique your samples or molecular experiments are.

Data Additions and Updates to the Genomic Data Commons
Project	Notes
APOLLO	Targeted sequencing from Applied Proteogenomics OrganizationaL Learning and Outcomes, NCI’s collaboration with the Department of Defense and the Department of Veterans Affairs
BEATAML-CRENANOLANIB (coming soon)	WXS alignments and somatic mutations for 50+ cases of relapsed or refractory acute myeloid leukemia
CPTAC	Collaboration with NCI’s Clinical Proteomic Tumor Analysis Consortium, producing WGS, WXS, and RNA-Seq for 322 cases
DLBCL	Targeted sequencing and RNA-Seq for 534 diffuse large B-cell lymphoma cases
HCMI (coming soon)	WXS, WGS, RNA-Seq, and clinical data for an initial batch of cancer models from NCI’s Human Cancer Models Initiative
MMRF (coming soon)	Collaboration with the Multiple Myeloma Research Foundation. WXS and RNA-Seq for ~1000 cases of multiple myeloma, including longitudinal genomic and clinical data
TARGET	WGS updates and additional cases of neuroblastoma, acute myeloid leukemia (coming soon), acute lymphoblastic leukemia, and Wilms tumor (coming soon); 652 childhood cancer cases in total
TCGA	Updates to TCGA clinical data, including International Classification of Diseases (ICD) codes and treatment type; aligned reads from the ATAC-Seq study

Coming Soon: MMRF, Clinical Data Exploration

We are preparing to release data from about 1,000 cases of multiple myeloma in partnership with the Multiple Myeloma Research Foundation. I’m especially excited about this data because it includes genomic and clinical data from multiple timepoints for patients. This marks our first addition of longitudinal data, which lets researchers ask questions about how a disease progresses or responds to different forms of therapy.

A major part of the vision for GDC has been to help researchers explore and “play” with the data. We added DAVE two years ago for genomic data, and we are soon adding a counterpart for clinical data. Users will be able to directly view what clinical variables are available, use them to filter cases and build synthetic cohorts in a more intuitive manner, and see how they affect overall and progression-free survival. Users will also be able to customize histogram bins and bar charts right in the web browser.

Find GDC at ASCO 2019

Experts from the GDC will be at the American Society of Clinical Oncology Annual Meeting held at McCormick Place in Chicago, IL this year to answer questions, help you find data, or even provide personal tours of our data portal. As always, we look forward to your feedback to help make the GDC better for the research community.

Find the GDC at the NCI Booth #4075
Saturday, June 1 to Monday, June 3
9AM - 5PM
