Happy 3rd Birthday, Genomic Data Commons: Continued Data and Tool Growth
By Louis M. Staudt, M.D., Ph.D.
The Genomic Data Commons (GDC), NCI’s molecular and clinical data sharing and analysis system, launched nearly three years ago. Time seems to pass quickly (can it be that three years in software development go by faster than three human years?), and in that time we’ve made significant progress in adding data, improving our infrastructure, and developing new tools for all types of researchers to use.
Continuing to Develop Best-in-Practice Pipelines
Best-in-practice processing pipelines for the most common molecular characterization platforms are a core necessity for GDC. We’ve refined our pipelines for whole-genome sequencing, RNA sequencing, and copy number variant calling. On multiple occasions, our bioinformaticians have even caught bugs in widely used software.
We’ve also updated our targeted sequencing pipeline and germline masking strategy for calling tumor-only mutations and added a new workflow for methylation array data. New pipelines we are working on include detecting gene fusions and microsatellite instability.
These pipelines are vital, but I’m keenly aware of the additional need for visualizations and analysis tools to help researchers utilize the processed data. To this end, we’ve added several visualizations for copy number variants in GDC's Data Analysis, Visualization, and Exploration (DAVE) tools, including a way to view them in combination with small-scale substitutions and indels in Oncogrid.
Summary of Data Added in the Last Year
We’ve added or updated numerous data sets in the last year. Notably, we’ve released data from large-scale studies such as NCI's CPTAC. The number of cases and types of data available for NCI’s TARGET project has also steadily grown.
We’ve done some further data structure “remodeling” to accommodate the new, complex data coming in. We’ve also incorporated standardized cancer terminology codes. For example, TCGA samples are now tagged by ICD-O-3 codes such as primary site, disease type, and diagnosis type.
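As a rough illustration of how such tags can be used, the sketch below builds a query against the public GDC API that selects cases by project and primary site. The post itself does not describe the API, so treat the details as assumptions: the endpoint URL, the field names `project.project_id`, `primary_site`, and `disease_type`, and the example values `TCGA-BRCA` and `Breast` are illustrative, and the helper functions are hypothetical. The code only constructs the request URL and does not contact the server.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the public GDC API for case-level queries.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_case_filter(project_id, primary_site):
    """Return a filter that matches cases in one project with one primary site.

    Hypothetical helper: the nested "op"/"content" layout follows the
    filter format assumed for the GDC API.
    """
    return {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "project.project_id", "value": [project_id]}},
            {"op": "in",
             "content": {"field": "primary_site", "value": [primary_site]}},
        ],
    }


def build_query_url(filters, fields, size=10):
    """Encode the filter and the requested fields as URL query parameters."""
    params = {
        "filters": json.dumps(filters),   # filters travel as a JSON string
        "fields": ",".join(fields),       # comma-separated list of fields
        "format": "JSON",
        "size": str(size),                # maximum number of cases returned
    }
    return GDC_CASES_ENDPOINT + "?" + urlencode(params)


# Illustrative values only; substitute a project and site of interest.
example_filter = build_case_filter("TCGA-BRCA", "Breast")
example_url = build_query_url(
    example_filter, ["submitter_id", "primary_site", "disease_type"]
)
print(example_url)
```

Fetching the printed URL with any HTTP client should return matching cases as JSON, provided the assumed field names are still current.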
We’re working with more groups to submit and harmonize even more data. I applaud all of these groups who have committed to data sharing. One of our long-term goals is to streamline this process so that there is a straightforward, standardized approach no matter how unique your samples or molecular experiments are.
|APOLLO||Targeted sequencing from the Applied Proteogenomics OrganizationaL Learning and Outcomes network, NCI’s collaboration with the Department of Defense and the Department of Veterans Affairs|
| ||WXS alignments and somatic mutations for 50+ cases of relapsed or refractory acute myeloid leukemia|
|CPTAC||Collaboration with NCI’s Clinical Proteomic Tumor Analysis Consortium, producing WGS, WXS, and RNA-Seq for 322 cases|
|DLBCL||Targeted sequencing and RNA-Seq for 534 diffuse large B-cell lymphoma cases|
|HCMI*||WXS, WGS, RNA-Seq, and clinical data for an initial batch of cancer models from NCI’s Human Cancer Models Initiative|
|MMRF*||Collaboration with the Multiple Myeloma Research Foundation. WXS and RNA-Seq for ~1000 cases of multiple myeloma, including longitudinal genomic and clinical data|
|TARGET||WGS updates; additional cases of neuroblastoma, acute myeloid leukemia*, acute lymphoblastic leukemia, and Wilms tumor*; 652 childhood cancer cases in total|
|TCGA||Updates to TCGA clinical data, including International Classification of Diseases (ICD) codes and treatment type; aligned reads from the ATAC-Seq study*|
Coming Soon: MMRF, Clinical Data Exploration
We are preparing to release data from about 1,000 multiple myeloma cases in partnership with the Multiple Myeloma Research Foundation. I’m especially excited about this data set because it includes genomic and clinical data from multiple timepoints for patients. This marks our first addition of longitudinal data, which will let researchers ask how a disease progresses or responds to different forms of therapy.
A major part of the vision for GDC has been to help researchers explore and “play” with the data. We added DAVE two years ago for genomic data, and we are soon adding a counterpart for clinical data. Users will be able to directly view what clinical variables are available, use them to filter cases and build synthetic cohorts in a more intuitive manner, and see how they affect overall and progression-free survival. Users will also be able to customize histogram bins and bar charts right in the web browser.
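To make the idea of custom histogram bins concrete, here is a minimal sketch of binning a clinical variable with user-chosen bin edges. It is not the portal's implementation: the function `bin_values`, the variable (age at diagnosis), the ages, and the bin edges are all invented for illustration.

```python
from bisect import bisect_right


def bin_values(values, edges):
    """Count values in half-open bins [edges[i], edges[i+1]).

    `edges` must be sorted in ascending order. Values outside the
    overall range [edges[0], edges[-1]) are ignored.
    """
    labels = [f"{lo}-{hi}" for lo, hi in zip(edges, edges[1:])]
    counts = {label: 0 for label in labels}
    for value in values:
        if edges[0] <= value < edges[-1]:
            # bisect_right gives the index of the first edge greater than
            # the value, so subtracting one yields the bin index.
            index = bisect_right(edges, value) - 1
            counts[labels[index]] += 1
    return counts


# Invented ages at diagnosis, for illustration only.
ages = [34, 47, 52, 58, 61, 66, 73, 79]
custom_edges = [30, 50, 65, 80]

histogram = bin_values(ages, custom_edges)
print(histogram)  # {'30-50': 2, '50-65': 3, '65-80': 3}
```

Changing `custom_edges` regroups the same cases, which is the kind of adjustment a browser-based tool would let a user make interactively.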
Find GDC at ASCO 2019
Experts from the GDC will be at the American Society of Clinical Oncology Annual Meeting held at McCormick Place in Chicago, IL this year to answer questions, help you find data, or even provide personal tours of our data portal. As always, we look forward to your feedback to help make the GDC better for the research community.
Find the GDC at the NCI Booth #4075
Saturday, June 1 to Monday, June 3
9AM - 5PM