Build a National Cancer Data Ecosystem

NCI has announced several funding opportunities that align with the Cancer Moonshot.

Databases and analytic tools have been an integral part of cancer research for decades. Recently these data have also become a central part of cancer care as we look to further tailor treatments using precision medicine.

Sharing and integrating data have become chief priorities for NCI as we seek to develop a robust infrastructure to ensure that everyone—researchers, clinicians, and patients—has a way to collaborate and share their collective data and knowledge about cancer.

The goal of this recommendation is to develop a National Cancer Data Ecosystem that enables and encourages all participants across the cancer research and care continuum to share, access, combine, and analyze diverse data, increasing the potential for new discoveries and reducing the burden of cancer.

The Cancer Data Ecosystem will be supported by a cloud-based infrastructure and will feature interactive portals that give users access to these data and allow for in-depth data analysis. This infrastructure will enable researchers, patients, and clinicians to incorporate their own data, fostering collaboration and advancing discoveries that improve our understanding of the mechanisms driving cancer—ultimately leading to more informed treatment choices and better patient outcomes.

NCI Cancer Research Data Commons (CRDC)

Some of this ecosystem is already underway. The NCI CRDC is a virtual data science infrastructure that connects cancer research data collections with analytical tools, leveraging the elastic computing power of the cloud. The CRDC is just one component of this broader Cancer Data Ecosystem and is central to NCI’s activities that support the Blue Ribbon Panel (BRP) recommendation. The CRDC can be used to store, analyze, share, and visualize cancer research data to improve our understanding of cancer. This initiative includes projects specifically aligned with the objectives of the National Cancer Data Ecosystem called for by the Cancer Moonshot, including:

  • The NCI Genomic Data Commons (GDC) is a resource for sharing genomic and clinical data to create a more complete understanding of genetic drivers of cancer.
  • The Proteomic Data Commons (PDC) is a resource for sharing and analyzing proteomic data. The PDC is populated with data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) program and will grow to include other data sources over time.
  • The Data Commons Framework is a set of modular components that can be used across the CRDC. It includes user authentication and authorization to keep the data safe and secure, and digital identifiers, which allow researchers to study disparate data types across data nodes.
  • The NCI Cloud Resources allow researchers to access and analyze large-scale genomic, proteomic, and imaging data in the cloud using a variety of analytic tools and pipelines, without the need to download data to their local computer. The Cloud Resources provide researchers with secure workspaces, where they can store the results of their analyses, and optionally share them with other scientists, to foster greater collaboration and new discoveries.

CRDC also includes new projects specifically supported by Moonshot funding, including:

  • A Cancer Data Service will provide access to genomic data from NCI-funded research that is not currently available in the GDC.
  • The Imaging Data Commons (IDC) will be a resource for sharing and analyzing multi-modal imaging data from clinical and basic cancer research studies. The IDC will build on Google-provided tools such as BigQuery and the Google Healthcare API. It will also be fully scalable, to allow for the development of new tools and functionality to further enhance the IDC platform as new cancer research use cases are identified.
  • The Center for Cancer Data Harmonization (CCDH) will facilitate interoperability of the data across the CRDC. The CCDH team will support researchers who are submitting data to the CRDC and define common standards to allow researchers to search diverse data types and repositories.
  • The Cancer Data Aggregator (CDA) will allow researchers to combine data from diverse scientific domains and perform integrated analysis that can be shared with collaborators. The CDA will include tools, such as the use of common terms, to allow users to search and analyze data from different repositories.

In time, the CRDC will also provide access to other cancer research data, including information gleaned from animal models, immuno-oncology, and epidemiological cohorts.

Ultimately, NCI’s CRDC infrastructure and related resources will allow researchers, clinicians, and patients to share important data and resources to advance cancer research.

NCI is supporting several other research projects with Cancer Moonshot funding that strengthen the cancer research community's ability to share and analyze data.

Privacy Preserving Patient Record Linkage Software

An important aspect of data sharing and the National Cancer Data Ecosystem is the ability to link data at the patient level across disparate data sources while protecting patient privacy and personal information. NCI is evaluating approaches for generating unique patient identifiers that will enable linkage of patient-level data from different sources without sharing identifiable information beyond those organizations that are authorized to hold such information. The software-generated identifiers are intended to allow cancer patients’ data to be shared with the cancer research community without disclosing patient identities or private information.
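As an illustration of the general idea only, the sketch below derives a linkage token from a keyed hash (HMAC-SHA256) of normalized identifying fields. This is not the approach NCI selected, which the text above does not specify. The field choice (name and birth date), the `normalize` and `linkage_token` helpers, and the shared secret key are all assumptions made for the example. Production record-linkage systems typically also need to tolerate typos and missing fields, which this sketch does not attempt.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Lowercase the value and drop accents, spaces, and punctuation so that
    trivial formatting differences do not change the resulting token."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def linkage_token(first_name: str, last_name: str, birth_date: str,
                  secret_key: bytes) -> str:
    """Return a hex token derived from identifying fields (hypothetical helper).

    The key is assumed to be held only by organizations authorized to see
    identifiable data; without it, tokens cannot be recomputed from guesses.
    """
    message = "|".join(normalize(p) for p in (first_name, last_name, birth_date))
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Two sites holding the same (made-up) patient with different formatting.
KEY = b"example-shared-secret"  # placeholder value for illustration only
site_a = linkage_token("María", "O'Brien", "1970-01-31", KEY)
site_b = linkage_token("maria ", "OBRIEN", "1970-01-31", KEY)
records_match = site_a == site_b  # the tokens can be compared without names
```

Only the tokens would leave each site, so records can be joined on the token while names and birth dates stay with the organizations authorized to hold them.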

NCI Office of Data Sharing

The NCI Office of Data Sharing (ODS) was created specifically to improve the processes for submitting data to, and accessing data in, online databases. ODS also raises awareness of the Cancer Data Ecosystem through education and outreach, and has helped develop and implement the Cancer Moonshot public access and data sharing policy.

NCI Genomics Evidence Neoplasia Information Exchange (GENIE) Supplements

NCI is promoting genomic and clinical data sharing by cancer centers through supplements to those that are part of the GENIE consortium. The American Association for Cancer Research (AACR) Project GENIE (Genomics Evidence Neoplasia Information Exchange), which includes NCI-Designated Cancer Centers, is working to link genomic data with clinical outcomes from thousands of cancer patients. The program is also establishing standards for collecting and integrating clinical data from cancer patients. Through these supplements, GENIE genomic and clinical data will be shared with the GDC to further increase access to this information for researchers outside the GENIE consortium.

Moonshot APOLLO Imaging Project

NCI's The Cancer Imaging Archive (TCIA) will provide access to radiology and digitized pathology imaging data with extracted annotations and imaging features that will be linked to proteogenomic and clinical data in other CRDC repositories. These data will be generated as part of the Moonshot Applied Proteogenomics OrganizationaL Learning and Outcomes (APOLLO) project, a collaboration between NCI, the Department of Defense, and the Department of Veterans Affairs. This project will provide a unique cross-disciplinary resource to accelerate the application of proteogenomics to patient care.

De-identification of Narrative Text Clinical Documents

The majority of patients’ medical records are in a narrative text format. These clinical documents hold valuable information for cancer researchers but are not easily accessible because of the personally identifiable information (PII) that must be removed prior to sharing data with the research community. NCI is evaluating narrative text de-identification systems to select reliable tool(s) that can be used within the National Cancer Data Ecosystem and across NCI to successfully de-identify clinical documents for research use.
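To make the de-identification task concrete, here is a minimal rule-based sketch that replaces a few pattern-shaped identifiers in a made-up clinical note with placeholder tags. It does not represent any system NCI is evaluating. The `PATTERNS` list and the `redact` function are hypothetical, and the patterns cover only simple US-style formats. Regular expressions alone cannot reliably catch names, addresses, or free-form dates, which is part of why dedicated de-identification tools need careful evaluation.

```python
import re

# Illustrative patterns only; real clinical text needs far broader coverage.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
    ("MRN", re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE)),
]


def redact(text: str) -> str:
    """Replace each pattern match with a bracketed tag such as [DATE]."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


# A fabricated note; none of these values refer to a real person.
note = "Pt seen 03/14/2019, MRN: 00123456. Call 301-555-0100 or jdoe@example.org."
clean = redact(note)
# clean == "Pt seen [DATE], [MRN]. Call [PHONE] or [EMAIL]."
```

A note containing a patient name would pass through this sketch unchanged, so a rule-based pass like this is at most one component of a de-identification pipeline.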


