Bioinformatics and Cancer
The volume of biological data collected during the course of biomedical research has exploded, thanks in large part to powerful new research technologies.
The availability of these data, and the insights they may provide into the biology of disease, has many in the research community excited about the possibility of expediting progress toward precision medicine—that is, tailoring prevention, diagnosis, and treatment based on the molecular characteristics of a patient’s disease.
Mining the sheer volume of "Big Data" to answer the complex biological questions that will bring precision medicine into the mainstream of clinical care, however, remains a challenge.
Nowhere is this challenge more evident than in oncology. By some estimates, next-generation sequencing of patient genomes alone will produce one exabyte—one quintillion, or 10¹⁸, bytes—of data annually by the end of 2017. Much of these data will come from studies of patients with cancer.
Seeking Answers from Big Data in the Era of Precision Medicine
The focus of bioinformatics is establishing an infrastructure to store, analyze, integrate, and visualize large amounts of biological data and related information, as well as providing access to those data.
Bioinformatics uses advanced computing, mathematics, and a variety of technological platforms to physically store, manage, analyze, and understand these data.
Currently, researchers use many different tools and platforms to store and analyze the biological data they collect during the course of their research, including data from whole genome sequencing, advanced imaging studies, comprehensive analyses of the proteins in biological samples, and clinical annotations.
Integrating and analyzing data from these various platforms is often difficult, however. Researchers also frequently lack access to the raw or primary data generated by other studies, or the computational tools and infrastructure needed to integrate and analyze them.
In recent years, there has been a boom in the use of virtual repositories—“data clouds”—as a way to integrate and improve access to research data. Many of these efforts are still in their early stages, and questions remain about the optimal way to organize and coordinate clouds and their use.
NCI’s Role in Bioinformatics
The National Cancer Informatics Program (NCIP), part of NCI’s Center for Biomedical Informatics and Information Technology (CBIIT), oversees the institute’s bioinformatics-related initiatives.
NCIP is involved in numerous research areas, including genomics and clinical and translational studies, as well as efforts to improve data sharing, analysis, and visualization. For instance, NCIP operates NCIP Hub, a centralized resource designed to create a community space that promotes learning and the sharing of data and bioinformatics tools among cancer researchers. NCIP Hub is itself an experiment to see whether the cancer research community finds the program's social and community features useful for team science and multi-investigator research.
Under The Cancer Genome Atlas (TCGA), a research program supported by NCI and the National Human Genome Research Institute, researchers have conducted comprehensive molecular analyses of tumor and healthy tissue samples from more than 11,000 patients. More than 1,000 studies have been published based on data collected by TCGA.
Similarly, under NCI’s TARGET program, researchers have identified genetic alterations in pediatric cancers, using samples collected largely from children enrolled in clinical trials conducted by the Children’s Oncology Group.
Data from these two initiatives and other NCI-supported studies have helped researchers better understand the biology of different cancers and identify potential new targets for therapies.
In some respects, however, these studies have only scratched the surface of what can be learned from the vast amount of data collected as part of this research. As a result, there has been a new push in the research community to find ways to make these data, and the tools to analyze them, more widely accessible.
Democratizing Big Data
As a federal agency, NCI is uniquely positioned to democratize access to cancer research data. The institute has launched several initiatives to provide researchers with easier access to data from TCGA, TARGET, and other NCI-funded research, and the resources to analyze the data.
Initiated in late 2014, the NCI Genomic Data Commons (GDC) will provide a single source for data from these initiatives and other cancer research projects, along with the analytical tools needed to mine them. NCI is creating the GDC following a recommendation by the Institute of Medicine for a centralized “knowledge system” for cancer.
When it is publicly launched in mid-2016, the GDC will include data from TCGA and TARGET. However, it will be expanded over the next several years to include data from cancer research projects conducted by individual researchers or research teams. The GDC will provide the cancer genomics repository for projects falling under the NIH Genomic Data Sharing Policy.
NCI is also funding several pilot programs that will use cloud technology to provide researchers with access to genomic and other data from NCI-funded studies. These NCI Cancer Genomics Cloud Pilots will be used to explore innovative methods for accessing, sharing, and analyzing molecular data.
Each pilot, implemented through commercial cloud providers, will operate under common standards but have distinct designs and means of sharing data and analytical tools, with the goal of identifying the most effective means for using cloud technology to advance cancer research.