Cancer Informatics: Expanding Access to Cancer Genomics Data

December 2, 2015, by Amy E Blum, M.A.

On Tuesday, November 10, 2015, Louis M. Staudt, M.D., Ph.D., Director of NCI’s Center for Cancer Genomics, and Warren Kibbe, Ph.D., Director of the Center for Biomedical Informatics and Information Technology (CBIIT), held a live discussion moderated by Anthony Kerlavage, Ph.D., Chief of the CBIIT Center for Cancer Informatics Branch. In the Google Hangout on Air, titled “Cancer Informatics: Expanding Access to Data,” Drs. Staudt and Kibbe discussed the NCI’s major initiatives to improve both the accessibility and usability of cancer genomics data.

Top among the NCI’s priorities are two initiatives. The Genomic Data Commons (GDC) is a service that will bring together and harmonize cancer genomic data sets from NCI programs like TCGA and TARGET as well as from non-NCI data submitters. The Cancer Genomics Cloud Pilots are cloud computing infrastructure that will co-locate secure data access with computational capacity and analysis tools to democratize access to genomic data for the cancer research community. Watch the Google Hangout with Drs. Staudt and Kibbe to learn more about the NCI’s initiatives.

The NCI also opened the discussion to the public, encouraging viewers to ask questions on Twitter using the hashtag #AskNCI. The questions that were not answered during the live discussion are answered below:

Q: What exactly does “harmonize the data” mean?

A: Data harmonization refers to the process by which data sets that were originally compiled using different protocols, standards, and software are reconfigured with a common bioinformatics pipeline so that they are compatible with each other. Previously, data sets generated by projects such as TCGA and TARGET could not be directly compared. Harmonizing the data uploaded to the GDC will allow researchers to access all of the NCI-generated data together in a unified format and to perform analyses across these data sets.

Q: Can you turn all the TCGA RNAseq data into wiggle tracks? Never understood why this wasn't done (contrast with ENCODE for example).

A: For RNAseq, TCGA provides both raw sequence data, such as FASTQ and BAM files, and higher-level data, such as transcript- and gene-level expression. The raw sequence data and the higher-level data serve different types of users, and the WIG file, or wiggle track, falls somewhere in between. However, WIG files can be generated from BAM files using simple open-source tools. For DNAseq, this has been partially done.

To offer this information to researchers, the GDC will provide a “BAM slicing” capability, available to users who are authorized through dbGaP to access controlled-access data. With this function, a user can “slice” a portion of a BAM file, which includes both sequence and read coverage information and therefore provides more data than would be found in a WIG file. The BAM slice can then be used to generate a WIG file if the researcher chooses. The GDC team is also considering adding WIG files to the GDC for users who work exclusively with open-access data.
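
As a rough illustration of the slice-then-convert idea described above, the following Python sketch works on a toy list of aligned reads rather than a real BAM file. The `Read` type and the `slice_reads`, `coverage`, and `to_wig` functions are hypothetical names invented for this example; they are not part of the GDC or of any existing tool. In practice, a researcher would more likely run an established open-source tool on the BAM slice itself.

```python
from typing import List, NamedTuple


class Read(NamedTuple):
    """Toy stand-in for one aligned read (hypothetical; a real BAM record holds far more)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def slice_reads(reads: List[Read], chrom: str, start: int, end: int) -> List[Read]:
    """Keep only the reads that overlap the half-open region [start, end) on chrom."""
    return [r for r in reads if r.chrom == chrom and r.start < end and r.end > start]


def coverage(reads: List[Read], chrom: str, start: int, end: int) -> List[int]:
    """Count how many reads cover each base in [start, end) on chrom."""
    depth = [0] * (end - start)
    for r in reads:
        if r.chrom != chrom:
            continue
        # Clip each read to the region before counting.
        for pos in range(max(r.start, start), min(r.end, end)):
            depth[pos - start] += 1
    return depth


def to_wig(chrom: str, start: int, depth: List[int]) -> str:
    """Render per-base depths as a fixedStep wiggle track (WIG positions are 1-based)."""
    lines = [
        "track type=wiggle_0",
        f"fixedStep chrom={chrom} start={start + 1} step=1",
    ]
    lines.extend(str(d) for d in depth)
    return "\n".join(lines)


# Made-up reads for illustration only.
reads = [
    Read("chr1", 0, 5),
    Read("chr1", 3, 8),
    Read("chr1", 20, 25),
    Read("chr2", 0, 5),
]

# "Slice" the region chr1:[2, 7), then derive a coverage track from the slice.
sliced = slice_reads(reads, "chr1", 2, 7)
depth = coverage(sliced, "chr1", 2, 7)
wig = to_wig("chr1", 2, depth)
print(wig)
```

The sketch also shows why a slice carries more information than a wiggle track: the sliced reads keep their individual alignments, whereas the WIG output reduces them to one depth value per position.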

Q: Cell lines are awesome but clinical trials and survival endpoints are many times more valuable. Any push within NIH/NCI to encourage release of usable clinical or response covariates with datasets?

A: All of the clinical data for all of the NCI-generated datasets produced to date have already been released. Going forward, the new CCG initiatives will use tissue sample sets with thorough clinical annotation, mostly derived from completed clinical trials, so substantial clinical information will accompany the molecular data. CCG programs with robust clinical components include the Clinical Trials Sequencing Program (CTSP), Exceptional Responders (ER), ALCHEMIST, and others. The clinical and accompanying molecular data for these programs will be uploaded to the GDC, enabling the cancer research community to investigate heretofore unresolved clinical questions.

Q: Really excited about accessible clinical trial data. How can we use that data to improve our trials? Can we self-improve?

A: As mentioned above, one very important goal of the GDC is to provide data that relates genetic events to clinical outcomes. With accessible clinical data that accompanies molecular data, the GDC will aim to help the cancer research community uncover molecular traits that contribute to patients’ differential responses to therapy, likelihood of recurrence, and clinical outcome. CCG projects addressing these questions may improve the design of future clinical trials by facilitating the inclusion of new information about the molecular basis of clinical outcomes.
