From Ignoring Features to Machine Learning Features: Computational Biology Then and Now
by Peggy I. Wang
There is no question that the computational biology field has changed immensely since The Cancer Genome Atlas (TCGA) began in 2006. From data management to the analysis and biological interpretation of data, this field has undergone a dramatic transformation.
“TCGA has been a 12-year odyssey. The program started before we had a fully mature set of technologies and methods,” says Dr. John Weinstein, the head of the Department of Bioinformatics and Computational Biology at The University of Texas MD Anderson Cancer Center and one of the original members of the TCGA Research Network.
Over the course of the program, while researchers were migrating from microarrays to one generation of sequencing platform after another, they were also collectively building and shaping an entirely new field. Dr. Weinstein further explains how computational biology has evolved.
Evolved (Virtual) Work Environments
At the program’s start, a major part of TCGA’s work was learning to manage data sets of increasing size and complexity while coordinating controlled access for hundreds of individuals at different locations. We can get a glimpse of what a computational biologist’s workspace might have looked like in 2006 by digging into the supplementary methods section of TCGA’s first publication on glioblastoma. That section describes the first rendition of the data portal, along with file directory structures and procedures for searching and downloading data.
Flash forward a dozen years, and there are now myriad workspace options for all researchers, such as the NCI’s Genomic Data Commons, which identifies and implements best-in-class bioinformatic pipelines developed by TCGA and others. More hands-on options include supercomputing centers available at different institutions and several different web- and cloud-based resources.
Many More Tools in the Toolshed
The field has exploded with tools for analyzing different types of data for various purposes. While a computational biologist in 2006 might have had to hack together their own alignment tools and establish their own bioinformatic pipelines, running an analysis today is more akin to shopping for a salad dressing at a supermarket: endless flavors from different brands, a myriad of varieties and styles, and of course, the option to make your own.
The increase in tools has changed the way researchers can explore and visualize data. For example, clustered heatmaps, once static images, are now interactive and can link to external resources or invoke other tools.
While the great number of tools developed reflects the remarkable upward movement in the field, whittling down to the “right” tools presents its own challenges.
“In some cases, the community develops a consensus of the right tools to use; at other times, the tools developed first or put out by trusted institutions gain favor. Other times, contests such as DREAM help establish a winner,” explains Weinstein.
Looking More Closely at the Biology
“There are a number of things we had to ignore about the biology because we couldn’t deal with them,” says Weinstein. Now, with more sophisticated methods and higher quality data, we’re able to address some of those issues.
For example, samples were treated as if they were homogeneous groups of cells, rather than a mix of tumor, normal, immune and other cell types. Tumor cells themselves consist of a mix of cancer sub-clones.
Recently developed methods for dealing with heterogeneity have enabled deconstruction of data into cellular elements. In fact, the analysis of many more prostate, pancreatic, and bladder cancers was possible because such methods allowed TCGA to lower the minimum tumor purity requirement imposed on samples from 80% to 60%.
Single-cell genomic and proteomic technologies are another recent advancement allowing researchers to look in more detail at what’s happening within an individual tumor. The new technologies have in turn spawned new methods with updated statistics for analyzing single-cell rather than bulk data.
Dealing with Ever Growing Data
As the size and quantity of data sets continue to grow and the complexity of measurements continues to increase, even more computational techniques will be necessary.
One response has been an increasing emphasis on machine learning techniques. For example, we are seeing deep learning applied to cancer imaging features to predict gene expression or mutations. Others are using different classifiers to identify mutations key to particular phenotypes or pathway activity.
As the community continues to grow and advance with the data, we’ll be able to ask even more interesting questions of biological and clinical relevance.
For Weinstein, the progress has already been remarkable. “Even if TCGA hadn’t produced a single data point of interest, what it did to crystallize, grow, and mature an expansive bioinformatics community would already have been an enormous contribution.”