Over 44,000 AACR Project GENIE Cases Available in the GDC

December 11, 2019, by Louis M. Staudt, M.D., Ph.D.

Credit: The American Association for Cancer Research

NCI’s Genomic Data Commons (GDC) has released data for 44,756 cancer cases from the American Association for Cancer Research's Project Genomics Evidence Neoplasia Information Exchange, more simply known as AACR Project GENIE. This massive project was launched in 2015 with the goal of building an international, pan-cancer registry with tens of thousands of patients to empower precision oncology.

The urgent need for broad data sharing in the cancer research community spurred the AACR, along with eight global academic leaders in clinical cancer genomics, to initiate AACR Project GENIE. By hosting the data in the GDC, we’re giving researchers more ways to access it, further expanding the utility of the data and the potential impact of the project.

The data released in the GDC covers 294 unique cancer types, including many cases of rare bone and soft-tissue cancers new to the GDC. The contribution has more than doubled the number of cases in the GDC.

Harmonizing Across Different Labs and Platforms: A Technical Feat

More than a dozen targeted-sequencing platforms used by 8 institutions across 4 different countries are represented in the batch of data released at the GDC. And those numbers are increasing, as GENIE continues to add participating institutions and their respective cases.

I commend AACR for the huge amount of work devoted to harmonizing these diverse datasets from a broad collection of institutions. Bringing together data produced on different physical platforms and processed through different analytic methodologies requires heavy lifting, but the work is necessary to gain the numbers and diversity needed to build a valuable dataset.

Much of the harmonization work was done prior to integration into the GDC, by AACR Project GENIE in collaboration with their partners at Sage Bionetworks, led by Kristen Dang and Thomas Yu, and at Memorial Sloan Kettering Cancer Center (MSKCC), led by Stacy Thomas.

“We made sure to sit down and work with the original contributing sites to develop a common data model utilizing existing standards and definitions where possible,” says Jocelyn Lee, Lead Project Manager of AACR Project GENIE. “This was a key step in making a project of this magnitude work.”

Bringing Together Data Models for Integration into the GDC

Mapping between clinical vocabularies is known to be a particularly difficult task. But having done the legwork of establishing a well-thought-out model, AACR Project GENIE and their partners were able to work with the GDC team to map GENIE data to the GDC data model with relative ease.

“Several cases were straightforward and mapped one-to-one, such as mapping values for sex, race, and ethnicity,” explained Kristen Dang, who is a Principal Scientist at Sage Bionetworks. “And in other cases, such as assay-level details, we had to go back and collect more structured information from GENIE or have the GDC create new data elements.”
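To make the distinction concrete, here is a minimal sketch of a one-to-one value mapping that also reports the fields it cannot translate. The field names, values, and the `map_record` helper are hypothetical illustrations, not the actual GENIE or GDC data dictionaries or tooling.

```python
# Hypothetical lookup tables: every field name and value below is
# invented for illustration and is not taken from GENIE or the GDC.
VALUE_MAPS = {
    "sex": {"Male": "male", "Female": "female"},
    "ethnicity": {"Spanish/Hispanic": "hispanic or latino"},
}


def map_record(record, value_maps):
    """Translate the fields that have a one-to-one mapping.

    Returns (mapped, unmapped): a dict of translated values and a list of
    field names that need follow-up, such as collecting more structured
    information or defining a new data element.
    """
    mapped = {}
    unmapped = []
    for field, value in record.items():
        table = value_maps.get(field)
        if table is not None and value in table:
            mapped[field] = table[value]
        else:
            unmapped.append(field)
    return mapped, unmapped


example = {"sex": "Female", "assay_detail": "custom panel"}
mapped, unmapped = map_record(example, VALUE_MAPS)
```

In this toy record, `sex` maps directly, while `assay_detail` lands in the follow-up list, mirroring the two situations described in the quote.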

As for genomic data, mapping processed data from GRCh37 to the GRCh38 reference human genome was essentially all that was required. Again, the relative simplicity of this integration was thanks to the harmonization work already done by AACR Project GENIE, Sage Bionetworks and MSKCC.

Greater Numbers for Finding Rarer Driver Mutations

This dataset is extremely large, about four times the number of cases collected by TCGA, NCI’s flagship cancer characterization program. Large sample numbers will enable the discovery of rare mutations in cancer drivers that are recurrent in a small percentage of patients. Such mutations would otherwise be ignored or hidden, but because they are recurrent, they can be implicated as pathogenic.

For example, the AKT1-E17K mutation is only present in 3–5% of all cancers. AKT inhibitors are showing promise in estrogen receptor positive (ER+) breast cancers with this mutation—a mere 4% of the ER+ breast cancer population. GENIE is enabling researchers to find more of these rare mutations, analyze their clinicopathologic features, and inform the drug development process.

Studies like these will sharpen our precision medicine clinical trials going forward, as it is critical to be able to distinguish between driver mutations and passenger mutations that are functionally irrelevant.
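The recurrence argument in this section can be sketched in a few lines: count how many distinct cases carry each mutation and keep those seen in at least a minimum number of cases. The input layout, the `recurrent_mutations` helper, and the toy calls are assumptions made for illustration; this is not the GENIE or GDC analysis pipeline, and recurrence alone does not establish that a mutation is a driver.

```python
from collections import defaultdict


def recurrent_mutations(calls, min_cases=2):
    """Count distinct cases per mutation and keep the recurrent ones.

    `calls` is an iterable of (case_id, gene, protein_change) tuples,
    a hypothetical layout chosen for this sketch.
    """
    cases_by_mutation = defaultdict(set)
    for case_id, gene, change in calls:
        # A set ensures a case is counted once even if it is listed twice.
        cases_by_mutation[(gene, change)].add(case_id)
    return {
        mutation: len(cases)
        for mutation, cases in cases_by_mutation.items()
        if len(cases) >= min_cases
    }


# Toy data, invented for illustration only.
toy_calls = [
    ("case-1", "AKT1", "E17K"),
    ("case-2", "AKT1", "E17K"),
    ("case-2", "AKT1", "E17K"),  # duplicate row for the same case
    ("case-3", "GENE_X", "A1B"),  # seen in a single case
]
recurrent = recurrent_mutations(toy_calls)
```

With more cases in the input, a mutation present in only a few percent of patients accumulates enough carriers to clear the threshold, which is the practical benefit of a registry this large.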

Understanding Oncogenes Across Different Contexts

The many distinct tumor types represented in the GENIE registry might also provide clues to how the same cancer genes function in different cellular contexts. For example, a recent study suggested that mutations in the prototypical cancer gene BRCA may be biologically neutral in many cancer types other than breast and ovarian cancer.

We may find that the different cancer types contained in the GENIE registry recurrently acquire different mutations in the same oncogenes, and our understanding of this behavior will have implications for how we test and treat each patient.

We look forward to the many exciting, clinically actionable findings that are sure to come from the GENIE dataset. Congratulations to AACR and their partners for their continued progress on this project and their efforts to promote cancer data sharing as an engine for cancer research.

Dr. Staudt serves on the external Scientific Advisory Board of the AACR Project GENIE.