Data Quality for LLMs: Building a Reliable Data Foundation

Data Science Seminar Series

April 24, 2024 | 11:00 AM – 12:00 PM

Virtual

If you use large language models (LLMs) in your cancer research, register for this seminar to hear Elucidata’s Dr. Abhishek Jha discuss how data quality impacts LLM performance.

A reliable data foundation, one that is well annotated and accessible to an LLM, plays a major role in the value of the model’s results.

You’ll see examples of how LLM-powered artificial intelligence (AI) agents query three versions of the same gene expression corpus and return differing results. The three versions are:

unstructured data from the public repository Gene Expression Omnibus.
structured data from the Crowd Extracted Expression of Differential Signatures project (a tool developed by the Ma’ayan Lab at the Icahn School of Medicine at Mount Sinai).
clean, linked, and harmonized data.

About the Speaker

Abhishek Jha, Ph.D.
Dr. Jha is Co-founder and CEO of Elucidata, a company focused on data management and machine learning operations for life sciences research. He previously worked at Agios Pharmaceuticals.

About the Data Science Seminar Series

CBIIT’s Data Science Seminar Series is dedicating its 2026 events to spotlighting the use of AI in cancer research and care. Brought to you by CBIIT and NCI's Division of Cancer Treatment and Diagnosis AI working group, the upcoming webinars will explore a variety of questions, such as the following:

How can AI be used for diagnosis, treatment, or omics research?
What are the related laws and ethical considerations for AI?
How can we empower an AI-ready cancer research community through workforce development, collaborations, and funding?

To view upcoming speakers or recordings of past presentations, visit the Data Science Seminar Series page.