Submission

introduction

title

Designing RAG Agents for Reuse of GREI Datasets

short description

We implement a RAG based semantic search of DataCite meta data that produces relevant GREI datasets based on a research abstract prompt.

Phase 1 Submission Form

Overview / Abstract

Data reuse is essential for advancing biomedical research, but finding relevant datasets can be time-consuming due to inherent complexities, such as poor indexing within large repositories. To address this, we introduce "RAGREI," a system that employs an agentic approach with Retrieval-Augmented Generation (RAG) techniques to streamline data discovery. Researchers can submit a research abstract to RAGREI, which then acts as an intelligent agent that analyzes the abstract to extract key details, performs targeted searches across GREI repositories using DataCite metadata, and identifies relevant datasets. RAGREI also generates templates for important documents, such as Data Management and Sharing Plans, and suggests training and validation splits that ensure demographic diversity. By simplifying data discovery and supporting compliance with NIH data policies, RAGREI accelerates the reuse of data, driving scientific innovation and improving health outcomes.

Secondary Analysis: Research Aims

Project Description:
We will develop and evaluate an LLM-agentic approach that uses Retrieval-Augmented Generation (RAG) technique to automate biomedical dataset retrieval from the Generalist Repository Ecosystem Initiative (GREI) repositories. Once a user submits a research abstract, our system will extract medical lexicon and compare it with other GREI manuscript abstracts through a vector semantic search, and recommend relevant datasets. In addition our system will also generate documentation like data use requests. A key benefit is its ability to suggest supplementary datasets, enabling researchers to test whether findings generalize to data drawn from different distributions and demographic groups not in the original study. For instance, it can recommend pediatric datasets for oncology studies focused on adults, allowing researchers to assess if findings apply to younger patients.

Data Utilized:
We will use dataset metadata from GREI repositories, focusing on DataCite metadata to link abstracts with associated datasets. GREI hosts extensive biomedical datasets, including clinical and epidemiological data. The project will process thousands of metadata entries from reused datasets, ensuring comprehensive data coverage.

Incorporating GREI Data:
The RAG model will leverage DataCite metadata to link abstracts to relevant GREI datasets. Using research abstracts as inputs and datasets as outputs, the system will automate dataset retrieval. It will also recommend supplementary datasets with demographic splits, broadening research scope. For example, a study on adult oncology could be enriched with pediatric datasets to test if findings generalize across age groups.

Methods and Analysis:
We will fine-tune an off-the-shelf LLM, such as Meditron or OpenBioLLM, using unsupervised methods like RadGRAPH to optimize its ability to interpret DataCite metadata. The model will conduct semantic searches of GREI repositories based on abstracts, retrieving datasets and generating compliance-ready documentation. A core feature is recommending diverse demographic datasets, enabling researchers to validate findings across underrepresented groups.

Timeline:
The project will be completed within 6 months. In month 1, we will select the baseline model and prepare GREI metadata. Over the next 3 months, we will fine-tune and train the model. In month 5, we will validate the model using GREI datasets, followed by a case study on the generalizability of oncology findings across unstudied demographic groups.

Benefit to Research:
This project enhances the generalizability of biomedical research by recommending supplementary datasets, allowing researchers to validate findings across different demographic groups. Streamlining dataset discovery and expanding analysis to include underrepresented populations will advance personalized medicine and promote inclusive research outcomes.

GREI Repository Data Sets

Dataverse
Dryad
Figshare
Mendeley Data
Open Science Framework (OSF)
Vivli
Zenodo (CERN and Northwestern University)

DOI (Digital Object identifier) of GREI Repository Dataset

10.7937/4QAD-4280, 10.5281/zenodo.10728780, 10.21227/fehk-6b77, 10.5281/zenodo.10728780, 10.7937/9JER-G980, 10.7937/gkr0-xv29

Outcomes and Outputs

Outcomes and Outputs

Our project will generate two key outcomes: (1) research artifacts and findings that enhance Retrieval-Augmented Generation (RAG) models for biomedical dataset retrieval, and (2) increased reuse of GREI datasets through the deployment of the RAGREI tool. Both outcomes will be actively shared with the research community to promote data reuse, improve replicability, and adhere to FAIR and CARE principles.

Research Findings and Expected Outcomes
We will produce a comprehensive study evaluating RAGREI’s performance in retrieving relevant datasets from GREI repositories based on research abstracts. Key metrics, such as search precision and recall (informed by recent developments in RAG Benchmarking such as Chen et al. 24), will be reported, and a domain-specific analysis in pediatric oncology will examine 50-100 validation abstracts to assess generalizability across sub-populations. Increased reuse of GREI datasets is expected through the RAGREI tool, which automates dataset retrieval and documentation generation, lowering barriers to data access and facilitating broader scientific discovery.

Dissemination Strategy
We will submit our findings as a manuscript to high-impact conferences, such as the datasets and benchmarks track of NeurRIPS or AAAI. The manuscript will detail our evaluation of RAGREI, focusing on its ability to expand research to new populations, such as pediatric oncology. Additionally, we will make RAGREI freely available via a web portal through an easy to use interface (GUI) similar to ChatGPT or open-source codebase, with comprehensive documentation and use case examples to encourage adoption. We will also demonstrate RAGREI at conferences, engaging directly with the research community to showcase the tool's practical applications and gather feedback.

FAIR and CARE Principles
Our project will ensure datasets and tools are findable, accessible, interoperable, and reusable by leveraging DataCite metadata and integrating semantic techniques like RadGRAPH. When applicable, we will adhere to CARE principles, especially for sensitive datasets, by respecting ethical standards and community values.

Replicability and Reproducibility
To promote replicability, all code, models, and data processing pipelines will be released as open-source software. This transparency will allow researchers to replicate our study, apply RAGREI to their own datasets, and contribute to further improvements. Thorough documentation of datasets, model configurations, and evaluation methods will ensure ease of replication and reproducibility.

Benefit to Research
By recommending supplementary datasets, this project will enhance the generalizability of biomedical research, allowing researchers to validate findings across diverse demographic groups. Streamlining dataset discovery and expanding analyses to include underrepresented populations will advance personalized medicine and promote inclusive research outcomes

Impact/ Scientific Significance

Impact / Scientific Significance

The impact of RAGREI goes well beyond a single study that reuses one or several datasets. We are proposing a highway that rapidly connects healthcare research teams to data relevant to their research questions. The problem we address is that browsing the many GREI datasets for relevant data and drafting the required request documentation is cumbersome, often requiring multiple search sessions. This hampers the reuse of data that could hold key information for scientific discovery and has, to date, prevented many research groups from benefiting from GREI data.

RAGREI greatly lowers the barrier to entry for a broad range of research groups in multiple disciplines, allowing them to make greater impacts in diagnosis, treatment, and prevention. By enabling the efficient use of GREI datasets, RAGREI fosters the expansion of scientific knowledge across healthcare domains.

Contributions to Scientific Disciplines
RAGREI’s primary contribution is in facilitating the reuse of GREI datasets across multiple biomedical fields. This tool automates complex, time-consuming data searches, which currently act as a bottleneck for research progress in fields like genomics, oncology, and epidemiology. Researchers can leverage RAGREI to discover relevant datasets and integrate them into their work, accelerating discovery and expanding the scope of their studies.

Example Case Study: Pediatric Oncology
We will demonstrate this potential in our team's domain of medical science expertise, pediatric oncology. Our proposed 50-100 abstract validation will be used to select a study of interest with multiple additional datasets with an expanded range of demographics. Specifically we will use RAGREI to explore the generalizability of our existing paediatric clinical oncology models to adult populations for lesion detection (sarcoma) and diagnostic tasks (SUV/Deauville scoring) where RAGREI will indicate the relevant files (we know apriori to exist) on Dataverse for this work. Recent studies have shown that state of the art image based clinical oncology models do not perform well on paediatric cohorts. By demonstrating that RAGREI suggests the relevant datasets to expand demographic coverage of our in-house models, we can demonstrate to the community a tangible example of the benefits it promises

Broader Impact on Human Health
RAGREI’s ability to streamline data reuse across diverse populations could have a significant impact on human health. By facilitating easier access to GREI datasets, RAGREI democratizes data usage and empowers research groups with fewer resources to undertake large-scale, impactful studies. Its functionality enables a wide range of researchers to incorporate datasets that represent real-world populations, leading to diagnostic, treatment, and preventive measures that are applicable across demographic groups

Team

Our team is highly experienced in statistical analysis and machine learning, making us well-equipped to ensure the success of RAGREI.

Michael Joseph Barrow, PhD (Team Lead), has over 10 years of experience in AI and machine learning with a focus on healthcare applications. His expertise in data-driven modeling, neural networks, and regression techniques will be instrumental in fine-tuning the LLM and handling complex medical data. At Stanford, Dr. Barrow has successfully developed ML models for rare cancer diagnosis and outcome prediction

Kowshik Thopalli, PhD, is a specialist in domain generalization, having published research on enhancing the robustness of machine learning models under statistical domain shifts. His deep knowledge in computer vision, counter-factual reasoning and large-language models will strengthen our approach to analyzing diverse datasets from the GREI repositories

Dr. Heike Daldrup-Link, MD, PhD, a leading figure in pediatric radiology at Stanford, brings invaluable clinical insights to the project. Prof. Daldrup-Link's expertise in translating medical data into actionable findings will ensure that RAGREI produces clinically relevant and generalizable recommendations

Considerations

To ensure the success of the proposed research project, several key considerations must be addressed:

Risk of Bias: The model may introduce bias due to varying levels of detail in dataset metadata. To mitigate this, we will adjust the model to account for metadata completeness when ranking datasets, enrich metadata through automated techniques, and continuously monitor outputs for fairness, ensuring diverse and representative dataset recommendations.

Hallucination: Even with relevant documents retrieved by RAG, LLMs can produce content that wasn't explicitly stated in the retrieved text. This would mislead the researchers and waste their time. To address this we will do post-hoc checks along with implementing hallucination detection techniques.

Outdated/Static Information: Since RAG relies on the constructed vector database of abstracts and meta-data, we will frequently update this database so as to ensure that our responses reflect the latest knowledge.

Supporting Documents

Provide up to 10 resources for the evaluation of your secondary research project including but not limited to: ● The persistent identifier of the dataset(s), other than GREI dataset DOIs already listed above, to be used in the proposed project (where available) ● Tools/workflows or resources to be utilized in the proposed project ● Relevant references or scientific publications that directly relate to the proposed project

Supporting Document (1)

https://docs.google.com/document/d/1cnzF1b0WLfy9-_9zWdvAsg6YBkT_JLx1ogjDdn4YbTk/edit?usp=sharing

Supporting Document (2)

https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B

Supporting Document (3)

https://arxiv.org/abs/2311.16079

Supporting Document (4)

https://ojs.aaai.org/index.php/AAAI/article/view/29728

Supporting Document (5)

https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/c8ffe9a587b126f152ed3d89a146b445-Paper-round1.pdf

Supporting Document (6)

https://ajronline.org/doi/full/10.2214/AJR.24.31076

Supporting Document (7)

https://arxiv.org/abs/2408.00331

Supporting Document (8)

https://www.arxiv.org/abs/2409.03946

Supporting Document (9)

https://arxiv.org/abs/2001.10717

Non Scored Criteria

Please complete this information. It will not be scored by the evaluation panel.

Entity Participation

Participate as an Entity (i.e., registering as a group of individuals competing together on behalf of a legally established organization, institution, or corporation)

Legal Entity Organization Name

Board of Trustees of the Leland Stanford Junior University,
450 Jane Stanford Way, Stanford, CA 94305-2004

Research Discipline (non-scored criteria)

Retrieval Augmented Generation
Scientific Machine Learning
Pediatric Oncology
Clinical Outcome Prediction
Computer Vision

IDeA State (non-scored criteria)

All Team Member Information - Name, Organization, Job Title, and Email address

Michael Joseph Barrow, Stanford University, Research Engineer, mjbarrow@stanford.edu
Kowshik Thopalli, Lawrence Livermore National Laboratory, Postdoctoral Researcher, thopalli1@llnl.gov
Heike Daldrup-Link, Stanford University, Professor, heiked@stanford.edu

MSI (non-scored criteria)

Participation in prior DataWorks! Prizes (non-scored criteria)

Team Point of Contact Eligibility

yes

Eligibility (non-scored criteria)

Yes, I confirm that I have read and meet the terms of eligibility for this challenge

Was this page helpful? yes no