Submission

introduction

title

Knowledge Graphs and LLMs for drug repurposing

short description

Using Public Knowledge Graphs to Guide LLMs in Generating Novel, Hallucination-Free Therapeutic Mechanisms for Approved Rare Disease Drugs

Phase 1 Submission Form

Overview / Abstract

Large Language Models (LLMs) have recently demonstrated remarkable performance on general tasks across many fields, including generating human-like scientific hypotheses. Knowledge graphs store scientifically established relationships between biomedical entities (e.g. genes, diseases, drugs, biological processes) and can be used to develop probabilistic models that predict new relationships, including disease treatment suggestions. However, such models have poor interpretability, and their predictions lack the expressiveness of human reasoning. The knowledge graph predictions and graph relationships can be used to construct factual prompts that guide an LLM to generate hallucination-free output.

Although more than 7,000 rare diseases are known and together affect nearly 10% of Americans, safe and effective treatments exist for only a few hundred of them. This statistic reflects the significant unmet medical needs within the rare disease community.

Secondary Analysis: Research Aims

The drug treatment prediction model will be built using the PrimeKG knowledge graph, which is available from the PrimeKG Dataverse repository. PrimeKG captures comprehensive data on 17,080 diseases and 4,050,249 relationships across major biological scales, including diseases, drugs, genes, proteins, exposures, phenotypes, side effects, molecular functions, cellular components, biological processes, anatomical regions, and pathways. The TxGNN algorithm will be employed to predict drug treatments from the PrimeKG graph. Graph prediction paths (e.g. 'Ehlers-Danlos syndrome, classic type -> disease_disease -> autosomal dominant disease -> disease_disease -> multiple endocrine neoplasia -> disease_disease -> familial thyroid carcinoma -> indication -> Liothyronine') will be constructed with TxGNN’s Explainer module. For rare diseases without approved treatments, the top drug predictions will be selected together with their associated graph prediction paths (preliminary disease-drug pairs are listed in disease_drug_candidates.csv in the supplement). The graph prediction paths will be filtered and aggregated to yield one prediction explanation per disease-drug pair. The prediction explanations will be used to construct search queries for biomedical publications. The Semantic Scholar API will be used to retrieve relevant literature (abstracts, or full-text PDFs when available), which will be stored locally. Additional text analysis, including re-ranking and contextual summarization, will be performed to highlight the most pertinent statements from the publications. These statements will then be used to generate therapeutic hypotheses with LLMs. The LangChain framework will orchestrate literature search, text analysis, and LLM-driven hypothesis generation, with OpenAI’s GPT-4 model used for the final hypothesis writing.
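As a minimal sketch of the path-handling steps described above, the following Python turns a prediction path string into triples, keeps one path per disease-drug pair, and builds a literature search URL. The function names, the "shortest path wins" aggregation rule, and the keyword-style query are illustrative assumptions, not the project's actual filtering logic; the URL targets the public Semantic Scholar Graph API paper search endpoint, and no request is sent here.

```python
from urllib.parse import urlencode

SEP = " -> "  # separator used in the example path quoted in the text


def parse_path(path):
    """Split 'node -> relation -> node ...' into (head, relation, tail) triples."""
    parts = [p.strip() for p in path.split(SEP)]
    if len(parts) < 3 or len(parts) % 2 == 0:
        raise ValueError(f"malformed path: {path!r}")
    return [tuple(parts[i:i + 3]) for i in range(0, len(parts) - 2, 2)]


def one_path_per_pair(paths):
    """Keep one explanation per (disease, drug) pair.

    Assumption: the shortest path is kept; the real filter may differ.
    """
    best = {}
    for path in paths:
        triples = parse_path(path)
        pair = (triples[0][0], triples[-1][2])
        if pair not in best or len(triples) < len(best[pair]):
            best[pair] = triples
    return best


def build_query(triples):
    """Join the drug, the disease, and the intermediate nodes into a keyword query."""
    disease, drug = triples[0][0], triples[-1][2]
    intermediates = [tail for _, _, tail in triples[:-1]]
    return " ".join([drug, disease, *intermediates])


def search_url(query, limit=10):
    """Build (but do not send) a Semantic Scholar paper search request URL."""
    base = "https://api.semanticscholar.org/graph/v1/paper/search"
    params = {"query": query, "fields": "title,abstract,year", "limit": limit}
    return f"{base}?{urlencode(params)}"


example_path = (
    "Ehlers-Danlos syndrome, classic type -> disease_disease -> "
    "autosomal dominant disease -> disease_disease -> "
    "multiple endocrine neoplasia -> disease_disease -> "
    "familial thyroid carcinoma -> indication -> Liothyronine"
)
for pair, triples in one_path_per_pair([example_path]).items():
    print(pair, search_url(build_query(triples)))
```

Splitting on the full " -> " separator rather than on commas matters because disease names such as "Ehlers-Danlos syndrome, classic type" contain commas themselves.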

The project is expected to follow this timeline: 0.5 months to build the graph drug treatment prediction model and prediction paths, 1 month to develop the literature search and hypothesis generation pipeline using retrieval-augmented generation (RAG), and 0.5 months to clean and document the analysis code for reproducibility.

GREI Repository Data Sets

Dataverse

DOI (Digital Object Identifier) of GREI Repository Dataset

doi.org/10.7910/DVN/IXA7BM

Outcomes and Outputs

Project Outcome:

For 20 rare diseases that currently lack treatment options, we will identify potential treatments using FDA-approved drugs that are already used for other conditions. Based on preliminary analyses of disease-drug treatment predictions, we expect to suggest an average of three drug treatment options per disease. For each suggested drug, we will provide a one-page summary explaining its therapeutic mechanism, supported by scientific references (example in the supplement).
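Selecting a small number of top-ranked candidates per disease could look like the sketch below. It assumes predictions arrive as (disease, drug, score) rows where a higher score means a stronger prediction; the function name and the placeholder disease and drug labels are hypothetical and do not come from the model output.

```python
from collections import defaultdict


def top_k_per_disease(predictions, k=3):
    """Return the k highest-scoring drugs for each disease.

    predictions: iterable of (disease, drug, score) tuples (assumed layout).
    """
    by_disease = defaultdict(list)
    for disease, drug, score in predictions:
        by_disease[disease].append((score, drug))
    return {
        disease: [drug for _, drug in sorted(rows, key=lambda r: r[0], reverse=True)[:k]]
        for disease, rows in by_disease.items()
    }


# Placeholder rows for illustration only; not real predictions.
rows = [
    ("disease_A", "drug_1", 0.91),
    ("disease_A", "drug_2", 0.40),
    ("disease_A", "drug_3", 0.75),
    ("disease_A", "drug_4", 0.88),
    ("disease_B", "drug_2", 0.66),
]
print(top_k_per_disease(rows, k=3))
```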

Sharing Plan:

All therapeutic mechanism summaries, supporting scientific publications, graph model predictions, and the accompanying Python code will be made openly available through a Zenodo repository. Additionally, a GitHub repository will be created. These resources will serve as a project milestone and will allow other researchers to explore the findings, replicate the analysis, and generate their own therapeutic summaries using the provided data and code.

FAIR and CARE Principles:

Findable: All outputs (summaries, code, predictions) will be deposited in the Zenodo repository, ensuring they are indexed, citable, and easily discoverable by other researchers.
Accessible: The repository will be openly accessible without restrictions, ensuring that all stakeholders can download and use the resources.
Interoperable: The accompanying GitHub repository will include an installation requirements file to ensure that the project's Python code runs consistently across platforms.
Reusable: The Python code and supporting publications will provide sufficient information for researchers to regenerate therapeutic mechanism summaries, even though the outputs of the LLM (which are stochastic) cannot be identically reproduced.

In cases where the CARE principles (Collective Benefit, Authority to Control, Responsibility, Ethics) apply, the project will ensure that data sovereignty and ethical considerations are maintained by following open access and citation practices while respecting community rights and input in the research process.

Impact / Scientific Significance

This project will generate therapeutic hypotheses that address significant unmet medical needs within the rare disease community, paving the way for the development of novel therapies. To our knowledge, this is the first attempt to generate therapeutic hypotheses under the constraints of a knowledge graph, which will enhance the factual accuracy of Large Language Model (LLM)-generated hypotheses. By improving the reliability of scientific insights produced by LLMs, this approach opens new avenues for leveraging LLMs in pharmaceutical drug discovery, advancing both rare disease research and drug development methodologies.

Team

Our team at X-Data LLC came together with a shared passion for transforming biomedical data into actionable insights. Founded in May 2024, we bring deep expertise in data analytics tailored specifically for the Life Sciences. Vladimir Morozov, our founder, has led data science initiatives for 25 years in the pharmaceutical and biotech industries, ensuring we understand the complexities of biomedical data. Feodor Morozov, a biomedical engineering student at the University of Illinois, Urbana-Champaign, adds fresh academic and hands-on experience from internships at the University of Chicago and MIT.

Our combined experience includes 25 years of biomedical data analysis, with proficiency in R (20 years) and Python (5 years). We have a strong foundation in public cloud platforms, with 10 years of engineering and deployment on Google and Azure. Additionally, we bring 5 years of expertise in natural language processing (NLP) research and have recently developed cutting-edge retrieval-augmented large language model (LLM) applications.

Considerations

To effectively suggest FDA-approved treatments for rare diseases, the graph training data must cover a broad range of diseases, drugs, and biological processes. A target disease should have known pathological mechanism information, and that mechanism should align with clusters of disease mechanisms that already have drug treatment options. All this information must be captured within the knowledge graph. If additional funding is available, exploring other public knowledge graphs could further enhance prediction accuracy.

Constructing effective search queries based on graph predictions remains an open challenge. Advanced text analysis methods, such as re-ranking and contextual summarization, must be implemented to extract the most meaningful statements from retrieved literature, ensuring that the LLM receives high-quality input for generating therapeutic hypotheses. With our experience in retrieval-augmented LLM generation, we are confident in overcoming these challenges effectively.
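To make the re-ranking and prompt-assembly step concrete, here is a minimal sketch. It uses plain token overlap (Jaccard similarity) as a simple stand-in for a real re-ranking model, and the prompt wording, function names, and placeholder statements are illustrative assumptions rather than the project's actual pipeline.

```python
import re


def tokens(text):
    """Lowercased word tokens as a set."""
    return set(re.findall(r"\w+", text.lower()))


def rerank(query, statements, top_n=3):
    """Order statements by Jaccard overlap with the query (stand-in scorer)."""
    q = tokens(query)

    def score(statement):
        t = tokens(statement)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their original retrieval order.
    return sorted(statements, key=score, reverse=True)[:top_n]


def build_prompt(disease, drug, triples, statements):
    """Assemble a prompt that restricts the LLM to graph facts and cited evidence."""
    facts = "\n".join(f"- {h} --[{r}]--> {t}" for h, r, t in triples)
    evidence = "\n".join(f"[{i}] {s}" for i, s in enumerate(statements, 1))
    return (
        f"Propose a therapeutic mechanism for {drug} in {disease}.\n"
        "Use only the knowledge graph facts and evidence below, "
        "and cite evidence by its number.\n\n"
        f"Knowledge graph facts:\n{facts}\n\n"
        f"Evidence:\n{evidence}\n"
    )


# Placeholder inputs for illustration only; not real findings.
statements = [
    "unrelated note",
    "drug_x modulates pathway_p in disease_y",
    "disease_y prevalence overview",
]
top = rerank("drug_x disease_y mechanism", statements, top_n=2)
print(build_prompt("disease_y", "drug_x", [("disease_y", "disease_protein", "protein_p")], top))
```

Numbering the evidence lines gives the generated summary something explicit to cite, which makes it easier to check each claim against its source statement.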

Supporting Documents

Provide up to 10 resources for the evaluation of your secondary research project including but not limited to:
● The persistent identifier of the dataset(s), other than GREI dataset DOIs already listed above, to be used in the proposed project (where available)
● Tools/workflows or resources to be utilized in the proposed project
● Relevant references or scientific publications that directly relate to the proposed project

Supporting Document (1)

https://www.fda.gov/patients/rare-diseases-fda

Supporting Document (2)

https://github.com/mims-harvard/TxGNN.git

Supporting Document (3)

https://doi.org/10.48550/arXiv.2409.05556

Supporting Document (4)

https://1drv.ms/u/s!AnsrkgqF_x7gioJktOXbmP4tOnx4TQ?e=ZVkSrN

Supporting Document (5)

https://1drv.ms/u/s!AnsrkgqF_x7gioJlpCrjM4JqKsU1Ig?e=MFdY8p

Supporting Document (6)

https://github.com/langchain-ai/langchain

Non Scored Criteria

Please complete this information. It will not be scored by the evaluation panel.

Entity Participation

Participate as an Entity (i.e., registering as a group of individuals competing together on behalf of a legally established organization, institution, or corporation)

Legal Entity Organization Name

X-Data LLC
30 N Gould St Ste R
Sheridan, WY 82801

Research Discipline (non-scored criteria)

Large Language Models
Graph learning
Natural-language processing

IDeA State (non-scored criteria)

All Team Member Information - Name, Organization, Job Title, and Email address

Vladimir Morozov, X-Data LLC, CEO, finance@x-data.ai
Feodor Morozov, X-Data LLC, scientist, contact@x-data.ai

MSI (non-scored criteria)

Participation in prior DataWorks! Prizes (non-scored criteria)

Team Point of Contact Eligibility

yes

Eligibility (non-scored criteria)

Yes, I confirm that I have read and meet the terms of eligibility for this challenge
