Large language models (LLMs) have recently demonstrated remarkable performance on general tasks across many fields, including generating human-like scientific hypotheses. Knowledge graphs store scientifically established relationships between biomedical entities (e.g. genes, diseases, drugs, biological processes) and can be used to develop probabilistic models that predict new relationships, including disease treatment suggestions. However, such models have poor interpretability, and their predictions lack the expressiveness of human reasoning. Knowledge graph predictions and graph relationships can be used to construct factual prompts that guide an LLM to generate hallucination-free output.
Of the more than 7,000 known rare diseases, which affect nearly 10% of Americans, safe and effective treatments exist for only a few hundred. This statistic reflects the significant unmet medical needs within the rare disease community.
The drug treatment prediction model will be built using the PrimeKG knowledge graph, which is available from the PrimeKG Dataverse repository. PrimeKG captures comprehensive data on 17,080 diseases and 4,050,249 relationships across major biological scales, including diseases, drugs, genes, proteins, exposures, phenotypes, side effects, molecular functions, cellular components, biological processes, anatomical regions, and pathways. The TxGNN algorithm will be used to predict drug treatments from the PrimeKG graph. Graph prediction paths (e.g. 'Ehlers-Danlos syndrome, classic type -> disease_disease -> autosomal dominant disease -> disease_disease -> multiple endocrine neoplasia -> disease_disease -> familial thyroid carcinoma -> indication -> Liothyronine') will be constructed with TxGNN’s Explainer module. For rare diseases without approved treatments, the top drug predictions, along with their associated graph prediction paths, will be selected (preliminary disease-drug pairs are provided in disease_drug_candidates.csv in the supplement). The graph prediction paths will be filtered and aggregated to yield one prediction explanation per disease-drug pair. The prediction explanations will be used to construct search queries for biomedical publications. The Semantic Scholar API will be used to retrieve relevant literature (abstracts, or full-text PDFs when available), which will be stored locally. Additional text analysis, including re-ranking and contextual summarization, will be performed to highlight the most pertinent statements from the publications. These statements will then be used to generate therapeutic hypotheses with LLMs. The LangChain framework will orchestrate literature search, text analysis, and LLM-driven hypothesis generation, with OpenAI’s GPT-4 model used for the final hypothesis writing.
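As an illustration of how a graph prediction path could be turned into a literature search, the following minimal Python sketch parses the example path above into edges and assembles a search request for the Semantic Scholar Graph API paper-search endpoint. The names `Edge`, `parse_path`, `build_query`, and `search_url` are hypothetical and introduced only for this example. The query strategy shown (disease, drug, and intermediate nodes joined as quoted phrases) and the requested fields are assumptions rather than the project's final design. The sketch only builds the request URL and performs no network call.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Semantic Scholar Graph API paper-search endpoint.
S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


@dataclass(frozen=True)
class Edge:
    """One hop of a prediction path (hypothetical helper type)."""
    source: str
    relation: str
    target: str


def parse_path(path: str) -> list[Edge]:
    """Split 'node -> relation -> node -> ...' into a list of edges."""
    parts = [p.strip() for p in path.split(" -> ")]
    # A valid path alternates node, relation, node, ... and ends on a node.
    if len(parts) < 3 or len(parts) % 2 == 0:
        raise ValueError("expected alternating nodes and relations")
    return [Edge(parts[i], parts[i + 1], parts[i + 2])
            for i in range(0, len(parts) - 2, 2)]


def build_query(edges: list[Edge]) -> str:
    """Assumed strategy: disease and drug first, then intermediate nodes."""
    disease, drug = edges[0].source, edges[-1].target
    intermediates = [e.target for e in edges[:-1]]
    terms = [disease, drug, *intermediates]
    # Quote multi-word terms so they read as phrases in the query string.
    return " ".join(f'"{t}"' if " " in t else t for t in terms)


def search_url(query: str, limit: int = 10) -> str:
    """Build the request URL only; fetching and storage are out of scope."""
    params = {
        "query": query,
        "limit": limit,
        "fields": "title,abstract,openAccessPdf",  # assumed field selection
    }
    return f"{S2_SEARCH_URL}?{urlencode(params)}"


EXAMPLE_PATH = (
    "Ehlers-Danlos syndrome, classic type -> disease_disease -> "
    "autosomal dominant disease -> disease_disease -> "
    "multiple endocrine neoplasia -> disease_disease -> "
    "familial thyroid carcinoma -> indication -> Liothyronine"
)

edges = parse_path(EXAMPLE_PATH)
query = build_query(edges)
print(search_url(query))
```

In practice a query containing every intermediate node may be too restrictive, so the pipeline would likely try several query variants per disease-drug pair and merge the results.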
The project is expected to follow this timeline: 0.5 months to build the graph drug treatment prediction model and prediction paths, 1 month to develop the literature search and hypothesis generation pipeline using retrieval-augmented generation (RAG), and 0.5 months to clean and document the analysis code for reproducibility.
For 20 rare diseases that currently lack treatment options, we will identify potential treatments using FDA-approved drugs that are already used for other conditions. Based on preliminary analyses of disease-drug treatment predictions, we expect to suggest an average of three drug treatment options per disease. For each suggested drug, we will provide a one-page summary explaining its therapeutic mechanism, supported by scientific references (example in the supplement).
All therapeutic mechanism summaries, supporting scientific publications, graph model predictions, and the accompanying Python code will be made openly available through a Zenodo repository. Additionally, a GitHub repository will be created. These resources will serve as a project milestone and will allow other researchers to explore the findings, replicate the analysis, and generate their own therapeutic summaries using the provided data and code.
Where the CARE principles (Collective Benefit, Authority to Control, Responsibility, Ethics) apply, the project will maintain data sovereignty and ethical considerations by following open access and citation practices while respecting community rights and input in the research process.
This project will generate therapeutic hypotheses that address significant unmet medical needs within the rare disease community, paving the way for the development of novel therapies. To our knowledge, this is the first attempt to generate therapeutic hypotheses under the constraints of a knowledge graph, which will enhance the factual accuracy of Large Language Model (LLM)-generated hypotheses. By improving the reliability of scientific insights produced by LLMs, this approach opens new avenues for leveraging LLMs in pharmaceutical drug discovery, advancing both rare disease research and drug development methodologies.
Our team at X-Data LLC came together with a shared passion for transforming biomedical data into actionable insights. Founded in May 2024, we bring deep expertise in data analytics tailored specifically for the Life Sciences. Vladimir Morozov, our founder, has led data science initiatives for 25 years in the pharmaceutical and biotech industries, ensuring we understand the complexities of biomedical data. Feodor Morozov, a biomedical engineering student at the University of Illinois, Urbana-Champaign, adds fresh academic and hands-on experience from internships at the University of Chicago and MIT.
Our combined experience includes 25 years of biomedical data analysis, with proficiency in R (20 years) and Python (5 years). We have a strong foundation in public cloud platforms, with 10 years of engineering and deployment on Google and Azure. Additionally, we bring 5 years of expertise in natural language processing (NLP) research and have recently developed cutting-edge retrieval-augmented large language model (LLM) applications.
To effectively suggest FDA-approved treatments for rare diseases, the graph training data must cover a broad range of diseases, drugs, and biological processes. A target disease should have pathological mechanism information that aligns with clusters of disease mechanisms for which drug treatment options already exist. All of this information must be captured within the knowledge graph. If additional funding is available, exploring other public knowledge graphs could further enhance prediction accuracy.
Constructing effective search queries based on graph predictions remains an open challenge. Advanced text analysis methods, such as re-ranking and contextual summarization, must be implemented to extract the most meaningful statements from retrieved literature, ensuring that the LLM receives high-quality input for generating therapeutic hypotheses. With our experience in retrieval-augmented LLM generation, we are confident in overcoming these challenges effectively.
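To make the re-ranking step concrete, the sketch below orders candidate sentences by how many of the query terms they cover and keeps the top few. This simple lexical overlap score is only a stand-in for whichever re-ranker the pipeline eventually adopts (for example an embedding-based or cross-encoder model). The function names and the toy sentences are hypothetical and are not retrieved literature.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def coverage_score(query: str, sentence: str) -> float:
    """Fraction of distinct query terms that appear in the sentence."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    sentence_terms = set(tokenize(sentence))
    return len(query_terms & sentence_terms) / len(query_terms)


def rerank(query: str, sentences: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k sentences by coverage score, highest first.

    sorted() is stable, so ties keep their original retrieval order.
    """
    ranked = sorted(sentences,
                    key=lambda s: coverage_score(query, s),
                    reverse=True)
    return ranked[:top_k]


# Toy inputs for illustration only.
toy_query = "liothyronine thyroid carcinoma"
toy_sentences = [
    "Sentence about liothyronine and the thyroid.",
    "Sentence about an unrelated topic.",
    "Sentence linking liothyronine to thyroid carcinoma.",
]

top_statements = rerank(toy_query, toy_sentences, top_k=2)
for statement in top_statements:
    print(statement)
```

The selected statements would then be passed, together with the graph prediction explanation, into the prompt for hypothesis generation.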