Submission

introduction

title

Accessing GREI from an AI Agentic System

short description

We will create tools to access GREI databases from within an AI agentic system and use them to replicate a virtual drug screening workflow from Dataverse.

Phase 1 Submission Form

Overview / Abstract

Large Language Models (LLMs) are employed in scientific workflows to streamline data collection and analysis. AI agentic systems refined for specialized datasets can reduce AI hallucinations. The Molecular Analysis and Reasoning Assistant (MARA), developed by Nanome Inc., is a scientific discovery copilot that uses LLMs for workflows in biochemical informatics. MARA can access chemical databases, visualize and manipulate chemical structures, process tabular data, identify chemical analogs, and download structural files. Users can extend the functionality of MARA by building custom tools, which can be collected into workflows. For instance, we built a Python-based tool that uses the PubChem API to get the IUPAC name of a compound from its common name. Our goal is to build tools to access the APIs of the GREI databases, making the data in those repositories accessible in MARA.
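As an illustration of the kind of custom tool described above, the following is a minimal sketch of a name-to-IUPAC lookup against the public PubChem PUG REST service. It is not the tool built in MARA; the function names (iupac_name_url, parse_iupac_name, get_iupac_name) are hypothetical, and how such a function is registered as a MARA tool is not shown.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Base URL of the public PubChem PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def iupac_name_url(common_name: str) -> str:
    """Build the PUG REST URL that requests the IUPACName property by compound name."""
    quoted = urllib.parse.quote(common_name, safe="")
    return f"{PUG_REST}/compound/name/{quoted}/property/IUPACName/JSON"


def parse_iupac_name(payload: dict) -> Optional[str]:
    """Return the first IUPAC name found in a PUG REST property response, if any."""
    properties = payload.get("PropertyTable", {}).get("Properties", [])
    for entry in properties:
        if "IUPACName" in entry:
            return entry["IUPACName"]
    return None


def get_iupac_name(common_name: str, timeout: float = 10.0) -> Optional[str]:
    """Query PubChem over the network and return the IUPAC name, or None if absent."""
    with urllib.request.urlopen(iupac_name_url(common_name), timeout=timeout) as resp:
        return parse_iupac_name(json.load(resp))
```

Calling get_iupac_name("ibuprofen") issues one HTTP request. Keeping URL construction and response parsing in separate functions lets both be checked without network access.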

Secondary Analysis: Research Aims

Kamakia et al. (https://f1000research.com/articles/12-444) conducted a virtual screening of analogs of common nonsteroidal anti-inflammatory drugs (NSAIDs), using a variety of tools to explore the pharmacodynamic and pharmacokinetic properties of the drugs when bound to COX-1 and COX-2 proteins. In summary, the researchers submitted SMILES strings for the reference drugs to SwissSimilarity to identify analogs, which were then docked to COX-1 and COX-2 structures downloaded from the RCSB database. SMILES strings of the analogs were submitted to SwissADME and ProTox-II to determine toxicity and pharmacokinetic properties.

MARA is an ideal platform for automating and expanding this workflow. MARA has built-in capabilities for downloading structures from RCSB and for identifying chemical analogs. Additional functionality can be built in using custom tools to access SwissDrugDesign tools such as SwissSimilarity. Our research aims are as follows:

Build a tool to access the Harvard Dataverse repository from within MARA and to download and process Dataverse data
Build tools to access and use SwissDrugDesign or similar tools from within MARA
Create a workflow within MARA to replicate the original NSAID drug screening protocol (doi:10.7910/DVN/XMFF8N)
Expand the workflow to extend the original analysis (e.g., include other NSAIDs) or to work with any ligand-target model
Build additional tools to access other GREI repositories, allowing cross-repository comparisons among GREI and other repositories (e.g., PubChem)
Test the feasibility of using the tools and workflows to identify additional datasets in the GREI databases that can be used for secondary analysis
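The first aim, accessing a Dataverse dataset by DOI, could start from something like the sketch below. It assumes the Dataverse native API on the Harvard Dataverse host (a dataset lookup by persistent identifier and a per-file access endpoint); the helper names are hypothetical, and integration with MARA is not shown.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

# Assumed host: the public Harvard Dataverse installation.
DATAVERSE = "https://dataverse.harvard.edu"


def dataset_url(doi: str, base: str = DATAVERSE) -> str:
    """Build the native API URL that returns dataset metadata for a persistent ID."""
    query = urllib.parse.urlencode({"persistentId": doi})
    return f"{base}/api/datasets/:persistentId/?{query}"


def list_files(dataset_json: dict, base: str = DATAVERSE) -> List[Dict]:
    """Extract file id, name, and a download URL from a dataset metadata response."""
    files = dataset_json.get("data", {}).get("latestVersion", {}).get("files", [])
    result = []
    for entry in files:
        data_file = entry.get("dataFile", {})
        file_id = data_file.get("id")
        result.append({
            "id": file_id,
            "filename": data_file.get("filename"),
            "download_url": f"{base}/api/access/datafile/{file_id}",
        })
    return result


def fetch_dataset(doi: str, timeout: float = 30.0) -> dict:
    """Retrieve dataset metadata over the network (one HTTP GET)."""
    with urllib.request.urlopen(dataset_url(doi), timeout=timeout) as resp:
        return json.load(resp)
```

For the dataset named in this proposal, list_files(fetch_dataset("doi:10.7910/DVN/XMFF8N")) would return one entry per deposited file, each with a URL that can be passed to a downloader.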

GREI Repository Data Sets

Dataverse

DOI (Digital Object Identifier) of GREI Repository Dataset

doi:10.7910/DVN/XMFF8N

Outcomes and Outputs

The primary outcome of the work will be a general workflow that can replicate the types of analysis conducted in the Kamakia study and extend them to any similar system. The secondary outcome is functionality that allows GREI repository data to be incorporated into MARA workflows. This will allow for rapid LLM-assisted assessment of drug-protein interactions and comparison of results with data in GREI and other repositories. These workflows will be useful not only for analysis of new drugs but also for secondary analysis of existing datasets such as those found in the GREI databases.

A criticism of LLMs is that they often act as black boxes. MARA outlines in detail the steps taken for each query, allowing the user to better understand the underlying logic the LLM is using. This contributes significantly to reproducibility. The analyses will follow FAIR data principles by making any tools or code developed in this project accessible to the public. Tools developed in MARA include the code and database queries used. Any tools and workflows generated will be publicly available through MARA and the RI-INBRE Molecular Informatics Core GitHub page. The use of MARA also contributes to interoperability, as the secondary goal is to incorporate access to all GREI databases into MARA to allow for simplified data access, comparison, reporting, and reuse of GREI datasets.

Impact/ Scientific Significance

The fields of pharmacology and in silico drug design are being rapidly transformed by the development of artificial intelligence tools. AI tools such as DeepMind's AlphaFold have made it possible to rapidly predict the structures of nearly any protein, leading to a massive increase in the number of potential drug targets that can be analyzed. Similarly, AI agentic systems such as MARA are vital for streamlining and expanding the cheminformatic workflows necessary for large-scale molecular assessment of potential drugs and their targets. These workflows are useful not only for analysis of new drugs, but also for secondary analysis of existing drugs. PubMed currently lists over 500 manuscripts related to drug repurposing and AI/ML. Given the massive size and complexity of existing biomedical databases, AI agentic systems are essential tools for such secondary analyses.

By developing tools and workflows in MARA, we can simplify and streamline the process of data collection and collation from multiple repositories including PubChem, ChEMBL, and the GREI databases. Simple pharmacological analysis workflows such as the one described in the Kamakia dataset can be automated and tested for accuracy, and then expanded to allow for new functionality. Once validated, tools developed for accessing other GREI repositories through MARA will allow rapid identification of similar datasets in those repositories, which can themselves be analyzed. Using the extended reality (XR) functionality of the Nanome app, researchers can easily visualize and share their results with collaborators in real time.
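Identifying similar datasets in a repository amounts to a keyword search followed by filtering. A minimal sketch against the Dataverse Search API is given below; it assumes the Harvard Dataverse host and the standard q, type, and per_page parameters, and the helper names are hypothetical. Other GREI repositories expose different search interfaces that are not covered here.

```python
import json
import urllib.parse
import urllib.request
from typing import List, Optional, Tuple

# Assumed host: the public Harvard Dataverse installation.
DATAVERSE = "https://dataverse.harvard.edu"


def search_url(query: str, per_page: int = 10, base: str = DATAVERSE) -> str:
    """Build a Search API URL restricted to dataset-level results."""
    params = urllib.parse.urlencode(
        {"q": query, "type": "dataset", "per_page": per_page}
    )
    return f"{base}/api/search?{params}"


def summarize_hits(search_json: dict) -> List[Tuple[Optional[str], Optional[str]]]:
    """Reduce a search response to (persistent ID, dataset title) pairs."""
    items = search_json.get("data", {}).get("items", [])
    return [(item.get("global_id"), item.get("name")) for item in items]


def search_datasets(query: str, per_page: int = 10, timeout: float = 30.0):
    """Run the search over the network and return the summarized hits."""
    with urllib.request.urlopen(search_url(query, per_page), timeout=timeout) as resp:
        return summarize_hits(json.load(resp))
```

A call such as search_datasets("NSAID COX-2") would return candidate datasets whose identifiers could then be passed to a dataset download tool for secondary analysis.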

Team

Dr. Christopher L. Hemme has a PhD in Biochemistry and has been a bioinformatician for 25 years. He is currently a Research Associate Professor in the Department of Biomedical and Pharmaceutical Sciences in the College of Pharmacy at the University of Rhode Island and serves as the Director of the RI-INBRE Molecular Informatics Core (MIC). The work proposed here is in line with the goals and mandates of the MIC as outlined in the RI-INBRE grant P20GM103430.

Dr. Abdeltawab Hendawi is an assistant professor in Computer Science and Data Science at the University of Rhode Island (URI). He is the Co-director of the AI-Lab at URI. He received his PhD in Computer Science from the University of Minnesota (UMN). His research interests are centered on big data and AI with a focus on smart cities and smart health related applications. His research is sponsored by grants from NSF, TIDC, and USDA.

Considerations

The MARA workflow should be able to replicate the published analysis
The MARA workflow should complete in a timely manner (i.e., in a time comparable to a ChatGPT query)
The MARA workflow should be extendable to other molecules or systems
The results should be comparable with similar datasets in other GREI repositories (e.g., by identifying related figures in Figshare)

Supporting Documents

Provide up to 10 resources for the evaluation of your secondary research project including but not limited to:
● The persistent identifier of the dataset(s), other than GREI dataset DOIs already listed above, to be used in the proposed project (where available)
● Tools/workflows or resources to be utilized in the proposed project
● Relevant references or scientific publications that directly relate to the proposed project

Supporting Document (1)

https://nanome.ai/

Supporting Document (2)

https://nanome.ai/mara

Supporting Document (3)

https://web.uri.edu/riinbre/mic/

Supporting Document (4)

https://f1000research.com/articles/12-444

Supporting Document (5)

https://www.molecular-modelling.ch/swiss-drug-design.html

Supporting Document (6)

https://web.uri.edu/ai/

Non Scored Criteria

Please complete this information. It will not be scored by the evaluation panel.

Entity Participation

Participate as an Entity (i.e., registering as a group of individuals competing together on behalf of a legally established organization, institution, or corporation)

Legal Entity Organization Name

University of Rhode Island

Research Discipline (non-scored criteria)

Bioinformatics
Artificial Intelligence
Pharmacology

IDeA State (non-scored criteria)

Yes

All Team Member Information - Name, Organization, Job Title, and Email address

Dr. Christopher L. Hemme (Point of Contact), University of Rhode Island, Research Associate Professor, hemmecl@uri.edu
Dr. Abdeltawab Hendawi, University of Rhode Island, Assistant Professor, hendawi@uri.edu

MSI (non-scored criteria)

Participation in prior DataWorks! Prizes (non-scored criteria)

Team Point of Contact Eligibility

yes

Eligibility (non-scored criteria)

Yes, I confirm that I have read and meet the terms of eligibility for this challenge
