Submission

introduction

title

TBScan: LLMs Prioritizing Genes in TB Pathogenesis

short description

TBScan prioritizes genes in TB pathogenesis, drug resistance, and host-genotype interactions using fine-tuned LLMs on multi-omics data.

Phase 1 Submission Form

Overview / Abstract

Tuberculosis (TB) remains a major public health concern, with 10.6 million cases and 1.3 million deaths globally in 2022. TB is a complex disease influenced by immune status, living conditions, and genetic predispositions, making control efforts challenging. This project proposes to develop TBScan, a tuberculosis-specific Large Language Model (LLM) trained on transcriptomics, whole genome sequencing, and single-cell datasets. We aim to fine-tune pre-trained LLMs such as BioBERT and BioGPT using TB-specific data to prioritize genes involved in TB pathogenesis, integrating genomic insights such as SNPs and host-pathogen interactions. We will benchmark these models, fine-tune them using LoRA, and assess their ability to classify genes relevant to TB, such as those involved in drug resistance and virulence. By delivering a robust gene prioritization tool that leverages prior knowledge and data, this approach aims to advance TB research toward improved diagnosis, treatment, and prevention.

Secondary Analysis: Research Aims

Data Collection:
We will scan repositories such as Mendeley Data, Zenodo, PubMed, and GEO to identify publicly available TB datasets, including transcriptomics, WGS, and single-cell data. We aim to collect data on TB pathogenesis, host-genotype interactions, and drug resistance.

Inclusion Criteria:

Sufficient sample size for statistical power
Detailed metadata: demographic (age, sex, ethnicity), clinical (disease stage, treatment), and biological (immune status) information
Clear study designs focused on TB progression, host-genotype interactions, or drug resistance

We anticipate working with 10-20 high-quality datasets per data type, totaling over 10,000 samples. We will filter datasets based on quality control metrics, allowing up to 10% missingness. Our team has extensive experience managing heterogeneous datasets and addressing missing data, batch effects, and biological variability.
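
The 10% missingness filter described above could be sketched as follows. This is a minimal illustration only: the function names, the representation of a dataset as a flat list of values with None marking missing entries, and the toy data in the usage note are assumptions, not part of the project pipeline.

```python
from typing import Dict, List, Optional

# Threshold taken from the text: allow up to 10% missing values.
MAX_MISSING_FRACTION = 0.10


def missing_fraction(values: List[Optional[float]]) -> float:
    """Return the fraction of entries that are missing (None)."""
    if not values:
        return 1.0  # assumption: treat an empty dataset as fully missing
    return sum(v is None for v in values) / len(values)


def filter_datasets(
    datasets: Dict[str, List[Optional[float]]],
    max_missing: float = MAX_MISSING_FRACTION,
) -> List[str]:
    """Keep the names of datasets whose missingness is within the threshold."""
    return [
        name
        for name, values in datasets.items()
        if missing_fraction(values) <= max_missing
    ]
```

For example, calling filter_datasets on a toy dictionary in which one dataset has 1 missing value out of 10 and another has 1 missing value out of 2 would keep only the first. In practice the threshold might be applied per sample or per gene rather than per dataset; that choice is left open here.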

Ethical Considerations:
All data will adhere to ethical guidelines, ensuring anonymity and compliance with FAIR principles. We will avoid bias in model outputs by ensuring non-discriminatory use of demographic data and follow informed consent guidelines for publicly available datasets.

Methods:

Pretrained Model Selection: We will select pretrained biomedical models (BioBERT, BioGPT, ESM), prioritizing those best suited to recognizing gene expression patterns. Pretrained models offer the advantage of having already learned general biomedical knowledge from large literature and sequence corpora. Fine-tuning them with TB-specific datasets, such as transcriptomics, GWAS, and single-cell data, should enhance their ability to detect TB-related genes, pathways, and drug resistance mechanisms. This approach improves predictive power without retraining from scratch, combining general biomedical knowledge with disease-specific insights so that the model can capture the nuances of TB pathogenesis.
Data Preprocessing: We will conduct quality control and normalization, including batch effect correction, to ensure consistency across studies before model fine-tuning.
Model Fine-Tuning with LoRA: We will apply LoRA (Low-Rank Adaptation) to fine-tune selected LLMs on TB-specific data, integrating transcriptomic, genomic, and demographic information to enhance gene prioritization.
Validation: Fine-tuned models will be benchmarked on TB-specific tasks, such as classifying genes involved in drug resistance and virulence. We will use cross-validation and testing on independent datasets, with performance measured by precision, recall, and F1 score.
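
The validation metrics named above can be computed as in the following minimal sketch. It assumes a binary labeling (1 = gene relevant to the task, 0 = not relevant); the function name and the convention of returning 0.0 when a denominator is zero are illustrative choices, not a description of the project's evaluation code.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumption: report 0.0 when a metric is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

With true labels [1, 1, 1, 0] and predictions [1, 1, 0, 0], there are two true positives, no false positives, and one false negative, so precision is 1.0, recall is 2/3, and F1 is 0.8.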

Timeline:

Months 1-3: Data collection
Months 4-6: Data preprocessing; LLM selection
Months 7-9: Model training and optimization
Months 10-12: Validation, cross-validation, benchmarking, and dissemination

GREI Repository Data Sets

Mendeley Data
Zenodo (CERN and Northwestern University)

DOI (Digital Object identifier) of GREI Repository Dataset

This is a partial list of the datasets we have gathered so far:
10.5281/zenodo.3402385
10.5281/zenodo.11174762
10.5281/zenodo.4552100
10.5281/zenodo.7339179
10.5281/zenodo.6386494
10.5281/zenodo.3574677
10.5281/zenodo.4318819
10.5281/zenodo.10894444
10.5281/zenodo.6176732
10.5281/zenodo.1470250
10.5281/zenodo.13871749
10.5281/zenodo.11034770
10.5281/zenodo.4011409
10.5281/zenodo.3856827
10.5281/zenodo.4776015
10.5281/zenodo.1162703
10.15587/2519-4852.2021.230028

Outcomes and Outputs

The main outcome of this project will be the development of TBScan, a tuberculosis-specific Large Language Model (TB-LLM), fine-tuned to prioritize genes associated with TB pathogenesis. The key findings and capabilities of TBScan will include:

Antibiotic Resistance: Identification of genes responsible for TB drug resistance, aiding in the understanding of mechanisms underlying treatment failure.
Host Susceptibility: Discovery of genes linked to genotypes that either increase or decrease susceptibility to TB infection.
Transmission Factors: Identification of genetic markers associated with individuals at higher risk of transmission, such as superspreader phenotypes.
Novel Drug Targets: Detection of underexplored genes that could serve as potential therapeutic targets, guiding future drug development.

In addition, TBScan will classify genes based on their biological roles, such as resistance to antibiotics, virulence, and transcriptional regulation, addressing gaps in TB research, including the complexity of non-coding RNAs involved in TB pathogenesis.

Improved Prompt Templates for Researchers and Clinicians

One of the major innovations of TBScan will be its template-based prompt system, designed to help clinicians and researchers interact with the model with ease. These predefined templates will:

Allow users to query specific genes according to their research needs, within the contexts the model has been most extensively trained on, minimizing the need for complex input from users.
Include prebuilt query formats for common research and clinical questions, such as identifying genes related to drug resistance, understanding host-pathogen interactions, and prioritizing genes for TB progression studies.

This user-friendly approach will enable clinicians and researchers to utilize TBScan effectively, even with minimal computational knowledge, making it an adaptable tool for a wide range of users in both research and clinical settings.
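
A template-based prompt system of this kind might look like the following minimal sketch. The template names, wording, and field names are hypothetical and chosen only for illustration; they are not TBScan's actual templates.

```python
from string import Template

# Hypothetical templates; wording and field names are illustrative only.
PROMPT_TEMPLATES = {
    "drug_resistance": Template(
        "List genes associated with resistance to $drug in Mycobacterium tuberculosis."
    ),
    "host_pathogen": Template(
        "Describe the role of host gene $gene in host-pathogen interactions during TB infection."
    ),
    "progression": Template(
        "Rank the following genes by relevance to TB progression: $genes."
    ),
}


def build_prompt(template_name: str, **fields: str) -> str:
    """Fill a named template.

    Raises KeyError if the template name is unknown or a required field is missing.
    """
    return PROMPT_TEMPLATES[template_name].substitute(**fields)
```

A user would then only supply the variable part of the query, for example build_prompt("drug_resistance", drug="rifampicin"), while the surrounding phrasing stays fixed. Using strict substitution means an incomplete query fails early rather than sending a malformed prompt to the model.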

Data Sharing and Accessibility

To ensure the findings and tools are accessible to the scientific community, the following platforms will be used for data and resource sharing:

GitHub: All analysis scripts, bioinformatics pipelines, and the fine-tuned TBScan LLM will be made publicly available on GitHub.
Zenodo: Zenodo will be used to archive stable versions of the code and models, generating DOIs for each version to facilitate citation and ensure long-term preservation. The fine-tuned models, as well as the training datasets, will be stored here, with comprehensive metadata explaining their contents, conditions, and training parameters.
Figshare: The outcomes of the genomic analyses, including gene classifications, graphs, and reports, will be uploaded to Figshare in reusable formats.

This comprehensive approach ensures that TBScan and the associated findings will be widely accessible, driving advancements in TB research and supporting future efforts in understanding TB pathogenesis and treatment.

Impact / Scientific Significance

The development of TBScan, a tuberculosis-specific large language model, represents a significant advancement in the application of artificial intelligence to address the complexities of tuberculosis (TB). Trained on comprehensive TB-specific datasets, TBScan will provide actionable insights into critical areas such as TB pathogenesis, drug resistance, and host-genotype interactions.

One of the primary challenges in TB research and clinical practice is the underutilization and complexity of publicly available TB datasets. These datasets, which often include bulk and single-cell RNA sequencing data, require advanced computational expertise to analyze and interpret effectively. This complexity has kept them from reaching their full potential in both research and clinical settings. TBScan addresses this challenge by providing a user-friendly, TB-specific model that simplifies the analysis and interpretation of large, complex datasets.

By building on diverse data types from underutilized public resources, TBScan enables researchers and clinicians to navigate advanced molecular data without requiring deep computational knowledge. This capacity is critical for gene prioritization across various research contexts, such as TB progression, drug resistance, and host-pathogen interactions. More importantly, TBScan integrates this information to highlight gene-gene interactions and expression patterns across different cellular states and biological conditions within TB, offering a more nuanced understanding of molecular pathways involved in the disease.

In the context of global efforts to eradicate tuberculosis, which remains a significant public health problem, particularly in developing countries, TBScan will provide new insights into the mechanisms of TB pathogenesis and identify priority molecular targets. The model’s ability to interpret the complex gene networks involved in TB pathogenesis and progression positions it as a valuable tool for identifying effective therapeutic targets and supporting vaccine development. Given the growing threat of drug-resistant TB strains, TBScan will help identify the genetic mechanisms underlying resistance by analyzing gene expression profiles in resistant strains. This will help pinpoint molecular pathways that contribute to resistance, ultimately guiding the development of novel therapeutic strategies to combat drug-resistant TB.

By prioritizing context-specific gene interactions and providing knowledge-based insights, TBScan enables researchers to identify key molecular players and biomarkers with greater precision. The model offers clear, data-driven insights that will not only improve research outcomes but also facilitate more sophisticated investigations of TB that were previously limited by single-dataset approaches. This holistic, integrative model allows researchers to make more informed, robust decisions, advancing the broader understanding of TB and its underlying molecular complexities.

Team

Our team brings together a multidisciplinary blend of bioinformatics, machine learning, and clinical expertise, enabling us to tackle the complexities of tuberculosis (TB) research using advanced computational approaches like large language models (LLMs).

Amnah Siddiqa, PhD (https://scholar.google.com/citations?user=fYRqHlcAAAAJ&hl=en&oi=ao), is a bioinformatics expert with extensive experience in multi-omics data analysis, statistical modeling, and the application of machine learning in genomics. She has contributed to numerous projects integrating large-scale omics data, including transcriptomics, RNA-seq, and single-cell data. Her expertise in fine-tuning and benchmarking pre-trained models ensures the robust development of TBScan.
Mamadou D. Coulibaly, MSc, is a bioinformatician with strong expertise in computational biology, data integration, and high-throughput omics analyses. With his experience in managing and processing complex datasets, including genome-wide association studies (GWAS) and RNA-seq, he plays a pivotal role in ensuring the quality control, preprocessing, and integration of the diverse datasets used in this project.
Ilo Dicko, MD, is a medical doctor and epidemiologist with extensive experience in infectious diseases. Dr. Dicko will play an instrumental role in validating the model’s findings and ensuring they are aligned with real-world clinical challenges in TB diagnosis and treatment, particularly in resource-limited settings.

Considerations

- We will use trusted sources such as GREI repositories (Mendeley Data, Zenodo), PubMed, and GEO to obtain high-quality omics data representing diverse Mycobacterium tuberculosis strains and patient populations. Additionally, we will apply our own pipelines or reliable community-driven tools such as nf-core for quality control and preprocessing, ensuring consistent processing from raw data across all data types.

- Pre-trained LLMs (e.g., BioBERT, BioGPT) will be fine-tuned on TB-specific data. Cross-validation and testing on independent datasets will ensure generalizability, and performance will be measured with appropriate metrics.

- Collaboration between bioinformatics, machine learning, medicine, and epidemiology experts will ensure that clinical mechanisms of TB are validated and results are interpreted effectively. We will utilize cloud infrastructures (AWS, Google Cloud) and tools like TensorFlow and PyTorch for model training, supported by bioinformatics libraries.
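
To make the LoRA fine-tuning mentioned in this proposal more concrete, the following minimal sketch shows the arithmetic behind a low-rank update: a frozen weight matrix W (d x k) is adapted as W + (alpha / r) * B * A, where B (d x r) and A (r x k) are the only trainable factors. It uses plain Python lists rather than PyTorch or any LoRA library, and all function names are illustrative assumptions, not the API of an existing package.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    w is d x k (frozen); b is d x r and a is r x k (both trainable).
    """
    r = len(a)  # rank = number of rows of A
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update, same shape as w
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d: int, k: int, r: int) -> int:
    """Number of trainable parameters in the LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)
```

For a 768 x 768 weight matrix and rank 8, trainable_params(768, 768, 8) gives 12,288 trainable values, compared with 589,824 in the full matrix, which is why this kind of adaptation is attractive when compute is limited. When B is all zeros, as at the start of LoRA training, the effective weight equals the original W.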

Supporting Documents

Provide up to 10 resources for the evaluation of your secondary research project including but not limited to:

- The persistent identifier of the dataset(s), other than GREI dataset DOIs already listed above, to be used in the proposed project (where available)
- Tools/workflows or resources to be utilized in the proposed project
- Relevant references or scientific publications that directly relate to the proposed project

Supporting Document (1)

https://drive.google.com/file/d/1nAHpkFUq1zSg3ABU0uHWLLWJ2ynH6Src/view?usp=share_link

Supporting Document (2)

https://drive.google.com/drive/folders/1y4xDT7_wWpyoisD5VGRBAqgX2hRS28Tv?usp=share_link

Supporting Document (3)

https://scholar.google.com/citations?hl=en&user=fYRqHlcAAAAJ&view_op=list_works

Supporting Document (4)

https://github.com/microsoft/LoRA

Supporting Document (5)

https://github.com/dmis-lab/biobert-pytorch

Non Scored Criteria

Please complete this information. It will not be scored by the evaluation panel.

Entity Participation

Participate as an independent Team (i.e., registering as a group of individuals competing together but not on behalf of an established organization, institution, or corporation)

Research Discipline (non-scored criteria)

Bioinformatics, Machine Learning, Infectious Diseases

IDeA State (non-scored criteria)

All Team Member Information - Name, Organization, Job Title, and Email address

1. Amnah Siddiqa, Research Scientist, Computational Systems Biology,
Cincinnati Children's Hospital Medical Center; Email: amnah.siddiqa@cchmc.org
2. Mamadou D. Coulibaly, MSc Bioinformatics; Bioinformatician/Data Manager,
University Clinical Research Center (UCRC-Mali); Email: mdcoulibaly@icermali.org
3. Ilo Dicko, MD, MPH (Epidemiology and disease control)
Clinical Research Coordinator, University Clinical Research Center (UCRC)
Research Fellow, Faculty of Medicine and Odontostomatology (FMOS)
University of Sciences, Techniques and Technologies of Bamako (USTTB), Mali
Email: ilo@icermali.org

MSI (non-scored criteria)

Participation in prior DataWorks! Prizes (non-scored criteria)

Team Point of Contact Eligibility

yes

Eligibility (non-scored criteria)

Yes, I confirm that I have read and meet the terms of eligibility for this challenge
