Tuberculosis (TB) remains a major public health concern, with 10.6 million cases and 1.3 million deaths globally in 2022. TB is a complex disease influenced by immune status, living conditions, and genetic predispositions, making control efforts challenging. This project proposes to develop TBScan, a tuberculosis-specific Large Language Model (LLM), trained on transcriptomics, whole genome sequencing (WGS), and single-cell datasets. We aim to fine-tune pre-trained LLMs such as BioBERT and BioGPT using TB-specific data to prioritize genes involved in TB pathogenesis, integrating genomic insights such as SNPs and host-pathogen interactions. We will benchmark these models, fine-tune them using LoRA (low-rank adaptation), and assess their ability to classify TB-relevant genes, such as those involved in drug resistance and virulence. By delivering a robust gene prioritization tool that leverages prior knowledge and data, this approach will advance TB research and support improvements in diagnosis, treatment, and prevention.
Data Collection:
We will scan repositories such as Mendeley, Zenodo, PubMed, and GEO to identify publicly available TB datasets, including transcriptomics, WGS, and single-cell data. We aim to collect data on TB pathogenesis, host-genotype interactions, and drug resistance.
Inclusion Criteria:
We anticipate working with 10-20 high-quality datasets per data type, totaling over 10,000 samples. We will filter datasets based on quality control metrics, allowing up to 10% missingness. Our team has extensive experience managing heterogeneous datasets and addressing missing data, batch effects, and biological variability.
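As a rough illustration of the missingness criterion above, the sketch below keeps only datasets whose overall fraction of missing values is at most 10%. The data layout (lists of samples with None marking a missing value), the function names, and the toy dataset names are assumptions made for illustration, not part of the project pipeline.

```python
from typing import Dict, List, Optional

# Assumed layout: a dataset is a list of samples, and each sample is a list
# of measurements in which None marks a missing value.
Dataset = List[List[Optional[float]]]

MAX_MISSINGNESS = 0.10  # "up to 10% missingness" from the inclusion criteria


def missing_fraction(dataset: Dataset) -> float:
    """Return the fraction of missing (None) values across all samples."""
    total = sum(len(sample) for sample in dataset)
    if total == 0:
        return 1.0  # treat an empty dataset as fully missing
    missing = sum(value is None for sample in dataset for value in sample)
    return missing / total


def filter_datasets(
    datasets: Dict[str, Dataset], threshold: float = MAX_MISSINGNESS
) -> Dict[str, Dataset]:
    """Keep datasets whose overall missingness does not exceed the threshold."""
    return {
        name: data
        for name, data in datasets.items()
        if missing_fraction(data) <= threshold
    }


# Toy, hypothetical datasets used only to exercise the filter.
candidates = {
    "toy_complete": [[1.0, 2.0], [3.0, 4.0]],  # no missing values
    "toy_sparse": [[1.0, None], [None, 4.0]],  # half the values missing
}
kept = filter_datasets(candidates)
print(sorted(kept))
```

In practice the threshold could also be applied per sample or per gene rather than per dataset; the proposal does not specify which, so the per-dataset choice here is an assumption.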
Ethical Considerations:
All data handling will adhere to ethical guidelines, ensuring anonymity and compliance with FAIR principles. We will guard against bias in model outputs by ensuring non-discriminatory use of demographic data, and we will follow informed consent guidelines for publicly available datasets.
Methods:
Timeline:
Outcomes and Outputs
The main outcome of this project will be the development of TBScan, a tuberculosis-specific Large Language Model (TB-LLM), fine-tuned to prioritize genes associated with TB pathogenesis. The key findings and capabilities of TBScan will include:
In addition, TBScan will classify genes based on their biological roles, such as resistance to antibiotics, virulence, and transcriptional regulation, addressing gaps in TB research, including the complexity of non-coding RNAs involved in TB pathogenesis.
Improved Prompt Templates for Researchers and Clinicians
One of the major innovations of TBScan will be its template-based prompt system, designed to help clinicians and researchers interact with the model with ease. These predefined templates will:
This user-friendly approach will enable clinicians and researchers to utilize TBScan effectively, even with minimal computational knowledge, making it an adaptable tool for a wide range of users in both research and clinical settings.
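A minimal sketch of how such a predefined template might be filled is given below, using Python's standard string.Template. The template wording, the field names (context, genes), and the build_prompt helper are hypothetical; the gene symbols are used only as example input.

```python
from string import Template
from typing import List

# Hypothetical template; the wording and field names are illustrative only.
GENE_PRIORITIZATION_TEMPLATE = Template(
    "Rank the following genes by their likely relevance to $context "
    "in tuberculosis: $genes. Report a short rationale for each gene."
)


def build_prompt(context: str, genes: List[str]) -> str:
    """Fill the template after basic validation of the user's inputs."""
    cleaned = [g.strip() for g in genes if g.strip()]
    if not context.strip():
        raise ValueError("context must not be empty")
    if not cleaned:
        raise ValueError("at least one gene symbol is required")
    return GENE_PRIORITIZATION_TEMPLATE.substitute(
        context=context.strip(), genes=", ".join(cleaned)
    )


# Example input: a research context and a short list of gene symbols.
prompt = build_prompt("drug resistance", ["katG", "rpoB", " inhA "])
print(prompt)
```

Keeping the template text separate from the validation logic means a non-programmer would only need to supply the context and the gene list.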
Data Sharing and Accessibility
To ensure the findings and tools are accessible to the scientific community, the following platforms will be used for data and resource sharing:
This comprehensive approach ensures that TBScan and the associated findings will be widely accessible, driving advancements in TB research and supporting future efforts in understanding TB pathogenesis and treatment.
The development of TBScan, a tuberculosis-specific large language model, represents a significant advancement in the application of artificial intelligence to address the complexities of tuberculosis (TB). Trained on comprehensive TB-specific datasets, TBScan will guide actionable insights into critical areas such as TB pathogenesis, drug resistance, and host-genotype interactions.
One of the primary challenges in TB research and clinical practice is the underutilization and complexity of publicly available TB datasets. These datasets, which often include transcriptomic data such as bulk RNA-seq and single-cell RNA sequencing, require advanced computational expertise to analyze and interpret effectively. This complexity has kept them from reaching their full potential in both research and clinical settings. TBScan addresses this challenge by providing a user-friendly, TB-specific model that simplifies the analysis and interpretation of large, complex datasets.
By building on diverse data types from underutilized public resources, TBScan enables researchers and clinicians to navigate advanced molecular data without requiring deep computational knowledge. This capacity is critical for gene prioritization across various research contexts, such as TB progression, drug resistance, and host-pathogen interactions. More importantly, TBScan integrates this information to highlight gene-gene interactions and expression patterns across different cellular states and biological conditions within TB, offering a more nuanced understanding of molecular pathways involved in the disease.
In the context of global efforts to eradicate tuberculosis, which remains a significant public health problem, particularly in developing countries, TBScan will provide new insights into the mechanisms of TB pathogenesis and identify priority molecular targets. The model’s ability to interpret the complex gene networks involved in TB pathogenesis and progression positions it as a critical tool for identifying effective therapeutic targets and supporting vaccine development. Given the growing threat of drug-resistant TB strains, TBScan will play a crucial role in identifying the genetic mechanisms underlying resistance by analyzing gene expression profiles in resistant strains. This will help pinpoint molecular pathways that contribute to resistance, ultimately guiding the development of novel therapeutic strategies to combat drug-resistant TB.
By prioritizing context-specific gene interactions and providing knowledge-based insights, TBScan enables researchers to identify key molecular players and biomarkers with greater precision. The model offers clear, data-driven insights that will not only improve research outcomes but also facilitate more sophisticated investigations of TB that were previously limited by single-dataset approaches. This holistic, integrative model allows researchers to make more informed, robust decisions, advancing the broader understanding of TB and its underlying molecular complexities.
Our team brings together a multidisciplinary blend of bioinformatics, machine learning, and clinical expertise, enabling us to tackle the complexities of tuberculosis (TB) research using advanced computational approaches like large language models (LLMs).
- We will use trusted repositories like GREI, PubMed, and GEO to source high-quality omics data representing diverse Mycobacterium tuberculosis strains and patient populations. We will apply our own pipelines or reliable community-driven tools such as nf-core for quality control and preprocessing, so that every data type is processed consistently from the raw data onward.
- Pre-trained LLMs (e.g., BioBERT, BioGPT) will be fine-tuned on TB-specific data. Cross-validation and testing on independent datasets will ensure generalizability, and performance will be measured with appropriate metrics.
- Collaboration between bioinformatics, machine learning, medicine, and epidemiology experts will ensure that clinical mechanisms of TB are validated and results are interpreted effectively. We will utilize cloud infrastructures (AWS, Google Cloud) and tools like TensorFlow and PyTorch for model training, supported by bioinformatics libraries.
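
One way to read "testing on independent datasets" in the fine-tuning point above is a leave-one-dataset-out scheme, in which all samples from the held-out dataset are excluded from training. The sketch below illustrates only the splitting step; the function name, sample identifiers, and dataset labels are hypothetical, and the proposal does not commit to this particular scheme.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_dataset_out(
    sample_to_dataset: Dict[str, str],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_dataset, train_samples, test_samples) per dataset.

    No dataset contributes samples to both sides of a split, so the test
    set is independent of the training data at the dataset level.
    """
    for held_out in sorted(set(sample_to_dataset.values())):
        train = sorted(s for s, d in sample_to_dataset.items() if d != held_out)
        test = sorted(s for s, d in sample_to_dataset.items() if d == held_out)
        yield held_out, train, test


# Hypothetical sample-to-dataset assignment used only for illustration.
samples = {"s1": "study_A", "s2": "study_A", "s3": "study_B", "s4": "study_C"}
splits = list(leave_one_dataset_out(samples))
for held_out, train, test in splits:
    print(held_out, "train:", train, "test:", test)
```

Splitting by dataset rather than by sample prevents batch-specific signal in one study from appearing in both training and test data, which would otherwise inflate performance estimates.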