Foundation models, or generative pre-trained transformer models, trained on publicly available single-cell RNA-sequencing datasets have demonstrated the capability to recognize and extract features of gene expression in cells. To fully realize the benefit of generative AI for biomedical research, these models need to be adapted to each specific disease. We focus on Alzheimer’s disease (AD), fine-tuning a foundation model on single-nuclei RNA-sequencing datasets to enable important downstream analyses, including cell type annotation, network analysis, and deconvolution of spatial transcriptomics data. We will use publicly available AD single-nuclei RNA-seq datasets for fine-tuning, and analysis code available on Zenodo for evaluation and interpretation of the results.
Foundation models, specifically generative pre-trained transformers (GPT) trained on extensive single-cell RNA sequencing (scRNA-seq) datasets, have shown promise in biological inference tasks such as cell-type identification, multi-omics integration, and gene regulatory network reconstruction. When fine-tuned for specific applications, these models outperform traditional methods that rely on marker genes and reference-based mapping (Yang et al., Nature Machine Intelligence, 2022; Cui et al., Nature Methods, 2024).
However, these models are not directly suitable for neurodegenerative diseases such as Alzheimer's disease (AD), which present unique cellular environments and gene expression patterns. In particular, they have not been effectively adapted to single-nuclei RNA sequencing (snRNA-seq) data, which are crucial for AD studies because snRNA-seq can be performed on frozen or fixed tissue samples, unlike scRNA-seq, which requires fresh samples. Our objective is to fine-tune a foundation model on AD-specific snRNA-seq data to enhance cell-type identification, a critical step for successful downstream analyses including differential gene expression, pathway enrichment, cell-cell communication, gene regulatory network inference, and spatial transcriptomics deconvolution.
Aim 1: Fine-tuning on AD-specific snRNA-seq data for cell-type identification. We will use scGPT (Cui et al., Nature Methods, 2024), a foundation model trained on over 30 million healthy cells from the CELLxGENE atlas. Our preliminary analysis shows that scGPT achieves only approximately 60% accuracy in cell-type identification when applied to AD patient snRNA-seq data. This shortfall is likely due to the unique gene expression patterns of AD and to the different aspects of the transcriptome captured by snRNA-seq compared with scRNA-seq. For fine-tuning, we will use snRNA-seq data from the ssREAD database (Wang et al., Nature Communications, 2024), which includes a large collection of human samples at varying AD severities, annotated with cell types.
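As an illustration of how cell-type identification accuracy of the kind quoted above could be scored, the following is a minimal sketch that compares predicted labels against reference annotations. The function name annotation_metrics and the toy labels are hypothetical and are not part of scGPT or ssREAD. Macro-averaged F1 is included as an assumption on our part, since overall accuracy alone can hide poor performance on rare cell types.

```python
from collections import Counter


def annotation_metrics(reference, predicted):
    """Return (accuracy, macro F1, per-type F1) for two equal-length label lists.

    Hypothetical helper for illustration only; labels are plain strings.
    """
    if len(reference) != len(predicted):
        raise ValueError("label lists must have equal length")
    if not reference:
        raise ValueError("no cells to evaluate")

    n_ref = Counter(reference)    # cells of each type in the reference annotation
    n_pred = Counter(predicted)   # cells assigned to each type by the model
    correct = Counter()           # correctly labeled cells (true positives) per type
    for ref, pred in zip(reference, predicted):
        if ref == pred:
            correct[ref] += 1

    accuracy = sum(correct.values()) / len(reference)

    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (n_pred + n_ref)
    f1 = {
        cell_type: 2 * correct[cell_type] / (n_ref[cell_type] + n_pred[cell_type])
        for cell_type in n_ref
    }
    macro_f1 = sum(f1.values()) / len(f1)
    return accuracy, macro_f1, f1


# Toy labels (made up for illustration, not real data).
reference = ["astrocyte", "astrocyte", "microglia", "microglia", "neuron"]
predicted = ["astrocyte", "microglia", "microglia", "microglia", "neuron"]
accuracy, macro_f1, per_type_f1 = annotation_metrics(reference, predicted)
```

Reporting per-type F1 alongside overall accuracy would show whether a gain after fine-tuning is spread across cell types or concentrated in the abundant ones.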
Aim 2: Streamlining downstream analysis and evaluation. For evaluation, we will replicate analyses from recent publications that profiled AD patients using snRNA-seq, using the original code deposited on Zenodo by the authors. These pipelines include a cell-type identification step followed by downstream analyses. Running the original authors' code will let us reproduce the figures in the papers while substituting our fine-tuned model for the cell-type identification step. This will allow us to qualitatively evaluate the changes and interpret the results, assessing how the improved cell-type identification affects conclusions about AD pathology.
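A simple quantitative complement to this qualitative comparison is to tabulate how many cells change label when the fine-tuned model replaces the original annotation step. The sketch below is an assumption about how such a check might look, not part of any published pipeline. The function names and toy labels are hypothetical.

```python
from collections import Counter


def relabeling_table(original, updated):
    """Count cells for each (original label, updated label) pair."""
    if len(original) != len(updated):
        raise ValueError("label lists must have equal length")
    return Counter(zip(original, updated))


def changed_fraction_by_type(original, updated):
    """For each original cell type, the fraction of its cells that were relabeled."""
    table = relabeling_table(original, updated)
    totals = Counter(original)
    return {
        cell_type: 1 - table[(cell_type, cell_type)] / totals[cell_type]
        for cell_type in totals
    }


# Toy labels (made up for illustration, not real data).
original = ["neuron", "neuron", "astrocyte", "astrocyte", "microglia"]
updated = ["neuron", "neuron", "astrocyte", "microglia", "microglia"]
table = relabeling_table(original, updated)
changed = changed_fraction_by_type(original, updated)
```

Cell types with a high relabeled fraction are the ones where reproduced figures are most likely to differ from the published versions, so they are natural places to start the interpretation.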
This project aims to adapt advanced AI techniques to the specific challenges of Alzheimer’s research, potentially leading to significant breakthroughs in understanding and treating this complex disease.
This research will tailor a generative pre-trained transformer model to Alzheimer's disease (AD) using snRNA-seq data, enhancing cell-type identification and gene expression analysis. Outcomes include:
Dissemination of Research Findings
Our findings will be submitted to peer-reviewed journals in computational biology, neuroscience, and bioinformatics, and presented at conferences such as the Alzheimer's Association International Conference and meetings of the International Society for Computational Biology. Preprints will be uploaded to bioRxiv or medRxiv. All data, models, and code will be publicly available on Zenodo, enabling other researchers to replicate and extend our work.
Adhering to FAIR Principles
Replicability and Reproducibility
Alzheimer’s Disease (AD) is the most common neurodegenerative disorder and a leading cause of dementia worldwide. Despite extensive research, the underlying mechanisms of AD progression remain poorly understood, with significant gaps in our knowledge about how gene expression changes across different cell types and brain regions contribute to disease pathology. Recent advances in single-cell RNA sequencing (scRNA-seq) and single-nuclei RNA sequencing (snRNA-seq) have enabled unprecedented resolution of cellular heterogeneity in the human brain, providing crucial insights into disease-relevant cell populations and gene expression programs. However, significant challenges remain in integrating this high-dimensional data into comprehensive models that can inform our understanding of AD and reveal actionable therapeutic targets.
While scRNA-seq has been instrumental in characterizing cellular states in various tissues, its application to AD research has been limited by technical constraints. scRNA-seq requires fresh, intact cells, which are difficult to obtain from postmortem brain tissue, the primary source of samples for AD studies. As a result, the use of scRNA-seq in AD is restricted, leaving major gaps in our ability to study cellular changes in advanced stages of the disease. In contrast, snRNA-seq, which isolates nuclei from frozen or fixed tissues, provides a more practical alternative for AD research, enabling access to archived brain samples from biobanks. However, snRNA-seq captures distinct transcriptomic features: it primarily detects nuclear RNA, including pre-mRNA (unspliced transcripts). This poses a challenge for models trained on scRNA-seq data, which are dominated by mature, cytoplasmic RNA. Reconciling these differences is crucial for developing robust models capable of leveraging both scRNA-seq and snRNA-seq data in AD.
By improving the accuracy of foundation models in detecting AD-specific gene expression patterns and integrating them with spatial transcriptomics, we aim to uncover previously unknown cellular interactions and regulatory pathways involved in AD. These insights have the potential to inform the development of new therapeutic strategies, particularly in targeting specific cell populations or spatial regions within the brain that contribute to disease progression. The proposed research could ultimately pave the way for precision medicine approaches tailored to the unique cellular and spatial landscape of AD.
Dr. Seong-Hwan Jun is an Assistant Professor in the Department of Biostatistics and Computational Biology at the University of Rochester. He holds a Ph.D. in Statistics with a focus on machine learning and has extensive experience in developing statistical methods to analyze single-cell data. His research has appeared at top-tier machine learning conferences such as ICML, NeurIPS, and AISTATS, and in Nature Communications.
Dr. Hyung Jin Ahn is an Assistant Professor in the Department of Pharmacology, Physiology, and Neuroscience at New Jersey Medical School, Rutgers University. He specializes in Alzheimer's disease research, particularly focusing on cerebrovascular dysfunction and its role in the pathogenesis of the disease. He has made numerous discoveries about the mechanisms underlying AD pathogenesis, and his research has been published in high-impact journals such as Neuron, Blood, and Proceedings of the National Academy of Sciences.
Drs. Jun and Ahn first collaborated during a workshop on the use of Artificial Intelligence in biomedical research. This meeting sparked a shared interest in leveraging AI models to improve research into Alzheimer’s disease. Dr. Jun will concentrate on fine-tuning the AI models and conducting downstream analyses with newly developed methods. Dr. Ahn will provide critical insights into Alzheimer’s disease that are essential for interpreting and solidifying the research findings.
Computational Resources: access to the University of Rochester’s high-performance computing center with NVIDIA A100 and H100 GPUs, as well as older models such as the K80 and V100, for testing, development, and deployment of the fine-tuned GPT model.
High-Quality Data: the ssREAD dataset, containing hundreds of curated samples at various AD stages, enables the model to learn diverse gene expression patterns, improving understanding of AD progression.
Secondary analysis: analysis code on Zenodo provides a means to reproduce published analyses and to compare the AI-based results with existing approaches.
Expertise: Dr. Ahn's knowledge of AD pathology ensures accurate contextualization of AI-enhanced analyses, while Dr. Jun’s machine learning expertise and Python/PyTorch skills support the optimization of neural network models. This long-term collaboration aims to leverage generative AI to uncover new insights into AD pathology, integrating single-cell and spatial transcriptomics data for significant discoveries in neurodegenerative diseases.