Submission

introduction

title

Fine-tuning GPT model for Alzheimer’s research

short description

Refining AI models for Alzheimer's using single nuclei RNA-seq data to enhance cell-type identification and gene expression analysis.

Phase 1 Submission Form

Overview / Abstract

Foundation models, or generative pre-trained transformer models, trained on publicly available single-cell RNA-sequencing datasets have demonstrated the capability to recognize and extract features of gene expression in cells. To fully realize the benefit of generative AI for biomedical research, these models need to be adapted to each specific disease. We focus on Alzheimer’s disease (AD), fine-tuning a foundation model on single-nuclei RNA-sequencing datasets to enable important downstream analyses, including cell type annotation, network analysis, and deconvolution of spatial transcriptomics data. We will use publicly available AD single-nuclei RNA-seq datasets for fine-tuning, and analysis code available on Zenodo for evaluation and interpretation of the results.

Secondary Analysis: Research Aims

Foundation models, specifically generative pre-trained transformers (GPT), trained on extensive scRNA-seq datasets, have shown promise in biological inference tasks like cell-type identification, multi-omics integration, and gene regulatory network reconstruction. These models outperform traditional methods that rely on marker genes and reference-based mapping when fine-tuned for specific applications (Yang et al., Nature Machine Intelligence, 2022; Cui et al., Nature Methods, 2024).

However, these models are not directly suitable for neurodegenerative diseases like Alzheimer's Disease (AD), which present unique cellular environments and gene expression patterns. In particular, they have not been effectively adapted for snRNA-seq data. snRNA-seq is crucial for AD studies because it can be performed on frozen or fixed tissue samples, unlike scRNA-seq, which requires fresh samples. Our objective is to fine-tune a foundation model on AD-specific snRNA-seq data to enhance cell-type identification, a critical step for successful downstream analyses including differential gene expression, pathway enrichment, cell-cell communication, gene regulatory network inference, and spatial transcriptomics deconvolution.

Aim 1: Fine-tuning on AD-specific snRNA-seq data for cell-type identification. We will use scGPT (Cui et al., Nature Methods, 2024), a foundation model trained on over 30 million healthy cells from the CELLxGENE atlas. Our preliminary analysis shows that scGPT achieves only approximately 60% accuracy in cell-type identification when applied to AD patient snRNA-seq data. This gap is likely due to the unique gene expression patterns of AD and the different transcriptomic features captured by snRNA-seq compared to scRNA-seq. For fine-tuning, we will use snRNA-seq data from the ssREAD database (Wang et al., Nature Communications, 2024), which includes a large collection of human samples at varying AD severities, annotated with cell types.
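As an illustration of how cell-type identification accuracy of the kind quoted above could be measured, the following is a minimal sketch that compares predicted labels against reference annotations and reports overall accuracy together with macro-averaged F1. The function name `annotation_metrics` and the example labels are hypothetical and are not taken from scGPT, ssREAD, or our preliminary analysis; macro-F1 is included only as one reasonable complement to accuracy when cell types are imbalanced.

```python
from collections import Counter


def annotation_metrics(reference, predicted):
    """Return (accuracy, macro_f1) for per-cell labels.

    `reference` and `predicted` are equal-length sequences of cell-type
    names. Hypothetical helper for illustration only.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if len(reference) == 0:
        raise ValueError("no cells to evaluate")

    # Overall accuracy: fraction of cells whose predicted label matches.
    correct = sum(r == p for r, p in zip(reference, predicted))
    accuracy = correct / len(reference)

    # Per-type counts needed for precision and recall.
    true_pos = Counter(r for r, p in zip(reference, predicted) if r == p)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)

    # Macro-F1: unweighted mean of per-type F1 over the reference cell types.
    f1_scores = []
    for cell_type, n_ref in ref_counts.items():
        n_pred = pred_counts[cell_type]
        precision = true_pos[cell_type] / n_pred if n_pred else 0.0
        recall = true_pos[cell_type] / n_ref
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)
    return accuracy, macro_f1


# Toy example with made-up labels (not real data).
reference = ["astrocyte", "astrocyte", "microglia", "neuron", "neuron"]
predicted = ["astrocyte", "neuron", "microglia", "neuron", "neuron"]
accuracy, macro_f1 = annotation_metrics(reference, predicted)
print(f"accuracy={accuracy:.2f}, macro_f1={macro_f1:.2f}")
```

Reporting both numbers before and after fine-tuning would show whether gains come from abundant cell types only or also from rare ones.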

Aim 2: Streamlining downstream analysis and evaluation. For evaluation, we will replicate analyses from recent publications that profiled AD patients using snRNA-seq, using the original code deposited on Zenodo by the authors. These analyses comprise a cell-type identification step followed by downstream analysis. Using the original authors' code will let us reproduce the figures in the papers while substituting our fine-tuned model for the cell-type identification step. This will allow us to qualitatively evaluate the changes and interpret the results, assessing the impact of our model enhancements on understanding AD pathology.
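One simple way to see where swapping the annotation step changes a published analysis is to cross-tabulate the original labels against the labels from the fine-tuned model. The sketch below is an assumption about how such a comparison might be set up, not code from any of the Zenodo records; the function name `label_changes` and the toy labels are hypothetical.

```python
from collections import Counter


def label_changes(original, updated):
    """Summarize how per-cell labels differ between two annotation runs.

    Returns (agreement, reassigned), where `agreement` is the fraction of
    cells with identical labels and `reassigned` maps
    (original_label, updated_label) pairs to the number of cells that moved.
    Hypothetical helper for illustration only.
    """
    if len(original) != len(updated):
        raise ValueError("both annotation runs must cover the same cells")
    if len(original) == 0:
        raise ValueError("no cells to compare")

    # Contingency table of (original, updated) label pairs.
    table = Counter(zip(original, updated))
    unchanged = sum(n for (old, new), n in table.items() if old == new)
    agreement = unchanged / len(original)
    reassigned = {pair: n for pair, n in table.items() if pair[0] != pair[1]}
    return agreement, reassigned


# Toy example with made-up labels (not real data).
original = ["microglia", "microglia", "astrocyte", "neuron"]
updated = ["microglia", "astrocyte", "astrocyte", "neuron"]
agreement, reassigned = label_changes(original, updated)
print(f"agreement={agreement:.2f}")
for (old, new), n in sorted(reassigned.items()):
    print(f"{old} -> {new}: {n} cell(s)")
```

Cell types with many reassigned cells are the ones whose downstream figures are most likely to change and would be the natural starting point for qualitative review.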

This project aims to adapt advanced AI techniques to the specific challenges of Alzheimer’s research, potentially leading to significant breakthroughs in understanding and treating this complex disease.

GREI Repository Data Sets

Zenodo (CERN and Northwestern University)

DOI (Digital Object Identifier) of GREI Repository Dataset

Zenodo repository containing the analysis code for secondary analysis
- https://zenodo.org/records/10460116
- https://zenodo.org/records/11051021
- https://zenodo.org/records/4681643
- https://zenodo.org/records/7630313
- https://zenodo.org/records/10729969
- https://zenodo.org/records/13315411
- https://zenodo.org/records/10460196

Outcomes and Outputs

This research will refine a generative pre-trained transformer model tailored for Alzheimer's disease (AD) using snRNA-seq data, enhancing cell-type identification and gene expression analysis. Outcomes include:

- Improved cell-type identification accuracy in AD brain tissues.
- Detection of AD-specific gene expression changes and their implications for disease pathology.
- Reproduction and re-evaluation of previous AD studies, potentially altering conclusions based on our enhanced model.
- Insights into spatial gene expression patterns within the AD brain, identifying potential therapeutic targets.

Dissemination of Research Findings

Our findings will be submitted to peer-reviewed journals in computational biology, neuroscience, and bioinformatics, and presented at conferences such as the Alzheimer's Association International Conference and the International Society for Computational Biology. Preprints will be uploaded to bioRxiv or medRxiv. All data, models, and code will be publicly available on Zenodo, enabling other researchers to replicate and extend our work.

Adhering to FAIR Principles

Findable and Accessible: Data and code will be hosted on Zenodo with appropriate tagging and DOIs provided in all related publications. Open software usage licenses will be applied to software releases.
Interoperable and Reusable: We will create a detailed tutorial in a notebook format for both R and Python users, documenting how to utilize our fine-tuned model for downstream analysis. This tutorial will be applicable to any snRNA-seq-based AD research.

Replicability and Reproducibility

Standardized Protocols: All data collection, processing, and analysis will follow standardized protocols, with comprehensive documentation published alongside our research findings.
Version Control: We will use version control systems (Git) to manage all analytical scripts and pipelines, ensuring consistency across the analysis.
Validation and Cross-Validation: We will validate our results using independent datasets and make the code available on Zenodo for transparency. This approach allows us to assess and confirm the robustness of our findings compared to original analyses.
Secondary Analysis: By sharing our methods and models extensively, we enable other researchers to replicate our analyses on different datasets, supporting the reproducibility of secondary analysis results.

Impact / Scientific Significance

Alzheimer’s Disease (AD) is the most common neurodegenerative disorder and a leading cause of dementia worldwide. Despite extensive research, the underlying mechanisms of AD progression remain poorly understood, with significant gaps in our knowledge about how gene expression changes across different cell types and brain regions contribute to disease pathology. Recent advances in single-cell RNA sequencing (scRNA-seq) and single-nuclei RNA sequencing (snRNA-seq) have enabled unprecedented resolution of cellular heterogeneity in the human brain, providing crucial insights into disease-relevant cell populations and gene expression programs. However, significant challenges remain in integrating this high-dimensional data into comprehensive models that can inform our understanding of AD and reveal actionable therapeutic targets.

While scRNA-seq has been instrumental in characterizing cellular states in various tissues, its application to AD research has been limited by technical constraints. scRNA-seq requires fresh, intact cells, which are difficult to obtain from postmortem brain tissue—the primary source of samples for AD studies. As a result, the use of scRNA-seq in AD is restricted, leaving major gaps in our ability to study cellular changes in advanced stages of the disease. In contrast, snRNA-seq, which isolates nuclei from frozen or fixed tissues, provides a more practical alternative for AD research, enabling access to archived brain samples from biobanks. However, snRNA-seq captures distinct transcriptomic features, primarily detecting nuclear RNA, including pre-mRNA (unspliced transcripts), which poses a challenge for models trained on scRNA-seq data that focus on mature, cytoplasmic RNA. Reconciling these differences is crucial for developing robust models capable of leveraging both scRNA-seq and snRNA-seq data in AD.

By improving the accuracy of foundational models in detecting AD-specific gene expression patterns and integrating them with spatial transcriptomics, we aim to uncover previously unknown cellular interactions and regulatory pathways involved in AD. These insights have the potential to inform the development of new therapeutic strategies, particularly in targeting specific cell populations or spatial regions within the brain that contribute to disease progression. The proposed research could ultimately pave the way for precision medicine approaches tailored to the unique cellular and spatial landscape of AD.

Team

Dr. Seong-Hwan Jun is an Assistant Professor in the Department of Biostatistics and Computational Biology at the University of Rochester. He holds a Ph.D. in Statistics with a focus on machine learning and has extensive experience in developing statistical methods to analyze single-cell data. His research has been published at top-tier machine learning conferences such as ICML, NeurIPS, and AISTATS, and in Nature Communications.

Dr. Hyung Jin Ahn is an Assistant Professor in the Department of Pharmacology, Physiology, and Neuroscience at New Jersey Medical School, Rutgers University. He specializes in Alzheimer's disease research, focusing in particular on cerebrovascular dysfunction and its role in the pathogenesis of the disease. He has made numerous discoveries about the mechanisms underlying AD pathogenesis, and his research has been published in high-impact journals such as Neuron, Blood, and Proceedings of the National Academy of Sciences.

Drs. Jun and Ahn first collaborated during a workshop on the use of Artificial Intelligence in biomedical research. This meeting sparked a shared interest in leveraging AI models to improve research into Alzheimer’s disease. Dr. Jun will concentrate on the fine-tuning of AI models and conducting downstream analyses with newly developed methods. Meanwhile, Dr. Ahn will provide critical insights into Alzheimer’s disease that are essential for interpreting and solidifying the research findings.

Considerations

Computational Resources: we have access to the University of Rochester’s high-performance computing center, which provides NVIDIA A100 and H100 GPUs as well as older models such as K80 and V100, for testing, development, and deployment of the fine-tuned GPT model.

High-Quality Data: the ssREAD dataset, containing hundreds of curated samples at various AD stages, enables the model to learn diverse gene expression patterns, improving understanding of AD progression.

Secondary analysis: analysis code on Zenodo provides a means to reproduce and compare the AI-based results to existing approaches.

Expertise: Dr. Ahn's knowledge of AD pathology ensures accurate contextualization of the AI-enhanced analyses, while Dr. Jun’s machine learning expertise and Python/PyTorch skills support optimization of the neural network models. This long-term collaboration aims to leverage generative AI to uncover new insights into AD pathology, integrating single-cell and spatial transcriptomics data for significant discoveries in neurodegenerative diseases.

Supporting Documents

Provide up to 10 resources for the evaluation of your secondary research project including but not limited to:
- The persistent identifier of the dataset(s), other than GREI dataset DOIs already listed above, to be used in the proposed project (where available)
- Tools/workflows or resources to be utilized in the proposed project
- Relevant references or scientific publications that directly relate to the proposed project

Supporting Document (1)

https://www.nature.com/articles/s41467-024-49133-z

Supporting Document (2)

https://www.nature.com/articles/s41592-024-02201-0

Supporting Document (3)

https://www.nature.com/articles/s42256-022-00534-z

Supporting Document (4)

https://bmblx.bmi.osumc.edu/ssread/

Supporting Document (5)

https://zenodo.org/records/10466117

Supporting Document (6)

https://zenodo.org/records/6572672

Non Scored Criteria

Please complete this information. It will not be scored by the evaluation panel.

Entity Participation

Participate as an independent Team (i.e., registering as a group of individuals competing together but not on behalf of an established organization, institution, or corporation)

Research Discipline (non-scored criteria)

Alzheimer's disease.
Single cell transcriptomics.
Artificial intelligence.
Machine learning.
Biostatistics.

IDeA State (non-scored criteria)

All Team Member Information - Name, Organization, Job Title, and Email address

Captain: Dr. Hyung Jin Ahn, New Jersey Medical School, Rutgers University, Assistant Professor, hyungjin.ahn@rutgers.edu.
Member: Dr. Seong-Hwan Jun, University of Rochester Medical Center, Assistant Professor, seonghwan_jun@urmc.rochester.edu.

MSI (non-scored criteria)

Participation in prior DataWorks! Prizes (non-scored criteria)

Team Point of Contact Eligibility

yes

Eligibility (non-scored criteria)

Yes, I confirm that I have read and meet the terms of eligibility for this challenge
