We wish to conduct a heterogeneity analysis and search for cell-subtypes within the cerebral organoid dataset collected by [Glasauer et al. 2022], as described below.
In our recent work, we found that Alzheimer's Disease (AD) exhibits heterogeneity at a genetic level.
This heterogeneity has a hierarchical structure involving (i) three different genetic correlation patterns surrounding the MAPT-gene at the base level, which then (ii) further subdivide into disease-specific clusters [Elman et al. 2024].
The MAPT-gene implicated in this heterogeneity encodes tau, which in turn impacts neuronal development and is a major risk factor for AD.
The organoid dataset from Glasauer et al. provides us with an excellent opportunity to study the heterogeneity linked to two such MAPT-mutations (i.e., tau-mutations).
By identifying the cell subtypes that emerge as a result of tau-mutation, we'll take an important step towards understanding the mechanisms underlying tau pathology, as well as AD.
Ultimately, we aim to identify genetic heterogeneity within the cerebral organoid dataset collected by Glasauer et al.
This dataset is stored on Dryad (doi.org/10.25349/D95898) and contains sn-RNA-seq data from 62 samples, comprising ~80K cells measured across ~30K genes.
Importantly, the samples in this dataset involve organoids with tau-mutations (337VM, 406RW and 406WW) as well as isogenic controls.
Moreover, the various cells across the samples have already been characterized by putative type (e.g., excitatory pyramidal neurons, astrocytes, etc.).
The original analysis conducted in Glasauer et al. assesses (for each cell-type) the gene-expression differences between the tau-mutants and the controls, generally assuming that each cell-type is a homogeneous group.
We will conduct a follow-up analysis to identify tau-induced heterogeneity: namely, to identify tau-specific 'biclusters' within each cell-type.
Each bicluster will involve a subset of mutant cells that exhibit genetic correlations (across a subset of genes) that are not shared by the corresponding isogenic controls.
Each bicluster of this kind can be thought of as a genetically characterized tau-specific cell-subtype.
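The definition above can be illustrated with a small synthetic sketch. This is not the loop-counting algorithm; it only shows what it means for a subset of mutant cells to share gene-gene correlations (across a subset of genes) that the controls lack. The function names, matrix sizes and planted signal are all hypothetical, and the data are randomly generated rather than drawn from the organoid dataset.

```python
import numpy as np


def mean_abs_offdiag_corr(X):
    """Mean absolute pairwise Pearson correlation between the columns (genes) of X."""
    C = np.corrcoef(X, rowvar=False)
    off_diagonal = C[~np.eye(C.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))


def make_toy_data(seed=0, n_cells=200, n_genes=50, n_bic_cells=60, n_bic_genes=8):
    """Synthetic cells-by-genes matrices; a bicluster is planted in the mutants only."""
    rng = np.random.default_rng(seed)
    mutant = rng.standard_normal((n_cells, n_genes))
    control = rng.standard_normal((n_cells, n_genes))
    # Hypothetical planted signal: the first n_bic_genes genes share one latent
    # factor, but only within the first n_bic_cells mutant cells.
    latent = rng.standard_normal((n_bic_cells, 1))
    noise = 0.3 * rng.standard_normal((n_bic_cells, n_bic_genes))
    mutant[:n_bic_cells, :n_bic_genes] = latent + noise
    return mutant, control


mutant, control = make_toy_data()
cells, genes = slice(0, 60), slice(0, 8)

# Correlation structure inside the planted block, and in the matching control block.
score_mutant = mean_abs_offdiag_corr(mutant[cells, genes])
score_control = mean_abs_offdiag_corr(control[cells, genes])
print(f"mutant block: {score_mutant:.2f}, control block: {score_control:.2f}")
```

In this toy setting the planted block is known in advance; the actual analysis has to search for such blocks without knowing which cells or genes are involved.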
Once identified, we will perform a differential gene- and pathway-analysis on each identified cell-subtype, contrasting the likely gene-interactions within each tau-specific cell-subtype against the other subtypes, as well as against the controls.
To actually identify the cell-subtypes (i.e., biclusters) described above, we will use our recently developed biclustering algorithm, referred to as 'loop-counting' [Zhou et al. 2024].
This loop-counting strategy has several advantages:
To start, our strategy is the first we are aware of that can detect biclusters enriched for asymmetric gene-gene-interactions such as gating and dependency, in addition to more traditional symmetric interactions like gene-gene-correlations.
Crucially, our loop-counting framework can correct for controls, identifying only those biclusters that are disease-specific, while also correcting for covariates, such as batch number; our workflow can also naturally accommodate missing data.
Finally, our algorithm uses a permutation test (against a label-shuffled null-hypothesis) to assign a p-value to each bicluster found, allowing the false-discovery rate to be controlled at a fixed level.
To the best of our knowledge, our algorithm outperforms other commonly used algorithms in the literature, including Louvain-clustering and the UMAP-clustering used in Glasauer's original analysis (see appendix of Zhou et al. 2024).
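The label-shuffled permutation test mentioned above can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the implementation from Zhou et al. 2024: the statistic used here (a simple difference in means), the function names and the synthetic data are all hypothetical stand-ins for a bicluster score computed on mutant versus control cells.

```python
import numpy as np


def permutation_p_value(statistic, data, labels, n_perm=999, seed=0):
    """One-sided p-value of statistic(data, labels) against a label-shuffled null."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = statistic(data, labels)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling the labels breaks any association between label and data.
        null[i] = statistic(data, rng.permutation(labels))
    # The +1 terms count the observed value as one draw from the null,
    # so the p-value can never be exactly zero.
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return float(observed), float(p)


def mean_difference(data, labels):
    """Hypothetical stand-in statistic: mean of label-1 values minus mean of label-0 values."""
    return float(data[labels == 1].mean() - data[labels == 0].mean())


# Synthetic example: 50 'control' (0) and 50 'mutant' (1) values with a planted shift.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)
data = rng.standard_normal(100)
data[labels == 1] += 2.0

observed, p = permutation_p_value(mean_difference, data, labels)
print(f"observed statistic: {observed:.2f}, permutation p-value: {p:.3f}")
```

With 999 shuffles the smallest attainable p-value is 0.001, so the number of permutations sets the resolution of the test.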
In terms of a timeline, we'll start by unpacking, cleaning and running basic diagnostics on the data.
Within two months we expect to have finished the primary biclustering analysis, and after another two months we will have finished the gene- and pathway-analyses.
Along the way, we expect to spend 2-4 months preparing our results for presentation and publication.
We expect to identify several statistically significant biclusters within the organoid dataset from Glasauer et al.
Notably, the original analyses did not look for the same kinds of biclusters that we described above.
For example, Glasauer et al. reported no statistically significant enrichment of mutant cells within the broader category of excitatory pyramidal neurons.
By contrast, our methods have been specifically designed to accurately assess heterogeneity within sn-RNA-seq data, and have already been shown to work on organoid-data.
Indeed, a preliminary analysis of this data-set suggests the existence of multiple distinguishable cell-subtypes within many of the original cell-types, including the excitatory pyramidal neurons.
We believe that the additional sensitivity afforded by our methodology will allow us to characterize the tau-driven subtypes within this heterogeneous landscape, paving the way for a more detailed understanding of tau pathology.
We will delineate each of these cell-subtypes in terms of (i) its p-value, (ii) the symmetric and asymmetric gene-gene-interactions that characterize it, and (iii) the most prominent gene pathways and interactions that distinguish it from the corresponding control cell-type.
We will also check to see if any of the subtypes are enriched for the AD-specific genes and pathways we identified in our earlier heterogeneity analysis [Elman et al. 2024].
We will write up a manuscript for publication in an AD-specific journal (e.g., the Journal of Alzheimer's Disease).
We will also prepare a poster presentation for the AAIC (Alzheimer's Association International Conference) in 2025.
One of the advantages of our biclustering methods is that they are automatic and deterministic, similar in some ways to principal-component-analysis (PCA).
These features mean that our results can be easily replicated and reproduced by other researchers.
In accordance with FAIR principles, we will share the software and scripts used to perform this analysis, as well as all the results, at github.com/adirangan.
The software and results will be packaged with a short vignette (i.e., tutorial) allowing others to run the same analysis and reproduce the results themselves.
The format for the output data files, including summary statistics, will be ascii-readable tables (e.g., csv-arrays with headers).
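As a rough sketch of what such a table might look like, the snippet below writes and re-reads a CSV file with a header row using Python's standard csv module. The column names and all values are hypothetical placeholders chosen for illustration; they are not results, and the final output schema may differ.

```python
import csv
import io

# Hypothetical column names for a per-bicluster summary table.
FIELDS = ["cell_type", "bicluster_id", "n_cells", "n_genes", "p_value"]


def write_summary_csv(rows, handle):
    """Write one header line followed by one line per bicluster."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_summary_csv(handle):
    """Read the table back as a list of dicts keyed by the header names."""
    return list(csv.DictReader(handle))


# Placeholder rows (not real results), used only to show the layout.
rows = [
    {"cell_type": "example_type_A", "bicluster_id": 1,
     "n_cells": 120, "n_genes": 35, "p_value": 0.004},
    {"cell_type": "example_type_B", "bicluster_id": 1,
     "n_cells": 80, "n_genes": 22, "p_value": 0.03},
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
write_summary_csv(rows, buffer)
buffer.seek(0)
loaded = read_summary_csv(buffer)
print(buffer.getvalue())
```

Values read back through csv.DictReader are strings, so numeric columns need an explicit conversion on the reader's side.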
This particular project does not use data directly associated with individuals, and so the CARE principles do not directly apply.
Alzheimer's Disease (AD) accounts for most dementia cases in the United States, with an estimated 6.5 million individuals over the age of 65 currently suffering from the disease, a number that is expected to increase drastically in the coming decades.
The disease comes with enormous economic costs for the country, as well as a devastating personal cost for patients and their loved ones.
While there has been a concerted effort towards treatment and prevention of AD, clinical trials have had limited success in preventing AD-related cognitive decline.
Indeed, while the FDA has approved two anti-amyloid drugs, there has been controversy over their impact on clinical worsening.
Thus, treating AD remains an ongoing public health priority, and better explaining its disease etiology will facilitate these efforts.
AD is a complex disease involving a collection of symptoms including amnestic impairment, neurodegeneration and cognitive decline that eventually leads to loss of everyday functioning.
While the etiology of AD remains unclear, it is likely due to a combination of genetic and environmental factors.
Moreover, while there are many prototypical features of AD which hold in aggregate, it is widely acknowledged that there is significant variability in how AD presents across individuals.
Several studies have attempted to characterize this phenotypic variability; however, there is still a great deal of unexplained heterogeneity in AD presentation, with the potential for distinct disease subtypes.
To make matters more complicated, there is no guarantee that a characterization of this phenotypic heterogeneity (e.g., cognitive, pathological and atrophy subtypes of AD) will help cleanly delineate homogeneous subgroups of genetic risk, or allow for better risk assessment.
With this in mind, our current project focuses on directly identifying potential sources of genetically-driven heterogeneity in AD, without relying on prior phenotypic classification.
Once identified, the structure underlying this genetic heterogeneity (in the form of tau-specific cell-subtypes) can help trace a route from genetic risk through to disease phenotype, and can help clarify which pathways are impacted by different forms of the disease etiology (e.g., certain tau-mutations).
In this sense, we hope that this project will provide an 'anchor point' we can build on to more fully characterize the downstream impacts of genetic heterogeneity in AD.
In summary, we hope that by identifying tau-specific genetic cell-subtypes we can help better reclassify and/or predict tau-pathology and AD-prognosis, including disease trajectory and disease response at the individual level.
Aaditya Rangan is an associate-professor in the applied mathematics department at New York University who has worked in computational biology and bioinformatics for over two decades.
Jeremy Elman is an assistant-professor in the department of psychiatry at the University of California at San Diego, and has studied Alzheimer’s disease for over a decade.
Caroline McGrouther is an MD PhD from the University of California at San Diego, and has been working as a researcher in bioinformatics for several years.
Haosheng Zhou was a master's student of A. Rangan, and is currently working towards a PhD in statistics and applied probability at the University of California at Santa Barbara.
We initially began collaborating as part of an NIH grant (U19AG023122) and have published several papers together [e.g., Schork and Elman 2023, Zhou et al. 2024, Elman et al. 2024 and McGrouther et al. 2024].
Zhou is well on his way towards his PhD, and together Rangan, McGrouther, Elman and Zhou have a strong record of research involving statistical analysis, with a strong focus on bioinformatics.
Perhaps most importantly, through the course of our collaborations we have developed, implemented and tested the biclustering method described above.
This includes the original loop-counting method [Rangan 2012, Rangan et al. 2018], as well as the more recent application to sn-RNA-seq data and the extension to incorporate asymmetric gene-gene-interactions [Zhou et al. 2024].
As mentioned above, we recently published two papers exploring the genetic heterogeneity of AD.
These results illustrate the genetic heterogeneity of AD in terms of both the underlying genotype [Elman et al. 2024] and at the pathway-level [Schork and Elman 2023].
Furthermore, the hierarchical nature of AD genetic heterogeneity is strongly driven by stratification in the correlation-structure across SNPs surrounding the MAPT-gene.
This evidence strongly motivates our proposed study of the MAPT-mutations in the organoid-dataset of Glasauer et al.
In addition, we believe that we are uniquely poised to search for heterogeneity within this dataset, as our recently developed biclustering algorithms are specifically designed for just such a problem.
In summary, we are certain that we have the expertise, the computational tools and the manpower to carry out this project.