DiSCO
Team membership remains the same as in the first submission, except that Kavi has obtained her MS degree.
Kuan-lin Huang leads the team as an Associate Professor at Icahn School of Medicine, bringing extensive expertise in genomics, AI, and statistical methods for biomedical applications. His experience spans leading large-scale genomic studies and developing AI-driven approaches to improve human health.
Kavitharini Saravanan holds a Master's degree in bioinformatics from the University of Charlotte. She brings skills in data curation and computational biology, focusing on integrating large datasets.
Aryan Saharan, an undergraduate student at St. Louis University, contributes experience in biomedical literature search and statistical analysis, supporting data preprocessing.
Catrina Yang, a medical student at the University of Oxford, brings a clinical perspective, ensuring the project aligns with medical relevance and facilitates the translation of research findings into health applications.
doi: 10.5281/zenodo.14261274
doi: 10.5281/zenodo.4739739
doi: 10.5281/zenodo.13241021
doi: 10.5281/zenodo.10515792
doi: 10.5281/zenodo.13635676
doi: 10.5281/zenodo.12752107
doi: 10.17632/pcftzv8w63.1
doi: 10.17632/vs8m5gkyfn.1
doi: 10.17632/f2v94hj7jm.1
doi: 10.5281/zenodo.7113422
doi: 10.5281/zenodo.1476122
doi: 10.17632/gncg57p5x9.2
doi: 10.5281/zenodo.4013713
doi: 10.5281/zenodo.3572422
doi: 10.5281/zenodo.8076402
doi: 10.5281/zenodo.6282659
doi: 10.5281/zenodo.7469683
doi: 10.17632/dvp4y5ttd5.1
doi: 10.17632/m4rfg9wb29.1
doi: 10.5281/zenodo.7561827
doi: 10.5281/zenodo.13863809
doi: 10.17632/7g5cftcbpv.1
doi: 10.5281/zenodo.3483177
doi: 10.5281/zenodo.4271105
(some not listed due to space)
Single-cell omic atlases are transforming biology and medicine, yet their demographic representativeness has not been systematically evaluated. We analyzed >13,500 samples from the Human Cell Atlas (HCA), Human Tumor Atlas Network (HTAN), and PsychAD Consortium. We found a striking, pervasive over-representation of European individuals and under-representation of Asian and Latino individuals. Nearly 70% of HCA samples lacked ancestry annotation; PsychAD was almost two-thirds European, with Asians under-represented; and HTAN tumors were 69% European, with several cancer types also showing higher female fractions than sex-specific US incidence. These disparities highlight that current single-cell resources risk embedding inequities into AI foundation models, biomarker discovery, and therapeutic development. We provide a practical checklist for research teams to produce single-cell datasets with fair representation and complete metadata to maximize translational value.
(Preprinted; in review at Nat. Genet.)
We performed a secondary analysis to systematically evaluate demographic representation across large-scale single-cell resources, with the overarching goal of assessing whether these flagship datasets reflect the diversity of the populations they are intended to represent.
Inclusion of GREI Repository Data
Data harmonized and curated within the GREI repository served as enabling infrastructure for this work. In particular, most of the Human Cell Atlas (HCA), Human Tumor Atlas Network (HTAN), and PsychAD studies had previously deposited data and metadata through these platforms, which enabled our analyses:
Addressing the Scientific Question
Our scientific objective was to determine whether large-scale sc-Omics resources are demographically representative and thus appropriate as universal reference frameworks. By benchmarking observed ancestry and sex distributions against global population statistics (for HCA), U.S. cancer incidence from SEER (for HTAN), and ADRD prevalence estimates from various publications (for PsychAD), we identified pervasive imbalances. Across datasets, Europeans were consistently overrepresented, while Asians and Latinos were systematically underrepresented. Sex distributions varied by disease and tissue context. These findings extend prior observations in GWAS and sequencing studies, demonstrating that inequities in representation are embedded within single-cell resources that underpin current biomedical discovery.
Models, Agents, and Technology
All analyses were conducted using reproducible computational workflows in Visual Studio Code with the R programming language, and the codebase is publicly available via GitHub. R packages used included dplyr, tidyr, ggplot2, and car, along with scipy (a Python library), for data processing, statistical testing, and visualization. Representation ratios, chi-square tests, and diversity indices (Shannon, Simpson) were employed to quantify disparities. Equity indices were calculated to capture the magnitude of over- vs. underrepresentation. Data curation pipelines were version-controlled, and missingness was explicitly quantified across demographic fields to assess whether data gaps disproportionately affected underrepresented groups. AI-agent-assisted coding, primarily Copilot calling OpenAI and Anthropic APIs, was used for initial drafts and edits of code, but all outputs were reviewed and validated by the investigators.
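The representation ratio and the two diversity indices named above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the project's R code. The representation ratio is assumed here to be a group's observed share divided by its reference share (1.0 means parity), the function names are hypothetical, and the counts and reference shares are toy values chosen for illustration rather than results from the study.

```python
import math

def representation_ratio(observed_counts, reference_shares):
    """Observed share of each group divided by its reference share (1.0 = parity)."""
    total = sum(observed_counts.values())
    return {
        group: (observed_counts.get(group, 0) / total) / share
        for group, share in reference_shares.items()
    }

def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p), skipping groups with zero counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)

def simpson_index(counts):
    """Simpson diversity 1 - sum(p^2): chance that two random samples differ in group."""
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Hypothetical toy counts and reference shares, for illustration only.
toy_counts = {"A": 70, "B": 20, "C": 10}
toy_reference = {"A": 0.50, "B": 0.25, "C": 0.25}

ratios = representation_ratio(toy_counts, toy_reference)
for group, ratio in ratios.items():
    print(f"{group}: representation ratio {ratio:.2f}")
print(f"Shannon: {shannon_index(toy_counts):.3f}")
print(f"Simpson: {simpson_index(toy_counts):.3f}")
```

A ratio above 1 indicates over-representation relative to the chosen reference and a ratio below 1 indicates under-representation, so the result depends directly on which benchmark population is supplied.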
Conclusions
This secondary analysis shows that structural imbalances in ancestry and sex persist across large-scale single-cell omic datasets. European ancestry is disproportionately represented in all three resources, with Asian and Latino ancestry markedly underrepresented. While sex distribution varied, significant skews were observed in specific tissues and diseases. These biases raise concerns about the generalizability of single-cell reference maps and their downstream use in biomarker discovery and therapeutic development. Our findings highlight the need for intentional recruitment, standardized metadata reporting, and fairness benchmarks to ensure that sc-omics resources evolve into fair and representative references for human biology and precision medicine.
Outcomes and Research Findings
Our project provides the first systematic diversity audit of large-scale single-cell -omic resources, revealing pervasive ancestry and sex imbalances. By analyzing over 13,500 samples from the Human Cell Atlas (HCA), Human Tumor Atlas Network (HTAN), and PsychAD consortium, we demonstrated that European ancestry is consistently overrepresented, while Asian and Latino ancestry are markedly underrepresented across tissues, diseases, and cancer types. Sex distributions varied by context: near parity in PsychAD, strong female skew in HTAN compared to US SEER estimates, and tissue-specific imbalances in HCA. These findings underscore that single-cell omic atlas datasets—often positioned as universal reference frameworks—may not generalize globally and risk embedding inequities into downstream discovery, biomarker validation, and therapeutic development.
Methods and Metadata Considerations
Our secondary analysis relied on harmonizing metadata across heterogeneous sources. For each dataset, we extracted available demographic annotations (ancestry, sex, age, disease status) directly from the metadata files of linked publications, obtained primarily from the GREI repository and/or supplementary information. Missingness was systematically quantified. Disease- and tissue-level stratification ensured biologically meaningful benchmarking against external reference populations (global population shares for HCA, SEER cancer incidence for HTAN, and ADRD prevalence estimates for PsychAD).
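Quantifying missingness per demographic field could look like the sketch below. It is an illustrative Python example under stated assumptions: the set of tokens treated as missing, the field names, and the toy records are all hypothetical and would need to be adapted to the terms actually found in each dataset's metadata.

```python
# Tokens treated as "missing" here are an assumption for illustration.
MISSING_TOKENS = {"", "na", "n/a", "nan", "unknown", "not reported", "none"}

def is_missing(value):
    """True when a metadata value is absent or carries a placeholder token."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS

def missingness_by_field(records, fields):
    """Fraction of records in which each field is missing."""
    n = len(records)
    return {
        field: sum(is_missing(record.get(field)) for record in records) / n
        for field in fields
    }

# Hypothetical toy sample records, not drawn from any real dataset.
toy_records = [
    {"sex": "F", "ancestry": "European"},
    {"sex": "M", "ancestry": ""},
    {"sex": None, "ancestry": "unknown"},
    {"sex": "F"},  # ancestry field absent altogether
]

missing = missingness_by_field(toy_records, ["sex", "ancestry"])
for field, fraction in missing.items():
    print(f"{field}: {fraction:.0%} missing")
```

Counting an absent key the same way as an empty or placeholder value keeps the measure comparable across sources that encode gaps differently.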
Standards, Resources, and Tools
To ensure consistency and transparency, we employed open standards and widely adopted resources:
Replicability and Reproducibility
Replicability was addressed by exclusively using open-source data and tools, ensuring that all datasets and code are accessible for re-analysis. Metadata curation steps—including manual review of supplementary materials and reclassification of ancestry categories—were logged in detail. Standardization pipelines were applied consistently across all datasets. Reproducibility was reinforced through:
Conclusions
The outputs of this research project are both scientific and infrastructural. Scientifically, our findings highlight systemic biases in current single-cell omic resources, demonstrating the urgent need for structural reforms in recruitment, metadata completeness, and equity benchmarking. Infrastructurally, the project delivers harmonized metadata tables, standardized analytic pipelines, and reproducible workflows that can serve as templates for future diversity audits. Collectively, these outcomes advance the field by embedding considerations of population representation into single-cell biology and laying the groundwork for datasets that more accurately reflect global and U.S. populations.
Contributions to Scientific Disciplines
Our research advances the fields of genomics, single-cell biology, and precision medicine by providing a demographic audit of single-cell -omic atlases. While prior critiques have highlighted inequities in genome-wide association studies (GWAS), our work extends to cellular-resolution datasets that are rapidly becoming the foundation for biomarker discovery, drug development, and training of biomedical foundation AI models. By curating and benchmarking data across >13,500 samples from HCA, HTAN, and PsychAD, we establish a fairness framework that can be applied to any sc-Omics resource, contributing to scientific disciplines in three major ways:
Impact on Disease Diagnosis and Treatment
Single-cell atlases are increasingly used as reference frameworks for interpreting patient-derived samples, prioritizing disease biomarkers, and identifying therapeutic targets. However, our analysis demonstrates that these resources disproportionately reflect European ancestry, while Asian and Latino ancestry remain systematically underrepresented. This imbalance poses several risks for human health:
Broader Implications
Our work highlights that structural inequities in representation are embedded in single-cell datasets that underpin next-generation precision medicine. Unless corrected, these biases could perpetuate health disparities in diagnostics and therapeutics for decades. By documenting these imbalances and providing open, reproducible methods, we lay the groundwork for structural reforms in recruitment, metadata standards, and fairness benchmarking. Our findings emphasize that realizing the promise of single-cell biology for human health will require more intentional efforts to ensure that all populations are represented in the foundational resources.
Project Completion and Revisions
The project was completed within the award period, with all planned analyses finalized and integrated into the manuscript, now preprinted and under peer review. During implementation, the scope was refined to focus on three mature resources (HCA, HTAN, and PsychAD), since many other sc-Omics datasets profiled large numbers of cells but too few individual donors. Additionally, rather than using these datasets alone, we incorporated reference populations to distinguish representation issues from expected epidemiologic patterns, such as the higher proportion of women with Alzheimer’s disease.
Data Constraints and Mitigation
Summary
By narrowing scope to the most informative resources, applying a harmonization framework, and explicitly addressing missingness and inconsistent reporting, the project delivered credible and reproducible insights while clarifying structural issues.
Methods for Validation
To ensure quality and completeness, we systematically reviewed metadata from each dataset and cross-checked reported demographic fields against linked publications and supplementary files. Ancestry and sex categories were harmonized into canonical formats, and missingness was explicitly quantified. Statistical validation compared observed distributions to external benchmarks (global population, SEER cancer incidence, ADRD prevalence) to detect unfair representation. All analyses were conducted in reproducible pipelines with version control to ensure integrity.
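Comparing an observed distribution to an external benchmark can be illustrated with a chi-square goodness-of-fit statistic. The sketch below is a plain-Python illustration, not the project's R workflow. The function name, the toy counts, and the benchmark shares are hypothetical, and the benchmark shares are assumed to sum to 1 over the same groups as the observed counts.

```python
def chi_square_gof(observed_counts, expected_shares):
    """Chi-square goodness-of-fit statistic and degrees of freedom.

    Expected counts are the benchmark shares scaled to the observed total.
    """
    total = sum(observed_counts.values())
    statistic = 0.0
    for group, share in expected_shares.items():
        expected = total * share
        observed = observed_counts.get(group, 0)
        statistic += (observed - expected) ** 2 / expected
    dof = len(expected_shares) - 1
    return statistic, dof

# Hypothetical toy counts and benchmark shares, for illustration only.
toy_observed = {"A": 70, "B": 20, "C": 10}
toy_benchmark = {"A": 0.50, "B": 0.25, "C": 0.25}

statistic, dof = chi_square_gof(toy_observed, toy_benchmark)
print(f"chi-square = {statistic:.1f} with {dof} degrees of freedom")
```

The statistic is then referred to a chi-square distribution with the stated degrees of freedom to obtain a p-value; a library routine such as R's `chisq.test` or `scipy.stats.chisquare` would normally supply that step.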
Issues Encountered
The main barrier was incomplete or inconsistent metadata. For example, >70% of HCA samples lacked ancestry annotation, while HTAN and PsychAD showed variable reporting across disease- and site-specific subsets. Terminology was heterogeneous. By documenting missingness and harmonizing across repositories, we preserved analytic rigor while transparently reporting limitations.
Our study confronted several challenges related to the reuse of data from publicly available repositories. First, the linkage between publications and datasets was inconsistent. Unlike PubMed or Google Scholar, which index and search articles systematically, there is no unified index for datasets and their metadata.
Second, links to data were sometimes embedded in data availability statements or supplementary materials, and formats often shifted depending on journal requirements. Significant effort was required to locate and cross-reference the data in the GREI repository or supplementary tables.
Third, metadata reporting was highly uneven. Demographic variables such as ancestry and sex were often missing, reported in inconsistent formats, or described using non-standard terminology. To enable joint analysis, we invested substantial time in manual extraction and in standardizing terms into analyzable categories, highlighting the urgent need for standardized metadata storage and reporting practices.
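Standardizing free-text terms into analyzable categories might be sketched as a lookup table plus a list of unmapped terms flagged for manual review. The Python example below is illustrative only: the synonym table, the category labels, and the function name are hypothetical and do not reproduce the mapping used in this project.

```python
# Hypothetical synonym table; a real audit would build this from the terms
# actually observed in each repository's metadata.
ANCESTRY_SYNONYMS = {
    "european": "European",
    "white": "European",
    "caucasian": "European",
    "asian": "Asian",
    "east asian": "Asian",
    "latino": "Latino",
    "hispanic": "Latino",
    "hispanic or latino": "Latino",
}

def harmonize(raw_values, synonyms, default="Unknown"):
    """Map raw terms to canonical labels and collect unmapped terms for review."""
    harmonized = []
    unmapped = set()
    for raw in raw_values:
        key = "" if raw is None else str(raw).strip().lower()
        if key in synonyms:
            harmonized.append(synonyms[key])
        else:
            harmonized.append(default)
            if key:  # empty or absent values are missing, not unmapped
                unmapped.add(key)
    return harmonized, sorted(unmapped)

# Hypothetical raw terms, for illustration only.
labels, needs_review = harmonize(
    ["Caucasian", " hispanic ", None, "EUR descent"], ANCESTRY_SYNONYMS
)
print(labels)
print(needs_review)
```

Returning the unmapped terms separately keeps the manual review step explicit, so that new spellings are added to the table deliberately instead of being silently counted as missing.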