Submission

DiSCO

introduction

title

Are Populations Fairly Represented in sc-Omics?

Phase 2 Submission Questions

Please complete these questions for Phase 2 participation.

Team Name

DiSCO

Team Member Updates (non-scored criteria)

Team membership remains the same as in the first submission, except that Kavi has obtained her MS degree.

Kuan-lin Huang leads the team as an Associate Professor at Icahn School of Medicine, bringing extensive expertise in genomics, AI, and statistical methods for biomedical applications. His experience spans leading large-scale genomic studies and developing AI-driven approaches to improve human health.

Kavitharini Saravanan holds a Master's degree in bioinformatics from the University of Charlotte. She brings skills in data curation and computational biology, focusing on integrating large datasets.

Aryan Saharan, an undergraduate student at St. Louis University, contributes experience in biomedical literature search and statistical analysis, supporting data preprocessing.

Catrina Yang, a medical student at the University of Oxford, brings a clinical perspective, ensuring the project aligns with medical relevance and facilitates the translation of research findings into health applications.

GREI Repository Datasets

Mendeley Data
Zenodo

Repository dataset DOIs

doi: 10.5281/zenodo.14261274

doi: 10.5281/zenodo.4739739

doi: 10.5281/zenodo.13241021

doi: 10.5281/zenodo.10515792

doi: 10.5281/zenodo.13635676

doi: 10.5281/zenodo.12752107

doi: 10.17632/pcftzv8w63.1

doi: 10.17632/vs8m5gkyfn.1

doi: 10.17632/f2v94hj7jm.1

doi: 10.5281/zenodo.7113422

doi: 10.5281/zenodo.1476122

doi: 10.17632/gncg57p5x9.2

doi: 10.5281/zenodo.4013713

doi: 10.5281/zenodo.3572422

doi: 10.5281/zenodo.8076402

doi: 10.5281/zenodo.6282659

doi: 10.5281/zenodo.7469683

doi: 10.17632/dvp4y5ttd5.1

doi: 10.17632/m4rfg9wb29.1

doi: 10.5281/zenodo.7561827

doi: 10.5281/zenodo.13863809

doi: 10.17632/7g5cftcbpv.1

doi: 10.5281/zenodo.3483177

doi: 10.5281/zenodo.4271105

(some not listed due to space)

Overview/Abstract

Single-cell omic atlases are transforming biology and medicine, yet their demographic representativeness has not been systematically evaluated. We analyzed >13,500 samples from the Human Cell Atlas (HCA), Human Tumor Atlas Network (HTAN), and PsychAD Consortium. We found striking, pervasive over-representation of European individuals and under-representation of Asian and Latino individuals. Nearly 70% of HCA samples lacked ancestry annotation; PsychAD was almost two-thirds European, with Asians under-represented; and HTAN tumors were 69% European, with several cancer types also showing higher female fractions than expected from sex-specific US incidence. These disparities highlight that current single-cell resources risk embedding inequities into AI foundation models, biomarker discovery, and therapeutic development. We provide a practical checklist for research teams to produce single-cell datasets with fair representation and complete metadata to maximize translational value.
(Preprinted/in Review in Nat. Gen)

Secondary Analysis

We performed a secondary analysis to systematically evaluate demographic representation across large-scale single-cell resources, with the overarching goal of assessing whether these flagship datasets reflect the diversity of the populations they are intended to represent.

Inclusion of GREI Repository Data

Data harmonized and curated within the GREI repositories served as enabling infrastructure for this work. In particular, most of the Human Cell Atlas (HCA), Human Tumor Atlas Network (HTAN), and PsychAD studies had previously deposited data and metadata through these platforms, which enabled our analyses:

Human Cell Atlas (HCA): We assembled 10,125 single-cell samples across 15 tissue categories, including immune (n = 3,704), gut (n = 883), skin (n = 906), lung (n = 743), and kidney (n = 601). Metadata were obtained through the HCA portal's links to associated publications, where metadata are provided in supplementary information and/or GREI repositories (primarily Zenodo).
Human Tumor Atlas Network (HTAN): We analyzed 1,959 tumor samples spanning 15 cancer types. Metadata files were curated and standardized to unify tumor origin and racial/ethnic reporting.
PsychAD Consortium: This dataset included 1,494 samples from three contributing cohorts (HBCC, MSBB, RADC), spanning Alzheimer’s disease and related dementia (ADRD), schizophrenia, dementia with Lewy bodies, vascular dementia, bipolar disorder, tauopathies, Parkinson’s disease, and frontotemporal dementia.

Addressing the Scientific Question

Our scientific objective was to determine whether large-scale sc-Omics resources are demographically representative and thus appropriate as universal reference frameworks. By benchmarking observed ancestry and sex distributions against global population statistics (for HCA), U.S. cancer incidence from SEER (for HTAN), and ADRD prevalence estimates from various publications (for PsychAD), we identified pervasive imbalances. Across datasets, Europeans were consistently overrepresented, while Asians and Latinos were systematically underrepresented. Sex distributions varied by disease and tissue context. These findings extend prior observations in GWAS and sequencing studies, demonstrating that inequities in representation are embedded within single-cell resources that underpin current biomedical discovery.

Models, Agents, and Technology

All analyses were conducted using reproducible computational workflows in Visual Studio Code with the R programming language; the codebase is publicly available via GitHub. The R packages dplyr, tidyr, ggplot2, and car, along with scipy (a Python library), were used for data processing, statistical testing, and visualization. Representation ratios, chi-square tests, and diversity indices (Shannon, Simpson) were employed to quantify disparities. Equity indices were calculated to capture the magnitude of over- vs. underrepresentation. Data curation pipelines were version-controlled, and missingness was explicitly quantified across demographic fields to assess whether data gaps disproportionately affected underrepresented groups. AI-agent-assisted coding, primarily Copilot, which calls OpenAI and Anthropic APIs, was used for initial drafts and edits of code, but all outputs were reviewed and validated by the investigators.
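As an illustration of the representation ratio and chi-square comparison described above, the following is a minimal Python sketch, not the project's R code. The function names, the toy counts, and the uniform reference shares are hypothetical, and it assumes the observed counts are restricted to the groups present in the reference distribution.

```python
def representation_ratios(observed_counts, reference_shares):
    """Observed share of each group divided by its reference share.

    A ratio above 1 suggests over-representation, below 1 under-representation.
    Assumes observed_counts only contains groups listed in reference_shares.
    """
    total = sum(observed_counts.values())
    return {
        group: (observed_counts.get(group, 0) / total) / share
        for group, share in reference_shares.items()
    }


def chi_square_statistic(observed_counts, reference_shares):
    """Pearson goodness-of-fit statistic against the reference distribution."""
    total = sum(observed_counts.values())
    stat = 0.0
    for group, share in reference_shares.items():
        expected = total * share
        observed = observed_counts.get(group, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical toy data, for illustration only (not values from the study).
observed = {"European": 60, "African": 20, "Asian": 10, "Latino": 10}
reference = {"European": 0.25, "African": 0.25, "Asian": 0.25, "Latino": 0.25}

ratios = representation_ratios(observed, reference)
stat = chi_square_statistic(observed, reference)
print(ratios)
print(stat)
```

The statistic would then be compared against a chi-square distribution with (number of groups minus 1) degrees of freedom using a statistics library; that step is omitted here to keep the sketch dependency-free.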

Conclusions

This secondary analysis shows that structural imbalances in ancestry and sex persist across large-scale single-cell omic datasets. European ancestry is disproportionately represented in all three resources, with Asian and Latino ancestry markedly underrepresented. While sex distribution varied, significant skews were observed in specific tissues and diseases. These biases raise concerns about the generalizability of single-cell reference maps and their downstream use in biomarker discovery and therapeutic development. Our findings highlight the need for intentional recruitment, standardized metadata reporting, and fairness benchmarks to ensure that sc-omics resources evolve into fair and representative references for human biology and precision medicine.

Outcomes and Outputs

Outcomes and Research Findings

Our project provides the first systematic diversity audit of large-scale single-cell omic resources, revealing pervasive ancestry and sex imbalances. By analyzing over 13,500 samples from the Human Cell Atlas (HCA), Human Tumor Atlas Network (HTAN), and PsychAD Consortium, we demonstrated that European ancestry is consistently overrepresented, while Asian and Latino ancestry are markedly underrepresented across tissues, diseases, and cancer types. Sex distributions varied by context: near parity in PsychAD, strong female skew in HTAN compared to US SEER estimates, and tissue-specific imbalances in HCA. These findings underscore that single-cell omic atlas datasets, often positioned as universal reference frameworks, may not generalize globally and risk embedding inequities into downstream discovery, biomarker validation, and therapeutic development.

Methods and Metadata Considerations

Our secondary analysis relied on harmonizing metadata across heterogeneous sources. For each dataset, we extracted available demographic annotations (ancestry, sex, age, disease status) directly from the metadata files of linked publications, primarily from GREI repositories and/or supplementary information. Missingness was systematically quantified. Disease- and tissue-level stratification ensured biologically meaningful benchmarking against external reference populations (global population shares for HCA, SEER cancer incidence for HTAN, and ADRD prevalence estimates for PsychAD).
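Quantifying missingness per demographic field could look roughly like the Python sketch below. It is illustrative only: the set of placeholder strings treated as missing, the field names, and the toy records are all assumptions, not the project's actual conventions.

```python
# Assumed placeholder strings that count as "missing" (hypothetical list).
MISSING_TOKENS = {"", "na", "n/a", "unknown", "not reported"}


def is_missing(value):
    """True if the value is absent or one of the assumed placeholder tokens."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def missingness_by_field(records, fields):
    """Fraction of records with no usable value, per demographic field."""
    n = len(records)
    return {
        field: sum(is_missing(r.get(field)) for r in records) / n
        for field in fields
    }


# Hypothetical toy metadata records, for illustration only.
records = [
    {"sample": "S1", "ancestry": "European", "sex": "F"},
    {"sample": "S2", "ancestry": "NA", "sex": "M"},
    {"sample": "S3", "ancestry": None, "sex": "unknown"},
    {"sample": "S4", "sex": "F"},  # ancestry field absent entirely
]

rates = missingness_by_field(records, ["ancestry", "sex"])
print(rates)
```

Treating absent keys, nulls, and placeholder strings uniformly keeps the missingness rate comparable across sources that encode gaps differently.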

Standards, Resources, and Tools

To ensure consistency and transparency, we employed open standards and widely adopted resources:

Data Standards: Canonical ancestry and sex categories aligned with NIH and SEER reporting frameworks; external benchmarks (United Nations, SEER, ADRD prevalence studies) provided reference distributions.
Resources: Publicly available datasets from HCA, HTAN, and PsychAD, as curated and harmonized through the GREI repository and supplementary data. Reference demographic data were drawn from the United Nations (global), SEER Explorer (cancer incidence), and literature for ADRD prevalence.
Tools: Analyses were conducted in R using dplyr, tidyr, ggplot2, and car, along with scipy (a Python library). Diversity metrics (Shannon, Simpson), equity indices, and representation ratios were applied to quantify imbalance.
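For readers unfamiliar with the diversity metrics named above, here is a small Python sketch of the Shannon and Simpson indices computed from group counts. The text does not specify which Simpson variant was used, so the Gini-Simpson form (1 minus the sum of squared proportions) is an assumption, as are the function names and toy counts.

```python
import math


def shannon_index(counts):
    """Shannon entropy H = -sum(p * ln p) over groups with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson index 1 - sum(p^2).

    Interpretable as the probability that two samples drawn with
    replacement belong to different groups (assumed variant).
    """
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical toy counts across four groups, for illustration only.
even = [25, 25, 25, 25]
skewed = [70, 10, 10, 10]

print(shannon_index(even), simpson_index(even))
print(shannon_index(skewed), simpson_index(skewed))
```

Both indices are highest when samples are spread evenly across groups and fall as one group dominates, which is why they complement per-group representation ratios.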

Replicability and Reproducibility

Replicability was addressed by exclusively using open-source data and tools, ensuring that all datasets and code are accessible for re-analysis. Metadata curation steps—including manual review of supplementary materials and reclassification of ancestry categories—were logged in detail. Standardization pipelines were applied consistently across all datasets. Reproducibility was reinforced through:

Version Control and Statistical Transparency: All scripts were maintained in an open-source GitHub repository, with detailed documentation of data transformations. All statistical analyses were performed with documented code and validated.
Open Data Access: Our curated metadata tables (both after manual curation and after computational standardization) are deposited in a Zenodo repository.

Conclusions

The outputs of this research project are both scientific and infrastructural. Scientifically, our findings highlight systemic biases in current single-cell omic resources, demonstrating the urgent need for structural reforms in recruitment, metadata completeness, and equity benchmarking. Infrastructurally, the project delivers harmonized metadata tables, standardized analytic pipelines, and reproducible workflows that can serve as templates for future diversity audits. Collectively, these outcomes advance the field by embedding considerations of population representation into single-cell biology and laying the groundwork for datasets that more accurately reflect global and U.S. populations.

Impact

Contributions to Scientific Disciplines

Our research advances the fields of genomics, single-cell biology, and precision medicine by providing a demographic audit of single-cell omic atlases. While prior critiques have highlighted inequities in genome-wide association studies (GWAS), our work extends to cellular-resolution datasets that are rapidly becoming the foundation for biomarker discovery, drug development, and training of biomedical foundation AI models. By curating and benchmarking data across >13,500 samples from HCA, HTAN, and PsychAD, we establish a fairness framework that can be applied to any sc-Omics resource, contributing to scientific disciplines in three major ways:

Representation Benchmarking: We provide a reproducible methodology to measure ancestry and sex representation, identify missing metadata, and quantify underrepresentation relative to population benchmarks.
Metadata Standards: By harmonizing reporting systems into canonical categories, our work highlights the importance of metadata completeness and consistency for cross-study comparability.
Foundation for Future Research: The tools generated here offer a template for embedding representation assessments into future studies, enabling investigators to proactively design more representative atlases.

Impact on Disease Diagnosis and Treatment

Single-cell atlases are increasingly used as reference frameworks for interpreting patient-derived samples, prioritizing disease biomarkers, and identifying therapeutic targets. However, our analysis demonstrates that these resources disproportionately reflect European ancestry, while Asian and Latino ancestry remain systematically underrepresented. This imbalance poses several risks for human health:

Diagnosis: If diagnostic classifiers or cell-type reference maps are trained on biased datasets, they may underperform in populations that are underrepresented, leading to misclassification or reduced sensitivity.
Treatment: Biased single-cell resources may obscure ancestry-associated differences in gene expression, cellular response, and disease biology, limiting the discovery of therapeutic targets relevant to different populations. This risks reinforcing disparities in drug development, where treatments are optimized for certain populations but less effective in others.

Broader Implications

Our work highlights that structural inequities in representation are embedded in single-cell datasets that underpin next-generation precision medicine. Unless corrected, these biases could perpetuate health disparities in diagnostics and therapeutics for decades. By documenting these imbalances and providing open, reproducible methods, we lay the groundwork for structural reforms in recruitment, metadata standards, and fairness benchmarking. Our findings emphasize that realizing the promise of single-cell biology for human health will require more intentional efforts to ensure that all populations are represented in the foundational resources.

Considerations

Project Completion and Revisions

The project was completed within the award period, with all planned analyses finalized and integrated into the manuscript, now preprinted and under peer review. During implementation, the scope was refined to focus on three mature resources (HCA, HTAN, and PsychAD), since many other sc-Omics datasets profiled large numbers of cells but too few individual donors. Additionally, rather than using these datasets alone, we incorporated reference populations to distinguish representation issues from expected epidemiologic patterns, such as the higher proportion of women with Alzheimer’s disease.

Data Constraints and Mitigation

Metadata gaps were widespread. To address this, we systematically reviewed linked publications and supplementary tables, recovering demographic details where possible. Benchmarking analyses were then restricted to samples with available data.
Inconsistent reporting standards complicated integration. We standardized labels into canonical categories (European, African, Asian, Latino, Other, Unknown).
Sample size disparities limited subgroup analyses for some ancestries and disease strata; in such cases, results were reported descriptively.
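The label standardization step mentioned above can be sketched in Python as a lookup into the six canonical categories. The synonym table below is entirely hypothetical and is not the mapping used in the project; routing unrecognized non-empty labels to "Other" is likewise an assumed design choice, and in practice such labels would warrant manual review.

```python
CANONICAL = ("European", "African", "Asian", "Latino", "Other", "Unknown")

# Hypothetical synonym table for illustration; not the project's actual mapping.
SYNONYMS = {
    "european": "European",
    "white": "European",
    "caucasian": "European",
    "african": "African",
    "black": "African",
    "african american": "African",
    "asian": "Asian",
    "east asian": "Asian",
    "south asian": "Asian",
    "latino": "Latino",
    "hispanic": "Latino",
    "hispanic or latino": "Latino",
}

# Assumed placeholder strings that indicate no usable annotation.
MISSING = {"", "na", "n/a", "unknown", "not reported"}


def harmonize(label):
    """Map a free-text ancestry label to one of the canonical categories."""
    if label is None:
        return "Unknown"
    # Normalize case, hyphens, and whitespace before lookup.
    key = " ".join(str(label).lower().replace("-", " ").split())
    if key in MISSING:
        return "Unknown"
    # Unrecognized labels fall back to "Other" (assumption; flag for review).
    return SYNONYMS.get(key, "Other")


print([harmonize(x) for x in ["Caucasian", " african-american ", None, "Pacific Islander"]])
```

Keeping the mapping in a single explicit table makes every reclassification auditable, which matches the logging of curation steps described elsewhere in this submission.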

Summary

By narrowing scope to the most informative resources, applying a harmonization framework, and explicitly addressing missingness and inconsistent reporting, the project delivered credible and reproducible insights while clarifying structural issues.

Research Quality

Methods for Validation
To ensure quality and completeness, we systematically reviewed metadata from each dataset and cross-checked reported demographic fields against linked publications and supplementary files. Ancestry and sex categories were harmonized into canonical formats, and missingness was explicitly quantified. Statistical validation compared observed distributions to external benchmarks (global population, SEER cancer incidence, ADRD prevalence) to detect imbalances in representation. All analyses were conducted in reproducible pipelines with version control to ensure integrity.

Issues Encountered
The main barrier was incomplete or inconsistent metadata. For example, >70% of HCA samples lacked ancestry annotation, while HTAN and PsychAD showed variable reporting across disease- and site-specific subsets. Terminology was heterogeneous. By documenting missingness and harmonizing across repositories, we preserved analytic rigor while transparently reporting limitations.

Research Challenges (non-scored criteria)

Our study confronted several challenges related to the reuse of data from publicly available repositories. First, the linkage between publications and datasets was inconsistent. Unlike articles, which PubMed and Google Scholar index systematically, datasets and their metadata have no unified index.

Second, links to data were sometimes embedded in data availability statements or supplementary materials, and formats often shifted depending on journal requirements. Significant effort was required to locate and cross-reference the data in the GREI repository or supplementary tables.

Third, metadata reporting was highly uneven. Demographic variables such as ancestry and sex were often missing, reported in inconsistent formats, or described using non-standard terminology. To enable joint analysis, we invested substantial time in manually extracting terms and standardizing them into analyzable categories, highlighting the urgent need for standardized metadata storage and reporting practices.

Supporting Documents

Provide resources for the evaluation of your secondary research project including but not limited to (up to 10 URLs):
● Publicly available outputs of the secondary analysis including results, methods, conclusions and relevant metadata
● The persistent identifier of the datasets used and generated
● Standards, tools, and metadata associated with the implementation outputs
● Articles, preprints, or scientific publications on the project
● Other relevant related resources

Supporting Document (1)

https://www.biorxiv.org/content/10.1101/2025.10.01.677375v1

Supporting Document (2)

https://github.com/Huang-lab/scOmics_equity

Supporting Document (3)

https://zenodo.org/records/17161565
