Submission

AA's team

introduction

title

INSITE: Integrating scRNAseq for Tcell Engineering

Phase 2 Submission Questions

Please complete these questions for Phase 2 participation.

Team Name

AA's team

GREI Repository Datasets

Figshare
Zenodo

Repository dataset DOIs

scGPT processed compendium (Wang, 2024; DOI: 10.6084/m9.figshare.24954519.v1): harmonized multi-study single-cell dataset for training and benchmarking.
Skin scRNA-seq data (Tiryaki, 2023; DOI: 10.6084/m9.figshare.24079362.v1): human skin samples from individual donors, generated for optimized tissue dissociation and immune profiling.
CRISPRa/CRISPRi perturbation data in human T cells (Zenodo; DOI: 10.5281/zenodo.5784651): Perturb-seq in stimulated primary T cells.
Pan-cancer T cell atlas (Zenodo; DOI: 10.5281/zenodo.5461803): single-cell RNA-seq of tumor-infiltrating T cells across cancers.
PerturBase (Wei et al., 2025; NAR; DOI: 10.1093/nar/gkae858): curated single-cell perturbation database (122 datasets, ~5 M cells).

Overview/Abstract

We demonstrate how publicly shared single-cell RNA-seq datasets can be reused to predict gene perturbation outcomes without new experiments. Focusing on human T cells, we addressed key reuse challenges: incomplete metadata, unclear provenance, and heterogeneous repository formats. We defined environments by combining tissue, donor, and perturbation metadata, then curated and quality-controlled datasets while preserving biological variation across environments. Leveraging these natural differences, we applied Anchor Regression to learn causal gene regulatory networks, which enable in-silico knockout predictions with quantified uncertainty and optimization of multi-gene combinations for cell engineering. We release all processed data, analysis code, and an interactive web app. Our framework shows that systematic integration of existing data can yield mechanistically interpretable models that provide cost-efficient hypothesis generation for therapeutic design.

Secondary Analysis

Overview

Why this matters

Predicting gene-perturbation outcomes from existing data expands biological insight without new experiments. Exploring all combinatorial perturbations in the lab is infeasible; computational screening helps by (1) mapping causal interactions among genes to reveal mechanistic biology and (2) identifying interventions that shift transcriptional states toward desired outcomes, guiding therapeutic strategies.

Why this is a good fit for data reuse

When a relationship between two genes stays consistent across datasets and conditions—what we define as environments—it is likely causal rather than correlational. Each environment (tissue, donor, microenvironment) introduces natural variation that acts as a shift in gene expression. Capturing these shifts allows us to distinguish genuine biological mechanisms from context-specific noise. Testing across many environments strengthens the inference that invariant gene-gene links represent causal regulatory edges.
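
As a rough illustration of the invariance idea, the sketch below fits a simple linear slope between two genes separately in each environment and reports how much the slope varies. The function names, the toy expression values, and the environment labels are assumptions made for illustration; this is not the submission's actual test for invariance.

```python
import numpy as np

def per_environment_slopes(x, y, env):
    """Fit y ~ a + b*x separately in each environment and return {env: b}."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    env = np.asarray(env)
    slopes = {}
    for e in np.unique(env):
        mask = env == e
        # np.polyfit with degree 1 returns [slope, intercept]
        slopes[str(e)] = float(np.polyfit(x[mask], y[mask], 1)[0])
    return slopes

def slope_spread(slopes):
    """Range of the per-environment slopes; near zero suggests a stable link."""
    values = list(slopes.values())
    return max(values) - min(values)

# Toy data (hypothetical): expression of a regulator x in two environments.
x = np.array([0.0, 1.0, 2.0, 5.0, 6.0, 7.0])
env = np.array(["healthy"] * 3 + ["lung_tumor"] * 3)
y_stable = 2.0 * x + 1.0                                   # same slope everywhere
y_unstable = np.where(env == "healthy", 2.0 * x, -1.0 * x)  # slope flips by environment

print(slope_spread(per_environment_slopes(x, y_stable, env)))
print(slope_spread(per_environment_slopes(x, y_unstable, env)))
```

A small spread is only suggestive: a stable slope across a few environments does not by itself establish causality, which is why testing across many environments matters.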

What we did

We reused public single-cell RNA-seq datasets of human CD4 and CD8 T cells from GREI repositories (Zenodo and Figshare). Because cell types and environments were inconsistently labeled, we standardized environment definitions by combining perturbation, donor, and tissue metadata into explicit labels (e.g., lung tumor, healthy, or CRISPR knockout [KO]). For each record we generated one h5ad file per distinct environment.
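
The labeling step can be sketched as follows. This is a minimal illustration under stated assumptions: the metadata keys (`tissue`, `donor`, `perturbation`, `cell_id`), the `unknown` placeholder, and the `|` separator are hypothetical choices, not the schema used in the released pipeline.

```python
from collections import defaultdict

def environment_label(meta):
    """Combine tissue, donor, and perturbation fields into one explicit label."""
    parts = []
    for key in ("tissue", "donor", "perturbation"):
        # Missing or empty fields become "unknown" rather than being guessed.
        value = str(meta.get(key) or "unknown").strip().lower().replace(" ", "_")
        parts.append(value)
    return "|".join(parts)

def partition_by_environment(cells):
    """Group cell identifiers by their environment label."""
    groups = defaultdict(list)
    for cell in cells:
        groups[environment_label(cell)].append(cell["cell_id"])
    return dict(groups)

# Hypothetical per-cell metadata records.
cells = [
    {"cell_id": "c1", "tissue": "Lung Tumor", "donor": "D1", "perturbation": None},
    {"cell_id": "c2", "tissue": "Lung Tumor", "donor": "D1", "perturbation": None},
    {"cell_id": "c3", "tissue": "Blood", "donor": "D2", "perturbation": "CRISPR KO"},
]
for label, ids in partition_by_environment(cells).items():
    print(label, ids)
```

With AnnData objects, each group could then be written to its own file by subsetting on the label column and calling `write_h5ad`, which is one way to obtain one h5ad file per environment.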

We curated metadata, performed QC, removed technical artifacts, and retained biological differences across environments, treating them as natural shift interventions to learn stable gene-gene relationships. Deliverables include a data model, a Python library implementing the pipeline, and a static website showcasing the three core functions detailed below.

Three core functions

Explore. Provides harmonized embeddings (e.g., UMAP) and differential-expression analysis between any two environments for interactive comparison.

Predict. Unlike black-box deep-learning methods for KO prediction, we construct an interpretable gene regulatory network (GRN) represented as a structural causal model and learned through Anchor Regression. This approach estimates per-gene relationships while penalizing residual variation linked to environmental factors. Leveraging causal-inference principles (do-calculus), we conduct in-silico perturbations to predict mean and distributional changes across genes, enabling statistical testing (p-values and FDR) for KO versus baseline conditions.
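
To make the Anchor Regression step concrete, here is a minimal sketch for a single target gene with environment labels as one-hot anchors. It uses the standard data-transformation form of the estimator, W = I - (1 - sqrt(gamma)) P_A, followed by ordinary least squares. The function names, the value of `gamma`, and the simulated data are assumptions; the sparsity penalties mentioned elsewhere in this submission are deliberately left out, so this is not the released implementation.

```python
import numpy as np

def anchor_transform(M, env, gamma):
    """Apply W = I - (1 - sqrt(gamma)) * P_A to the rows of M.

    With one-hot environment anchors, P_A replaces each row by the mean of
    its environment, so the transform shrinks environment means.
    """
    M = np.asarray(M, dtype=float)
    env = np.asarray(env)
    out = M.copy()
    shrink = 1.0 - np.sqrt(gamma)
    for e in np.unique(env):
        mask = env == e
        out[mask] -= shrink * M[mask].mean(axis=0)
    return out

def anchor_regression(X, y, env, gamma=5.0):
    """Least squares on anchor-transformed data (no sparsity penalty here).

    gamma = 1 recovers OLS; gamma = 0 partials out the environments;
    larger gamma penalizes residual variation explained by the environments.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)  # global centering stands in for an intercept
    yc = y - y.mean()
    Xt = anchor_transform(Xc, env, gamma)
    yt = anchor_transform(yc[:, None], env, gamma)[:, 0]
    coef, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
    return coef

# Simulated example (hypothetical): three environments shift the regulator x.
rng = np.random.default_rng(0)
env = np.repeat(np.arange(3), 200)
x = np.array([0.0, 2.0, -1.0])[env] + rng.normal(size=env.size)
y = 1.5 * x + rng.normal(scale=0.5, size=env.size)
print(anchor_regression(x[:, None], y, env, gamma=5.0))
```

Running one such regression per gene, with all other panel genes as predictors, yields one row of a candidate GRN weight matrix.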

Optimize. Given a target transcriptional profile and a KO budget, the tool searches for multi-gene combinations predicted to move the cell toward the goal, illustrating applications to rational cell-engineering design.
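
One simple way to search under a KO budget is greedy forward selection, sketched below. The search strategy, the squared-distance objective, the gene names `A`, `B`, `C`, and the additive toy predictor are all assumptions for illustration; the submission does not state which set-function optimizer the tool uses.

```python
def squared_distance(profile, target):
    """Sum of squared differences between a predicted and a target profile."""
    return sum((p - t) ** 2 for p, t in zip(profile, target))

def greedy_knockouts(candidates, predict, target, budget):
    """Add one KO at a time, keeping the gene that most reduces the distance.

    `predict` maps a list of knocked-out genes to a predicted expression
    profile. The search stops at the budget or when no KO improves the fit.
    """
    chosen = []
    best = squared_distance(predict(chosen), target)
    for _ in range(budget):
        scored = [
            (squared_distance(predict(chosen + [g]), target), g)
            for g in candidates
            if g not in chosen
        ]
        if not scored:
            break
        score, gene = min(scored)
        if score >= best:
            break
        chosen.append(gene)
        best = score
    return chosen, best

# Toy predictor (hypothetical): each KO adds a fixed effect to a baseline.
BASELINE = [1.0, 1.0, 1.0]
EFFECTS = {"A": [-1.0, 0.0, 0.0], "B": [0.0, -0.5, 0.0], "C": [0.5, 0.0, 0.0]}

def toy_predict(knockouts):
    profile = list(BASELINE)
    for gene in knockouts:
        profile = [p + e for p, e in zip(profile, EFFECTS[gene])]
    return profile

selected, distance = greedy_knockouts(list(EFFECTS), toy_predict, [0.0, 0.5, 1.0], budget=2)
print(selected, distance)
```

Greedy selection is cheap but not guaranteed to find the best combination; in practice `predict` would be replaced by the KO simulation from the learned network.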

Results and conclusions

We identify environment-invariant regulatory edges that enable in silico perturbation prediction with quantified uncertainty and support gene-wise differential-expression testing for KO versus baseline across environments.

Existing single-cell data, when systematically organized across well-defined environments, can be reused to infer GRNs that generalize to unseen perturbations—providing a cost-efficient and mechanistically interpretable framework for hypothesis generation.

Questions

How GREI data were included

All single-cell datasets used were sourced from Zenodo and Figshare. Every record was processed independently to compute normalized, residualized expression matrices; environments were defined from metadata, and records were partitioned accordingly. We also incorporated PerturBase T-cell datasets (13 datasets, ~200k cells). Our processed dataset is publicly released here: https://zenodo.org/records/17283912.

Scientific questions and approach

Can we learn T-cell GRNs from pre-existing data and predict KO effects with uncertainty?
Approach: Treat environments as shift interventions and apply Anchor Regression to identify stable gene-gene relationships. Resulting networks yield per-environment KO predictions with mean shifts and confidence intervals reflecting immune circuitry.

Can we identify effective combinatorial KO for cell engineering via in-silico causal models?
Approach: Using the learned causal network and set function optimization, we find feasible multi-gene KOs that move expression profiles toward user-specified targets, demonstrating a pathway from causal inference to translational design.

Models, agents, and technology used

Data model: unified JSON schema describing generated records and environments.
Languages: Python 3 (Scanpy & anndata), R (Seurat) accessed via rpy2.
Infrastructure: GitHub + Zenodo DOIs for all code + data releases; reproducible Python environments.
Interface: HTML/CSS/JS web application that runs as a static website (https://insiteproject.bio/) or from a local Python server.
Agent: Claude Code assisted with templating and documentation; outputs are human-reviewed, version-controlled, and released under permissive open-source licenses.

Outcomes and Outputs

Outcomes

Environment-invariant gene regulatory networks (GRNs) for CD4 and CD8 T cells, learned on a user-selectable panel of top-N highly variable genes (HVGs) (e.g., 50–1000), after extracting T-cell subsets and deriving per-environment files from heterogeneous GREI datasets.
In-silico knockout (KO) predictions per environment with both mean and distributional outputs, enabling p-values, FDR, and Differential Expression (DE) gene calls for KO vs baseline.
Multi-gene optimization: given a target profile and a KO budget, the tool proposes combinations of KOs that best shift the cell state toward the target.
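
The FDR step behind the DE gene calls listed above can be illustrated with the Benjamini-Hochberg adjustment. This is a sketch only: the choice of Benjamini-Hochberg is an assumption, since the submission names FDR control but not the specific procedure, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene p-values for KO versus baseline.
pvals = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(pvals)
de_calls = [q <= 0.05 for q in adj]
print(adj, de_calls)
```

Genes whose adjusted value falls below the chosen threshold (0.05 in this toy example) would be reported as DE for that KO and environment.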

Outputs

Public GitHub repository (https://github.com/insiteproject/insite) containing the data model, the complete pipeline for data unification, QC, and causal model construction, plus a web application implementing the explore, predict, and optimize functions.
Public Zenodo repository (https://zenodo.org/records/17283912) with processed datasets: normalized expression matrices per environment, standardized metadata, and persistent DOI for long-term access.
Model outputs (GitHub): inferred GRN matrices (CD4/CD8 T cells), per-environment expression moments, with machine-readable metadata and version control.
Interactive outputs (website): dynamically computed KO predictions (for a 100-gene panel), differential expression tables, and volcano plots for user-specified perturbations.

Methods, metadata considerations, conclusions

Preprocessing: technical-only normalization via negative-binomial GLM with library-size offsets and QC covariates to compute Pearson residuals as latent features. This mirrors the sctransform/Seurat approach conceptually, implemented in the Scanpy/anndata ecosystem for Python. We do not harmonize environments; environment differences are the signal we exploit in causal modeling.
GRN learning: Anchor Regression per gene with environment labels as anchors, using sparsity penalties and a spectral-radius stability guard so that the learned cyclic causal graph is stable. Edges reflect relations that are consistent across environments.
KO simulation: remove incoming influence to the KO target, clamp target activity, propagate effects to obtain per-environment predicted means and uncertainty.
Conclusions: observational, multi-environment data—once carefully curated to surface cell types and environments—can yield robust GRNs that generalize to perturbational behavior, enabling hypothesis generation and prioritization without new wet-lab data.
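
The KO simulation and stability guard described above can be sketched for a linear cyclic model x = Bx + mu + noise, whose equilibrium mean is (I - B)^(-1) mu when the spectral radius of B is below 1. The convention that row i of B holds the incoming edges of gene i, the rescaling used as a guard, the 0.95 threshold, and the toy network are assumptions for illustration and may differ from the released code.

```python
import numpy as np

def spectral_radius(B):
    """Largest absolute eigenvalue of the weight matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(B))))

def stabilize(B, max_radius=0.95):
    """Rescale B if needed so the cyclic system has a well-defined equilibrium."""
    B = np.array(B, dtype=float)
    rho = spectral_radius(B)
    return B if rho <= max_radius else B * (max_radius / rho)

def equilibrium_mean(B, mu):
    """Mean expression at equilibrium: solve (I - B) x = mu."""
    B = np.array(B, dtype=float)
    mu = np.array(mu, dtype=float)
    return np.linalg.solve(np.eye(len(mu)) - B, mu)

def knockout_mean(B, mu, target, clamp=0.0):
    """Predicted means after knocking out gene `target`.

    Incoming edges of the target (its row of B) are removed, its activity is
    clamped, and the effect is propagated through the remaining network.
    """
    B_ko = np.array(B, dtype=float)
    mu_ko = np.array(mu, dtype=float)
    B_ko[target, :] = 0.0
    mu_ko[target] = clamp
    return equilibrium_mean(B_ko, mu_ko)

# Toy three-gene network with a feedback loop (hypothetical weights).
B = np.array([[0.0, 0.0, 0.2],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
mu = np.array([2.0, 1.0, 0.0])  # per-environment baseline input
print(equilibrium_mean(B, mu))
print(knockout_mean(B, mu, target=0))
```

Under the same linear assumptions, uncertainty could be propagated by pushing the noise covariance through (I - B)^(-1), or by Monte Carlo sampling of the noise.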

Standards, resources, and tools

Standards & Best Practices:

FAIR principles: All datasets and code releases include persistent identifiers (DOIs), comprehensive README documentation, JSON schema specifications, data dictionaries, and full provenance tracking.
Open licensing: Processed data released under CC-BY 4.0 license; all software code under permissive MIT license for maximum reusability.

Data Standards & Formats:

Unified data model: JSON schema standardizing single-cell expression data, metadata, and environment definitions across heterogeneous datasets.
Metadata harmonization: Standardized ontologies for cell types, tissues, and experimental conditions across GREI repositories.

Computational Tools & Infrastructure:

Core languages: Python 3.11 with scientific computing stack (Scanpy, AnnData, NumPy, Pandas); R (Seurat) integration via rpy2.
Specialized methods: Sparse lasso solvers for causal inference, Anchor Regression for environment-invariant learning.
Web technologies: HTML/CSS/JavaScript application supporting both static deployment and local Python server for offline use.

Replicability and reproducibility:

Deterministic pipeline with fixed seeds, pinned dependencies (venv), and exact configs.
Frozen PID manifest listing all used datasets.
Version-controlled workflows on GitHub + Zenodo DOIs for persistent data and code releases.

Impact

Knockouts are a foundational way scientists learn how cells work—by removing a gene and observing what breaks, what compensates, and what rewires. In T cells, the stakes are especially high. We care not only about single-gene effects but also about combinations that can redirect fate, amplify desirable programs, or release cells from exhaustion. Exhaustion is a dysfunctional state that blunts T-cell responses during chronic infection and cancer; lifting it is central to modern immunotherapy. Our work shows that large, heterogeneous single-cell datasets—already in the public domain—can be reused as natural experiments to infer environment-robust gene–gene relationships and to predict the consequences of knockouts before going back to the bench.

Scientifically, the project contributes a practical template for data reuse at scale in immunology: extract cell types and environments buried across studies, retain biological context rather than harmonizing it away, and learn environment-invariant regulatory controls. From those learned regulatory rules, we compute both average and distributional KO outcomes, which yields not just point predictions but also uncertainty and differential expression (DE) evidence. This moves beyond static atlases toward a predictive atlas—one that can prioritize perturbations and combinations under a KO budget to approximate desired transcriptional states.

For human health, the immediate impact is on treatment development. In oncology, our in-silico KO screening can help nominate targets and multi-target combinations to push tumor-infiltrating T cells away from exhaustion and toward durable effector or memory-like programs—conceptually synergistic with checkpoint blockade and adoptive cell therapies (such as engineered T cells). In infectious disease, the same approach can propose perturbations that enhance antiviral or antibacterial responses without tipping cells into exhaustion, informing adjuvant and regimen design. In autoimmunity, it provides a way to explore suppressive combinations that might selectively dampen pathogenic T-cell programs while preserving host defense. For vaccines, KO-guided hypotheses can highlight regulatory axes that bias toward effective memory formation.

Because our framework quantifies uncertainty and reports DE genes per environment, it supports comparative evaluation across tissues, donors, and disease contexts—an important step toward translational relevance. While our primary influence is on treatment strategy (target nomination and combination design), the same network-level readouts can surface candidate biomarkers of response or resistance (impact on diagnosis) and suggest directions for rational adjuvanting (contributing to prevention).

The broader impact is a shift in practice: using public single-cell resources not only to describe cell states, but to forecast the effect of perturbing them—accelerating hypothesis generation and focusing scarce experimental effort where it most matters.

Considerations

Completion within award period
Yes. We delivered per-environment CD4/CD8 files, GRN learning with Anchor Regression, in-silico KO (mean and distribution), DE testing, and public releases of code, data artifacts, and docs.

Scope changes

Minor contraction. Planned analyses of additional T-cell subtypes (for example, memory T cells and Tregs) were dropped due to insufficient, confidently annotated cells across GREI deposits.
Method pivot. We evaluated environment-based causal discovery methods (for example, ICP-style and related moment methods) but found that they did not scale to our gene panels, so we selected Anchor Regression for robustness and speed.
Minor expansion. We added distributional KO predictions and parametric/Monte Carlo DE testing after recognizing that predicting only average expression after a KO was insufficient for decision support of end users, e.g., immunologists.

Constraints and mitigations

Metadata heterogeneity. Many deposits lacked consistent tissue/donor labels or explicit CD4/CD8 flags. We curated a controlled schema and produced derived per-environment files after subsetting to T cells.
Cell-type discoverability. CD4/CD8 T cells were often embedded within multi-cell-type studies. We retained only cells with clear subtype annotations and excluded ambiguous populations.
Technical variability. We corrected only technical factors via NB-GLM (library-size offsets, QC covariates) to compute Pearson residuals, preserving environment differences needed for invariance.
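
For intuition about the Pearson-residual step, the sketch below uses the analytic negative-binomial form with a library-size offset only: the expected count is (cell total x gene total) / grand total, and the residual is scaled by the NB standard deviation. This is a simplification and an assumption: it omits the QC covariates mentioned above, fixes the overdispersion `theta` instead of fitting a GLM, and clips at sqrt(number of cells) as a common heuristic.

```python
import numpy as np

def nb_pearson_residuals(counts, theta=100.0):
    """Analytic NB Pearson residuals for a cells x genes count matrix.

    Assumes every cell and every gene has at least one count, so that all
    expected values are strictly positive.
    """
    X = np.asarray(counts, dtype=float)
    total = X.sum()
    # Expected counts under a library-size-only model.
    mu = np.outer(X.sum(axis=1), X.sum(axis=0)) / total
    resid = (X - mu) / np.sqrt(mu + mu ** 2 / theta)
    # Clip extreme residuals so single cells cannot dominate.
    clip = np.sqrt(X.shape[0])
    return np.clip(resid, -clip, clip)

# Toy counts (hypothetical): 3 cells x 2 genes.
counts = np.array([[4, 1],
                   [2, 5],
                   [9, 3]])
print(nb_pearson_residuals(counts))
```

Because only depth is modeled, differences between environments remain in the residuals, which is the property the causal modeling relies on.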

Research Quality

a) Methods to validate quality & completeness

Manual curation & triage: read deposit pages, metadata, and papers; retain only datasets with traceable provenance and clear documentation.
Provenance verification: cross-check identifiers against public records; prefer datasets referenced in peer-reviewed articles.
Schema compatibility: require clean mapping to our unified schema; exclude nonconforming deposits.
QC metrics: compute standard single-cell QC (mito fraction, library size, detected genes) using Scanpy; apply empirically tuned thresholds.
Human-in-the-loop: combine automated checks with expert review to catch outliers, mislabeled samples, and incomplete annotations.
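
The QC metrics named above can be computed directly from a count matrix, as in this NumPy sketch. In Scanpy, `sc.pp.calculate_qc_metrics` reports the same quantities. The `MT-` prefix for mitochondrial genes and the threshold values are assumptions for illustration, not the empirically tuned thresholds used in the pipeline.

```python
import numpy as np

def qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Per-cell library size, detected genes, and mitochondrial fraction."""
    counts = np.asarray(counts, dtype=float)
    is_mito = np.array([g.upper().startswith(mito_prefix) for g in gene_names])
    library_size = counts.sum(axis=1)
    detected_genes = (counts > 0).sum(axis=1)
    mito_counts = counts[:, is_mito].sum(axis=1)
    # Empty cells get a mitochondrial fraction of 0 instead of a division error.
    mito_fraction = np.divide(
        mito_counts, library_size,
        out=np.zeros_like(library_size), where=library_size > 0,
    )
    return library_size, detected_genes, mito_fraction

def qc_pass(counts, gene_names, min_counts=50, min_genes=2, max_mito=0.2):
    """Boolean mask of cells passing all three (hypothetical) thresholds."""
    size, genes, mito = qc_metrics(counts, gene_names)
    return (size >= min_counts) & (genes >= min_genes) & (mito <= max_mito)

# Toy example: 3 cells x 3 genes.
genes = ["MT-CO1", "CD3E", "CD8A"]
counts = np.array([[10, 0, 90],
                   [5, 5, 0],
                   [0, 0, 0]])
print(qc_metrics(counts, genes))
print(qc_pass(counts, genes))
```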

b) Issues encountered

Inconsistent metadata and ambiguous labels required reviewing the corresponding publication for verification.
Aggregated records lacked documentation of component sources, requiring manual provenance tracing.
Gene nomenclature heterogeneity: reconciled at least three conventions into a unified reference.
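
A reconciliation step of this kind might look like the sketch below, which maps versioned Ensembl IDs, case variants, and aliases to one approved symbol. Which three conventions were actually encountered is not stated here, so these three are assumptions, as are the function name and the tiny lookup tables; a real pipeline would load an authoritative reference table instead.

```python
import re

# Toy lookup tables for illustration only (human genes).
ENSEMBL_TO_SYMBOL = {"ENSG00000010610": "CD4"}
ALIAS_TO_SYMBOL = {"PD1": "PDCD1", "PD-1": "PDCD1"}
APPROVED_SYMBOLS = {"CD4", "PDCD1"}

def unify_gene_name(name):
    """Map an identifier to an approved symbol, or None if it is unknown."""
    token = name.strip().upper()
    # Convention 1: Ensembl gene IDs, optionally with a version suffix.
    ensembl_id = re.sub(r"\.\d+$", "", token)
    if ensembl_id in ENSEMBL_TO_SYMBOL:
        return ENSEMBL_TO_SYMBOL[ensembl_id]
    # Convention 2: approved symbols, possibly in a different letter case.
    if token in APPROVED_SYMBOLS:
        return token
    # Convention 3: legacy aliases.
    if token in ALIAS_TO_SYMBOL:
        return ALIAS_TO_SYMBOL[token]
    # Unknown names are flagged for manual review instead of being guessed.
    return None

for raw in ["ENSG00000010610.12", "cd4", "PD-1", "not_a_gene"]:
    print(raw, unify_gene_name(raw))
```

Returning `None` for unmapped names keeps ambiguous identifiers visible for the human-in-the-loop review described above.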

Research Challenges (non scored criteria)

Reusing heterogeneous research data presented several challenges that we systematically addressed. Many datasets lacked consistent metadata, documentation, or standardized formats, making it difficult to integrate them into a unified framework. To overcome this, we developed semi-automated curation tools combined with manual triage following human-in-the-loop guidelines to assess data completeness, provenance, and compatibility. Another major challenge was reconciling differing gene identifiers, annotation systems, and naming conventions across repositories; this was mitigated by implementing mapping scripts and cross-checking against authoritative reference databases.

On the analytical side, variability in preprocessing pipelines, sequencing depth, and quality metrics required a common preprocessing procedure before comparative analysis. We applied technical-only normalization (library-size offsets and QC covariates, leaving environment differences intact), coupled with robust quality-control metrics from established frameworks such as Scanpy, to ensure consistency.

Supporting Documents

Provide resources for the evaluation of your secondary research project including but not limited to (up to 10 URLs):
● Publicly available outputs of the secondary analysis including results, methods, conclusions and relevant metadata
● The persistent identifier of the datasets used and generated
● Standards, tools, and metadata associated with the implementation outputs
● Articles, preprints, or scientific publications on the project
● Other relevant related resources

Supporting Document (1)

https://insiteproject.bio/

Supporting Document (2)

https://github.com/insiteproject/insite

Supporting Document (3)

https://zenodo.org/records/17283912

Supporting Document (4)

https://zenodo.org/records/5461803

Supporting Document (5)

https://figshare.com/articles/dataset/Processed_datasets_used_in_the_scGPT_foundation_model/24954519?file=43939560

Supporting Document (6)

https://zenodo.org/records/5784651