Submission

introduction

title

Crowdsourcing Spatial Analytics

short description

Transparent and participatory real-time re-analysis and micro-publication of novel findings for spatial cell biology datasets

Phase 1 Submission Form

Overview / Abstract

Most cancer patients do not benefit from immunotherapy (IO) drugs, which are groundbreaking yet expensive (more than $200K per patient per year). A number of IO studies across different cancer types have been published in search of spatial biomarkers that help identify patients likely or unlikely to benefit from IO. We will re-analyze these studies using their published datasets to find novel spatial biomarkers and to run validation queries across multiple studies.

Current GREI datasets are published in static formats that are not "re-analysis friendly". We will reformat several (>5) human cancer digital pathology datasets into a standardized, consistent data model. We will then set up a LIVE public-facing application layer that allows interested individuals, with proper attribution, to contribute novel spatial biomarker discoveries, spatial metrics, visualizations, and statistical validation.

Secondary Analysis: Research Aims

We will re-analyze several datasets published on Zenodo (a GREI repository), focusing on immunotherapy outcomes in a variety of cancer sites, including breast, brain, lung, and melanoma, where digital pathology captured multiple imaging channels using, for example, multiplexed immunofluorescence or imaging mass cytometry (IMC). These include both smaller cohorts with whole-slide images (WSI) and larger cohorts with tissue microarrays (TMAs), ranging from a few hundred thousand to several million quantified cells and from a few dozen to a few hundred samples per study. We also plan to perform cross-study analyses that identify treatment-agnostic immune-related biomarkers, using both IO and non-IO GREI datasets.

We will restructure source GREI datasets into a normalized target data model that is semantically annotated and controlled by standard ontologies, and comprehensively covers all phases of the investigation process from data collection and raw measurements to analysis. This includes fine-grained measurements such as marker quantification and geometry resolved at the single-cell level. The model allows close alignment between the data format and analysis/application implementation, by providing a schema appropriate to standard database management systems.
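As a rough illustration of what "a schema appropriate to standard database management systems" could look like, the following is a minimal sketch using SQLite. All table names, column names, and inserted values here are hypothetical placeholders chosen for illustration; they are not the actual data model, and the DOI shown is a dummy value rather than one of the listed datasets.

```python
import sqlite3

# Hypothetical, heavily simplified schema: one row per study, specimen,
# cell, and per-cell marker measurement. Ontology terms are stored as text.
SCHEMA = """
CREATE TABLE study (
    study_id INTEGER PRIMARY KEY,
    source_doi TEXT NOT NULL UNIQUE
);
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    study_id INTEGER NOT NULL REFERENCES study(study_id),
    anatomical_site_term TEXT
);
CREATE TABLE cell (
    cell_id INTEGER PRIMARY KEY,
    specimen_id INTEGER NOT NULL REFERENCES specimen(specimen_id),
    centroid_x REAL NOT NULL,
    centroid_y REAL NOT NULL
);
CREATE TABLE marker_quantification (
    cell_id INTEGER NOT NULL REFERENCES cell(cell_id),
    marker_term TEXT NOT NULL,
    intensity REAL NOT NULL,
    PRIMARY KEY (cell_id, marker_term)
);
"""


def build_example_database():
    """Create an in-memory database populated with dummy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO study VALUES (1, '10.0000/placeholder')")
    conn.execute("INSERT INTO specimen VALUES (1, 1, 'placeholder-site-term')")
    conn.executemany(
        "INSERT INTO cell VALUES (?, ?, ?, ?)",
        [(1, 1, 12.5, 40.0), (2, 1, 13.0, 41.5)],
    )
    conn.executemany(
        "INSERT INTO marker_quantification VALUES (?, ?, ?)",
        [(1, "CD8", 0.9), (2, "CD8", 0.1)],
    )
    conn.commit()
    return conn


def cells_per_study(conn):
    """Count single-cell records per source DOI across all specimens."""
    rows = conn.execute(
        """
        SELECT study.source_doi, COUNT(cell.cell_id)
        FROM study
        JOIN specimen ON specimen.study_id = study.study_id
        JOIN cell ON cell.specimen_id = specimen.specimen_id
        GROUP BY study.source_doi
        """
    )
    return rows.fetchall()
```

Because every study is loaded into the same tables, a cross-study query such as `cells_per_study(build_example_database())` needs no per-dataset parsing logic, which is the kind of alignment between data format and analysis code described above.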

Several metrics will be computed, including cell type "profiles" that summarize population sizes and their ratios, spatially enriched statistics that assess, for example, co-occurrence of cytotoxic T cells with PD-L1-positive tumor cells or clustering of B cells in possible Tertiary Lymphoid Structures (TLS), and data-driven metrics derived from Graph Neural Networks. The data processing model will uniformly support parallel processing for efficient, on-demand computation. Exploratory data analysis in a completely no-code environment will be guided by statistical testing, and significant results (for example, a spatial metric identifying patients most likely to benefit from IO) will be accessible as a formal record that enables reproducibility.
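To make the first two kinds of metric concrete, here is a minimal sketch in plain Python. It assumes cells are given as `(x, y, phenotype)` tuples; the phenotype labels, the function names, and the simple "fraction of source cells with a target cell within a radius" measure are illustrative assumptions, not the specific statistics that will be used in the project.

```python
from collections import Counter
from math import dist


def phenotype_fractions(cells):
    """Return the fraction of cells carrying each phenotype label."""
    counts = Counter(label for _, _, label in cells)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def proximity_fraction(cells, source, target, radius):
    """Fraction of `source` cells with at least one `target` cell within `radius`.

    A brute-force pairwise check, adequate for a small illustration only.
    """
    sources = [(x, y) for x, y, label in cells if label == source]
    targets = [(x, y) for x, y, label in cells if label == target]
    if not sources:
        return 0.0
    near = sum(
        1 for s in sources if any(dist(s, t) <= radius for t in targets)
    )
    return near / len(sources)


# Dummy coordinates in arbitrary units, for illustration only.
example_cells = [
    (0.0, 0.0, "CD8+ T cell"),
    (1.0, 0.0, "PD-L1+ tumor"),
    (10.0, 10.0, "CD8+ T cell"),
    (50.0, 50.0, "PD-L1+ tumor"),
]
```

With these dummy cells, `phenotype_fractions(example_cells)` gives each phenotype a fraction of 0.5, and `proximity_fraction(example_cells, "CD8+ T cell", "PD-L1+ tumor", 2.0)` is 0.5, since only one of the two T cells has a tumor cell within 2 units. Per-sample values like these could then be compared between outcome groups with a statistical test.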

Timeline:

Data curated in the standardized data model: (1 month - Jan 2025)

LIVE data analysis website: (2 months - Feb-Mar 2025)

Secondary analysis findings: (2 months - Apr-May 2025)

GREI Repository Data Sets

Zenodo (CERN and Northwestern University)

DOI (Digital Object Identifier) of GREI Repository Dataset

10.5281/zenodo.5903190
10.5281/zenodo.6004986
10.5281/zenodo.4300912
10.5281/zenodo.10258578
10.5281/zenodo.7990870
10.5281/zenodo.10659311
10.5281/zenodo.7637988
10.5281/zenodo.7760826
10.5281/zenodo.7796393
10.5281/zenodo.4607374
10.5281/zenodo.7884599
10.5281/zenodo.5719187
10.5281/zenodo.7961844

Outcomes and Outputs

Our project is designed from inception to support the recording and sharing of specific analysis findings using resolvable URLs as identifiers. Though we will not maintain a registry of persistent identifiers, we will maintain exact references to source datasets using DOIs and versioned source code of application components which are used in the (live) reproduction of specific findings.
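One way such a record could be represented is sketched below. The field names, the fingerprinting scheme, and the `https://example.org/finding` base URL are all hypothetical assumptions made for illustration; they do not describe the actual URL format of the application.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class FindingRecord:
    """Hypothetical record tying a finding to its exact inputs."""

    source_doi: str      # DOI of the source dataset
    code_version: str    # version tag of the analysis code
    metric: str          # name of the computed metric
    parameters: tuple    # sorted (name, value) pairs

    def fingerprint(self):
        """Deterministic short hash of the record contents."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

    def url(self, base="https://example.org/finding"):
        """Build a URL that carries the dataset, code version, and hash."""
        query = urlencode(
            {
                "doi": self.source_doi,
                "version": self.code_version,
                "id": self.fingerprint(),
            }
        )
        return f"{base}?{query}"


# Dummy values for illustration only.
record = FindingRecord(
    source_doi="10.0000/placeholder",
    code_version="v0.0.0",
    metric="proximity_fraction",
    parameters=(("radius", 2.0),),
)
```

Since the fingerprint depends only on the record contents, the same dataset, code version, metric, and parameters always map to the same URL, while a change to any of them yields a different one.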

Findable. Our data curation process comprehensively aggregates study metadata like publication records, ORCID researcher identifiers, ROR organization identifiers, reagent or antibody manufacturer identifiers, and ontology term identifiers for measurement apparatus or data modalities.

Accessible. Our public-facing tools will collect no telemetry and will not require authentication or authorization, except when users wish to submit results and receive attribution via their ORCID iD.

Interoperable. The detailed data model is designed for the explicit purpose of allowing "cross-cutting" analyses and integration of different data sources, and interoperability with any software component that supports the model's interface.

Reusable. For maximum replicability, we will make it easy for investigators and interested individuals to completely recreate the software stack and processing pipeline from download and preprocessing to modeling and analysis.

Impact / Scientific Significance

Our findings can help further the understanding of fundamental cell biology, including cancer biology, normal tissue biology, and general functional states in immunology. Since some of the datasets focus on longitudinal tumor evolution from the primary to the metastatic state, we will be able to provide insights into the general mechanisms behind this evolution, with the aim of preventing this transition.

We should be able to find novel target cell types (macrophages, neutrophils, fibroblasts, etc.) that identify biomarker candidates for specific therapies such as IO. This includes signs of disruption in coordination among the immune cells themselves, for example when antigen-experienced cells (that would normally provide co-stimulatory signals to effector cells) lack the appropriate functional markers or are present at the wrong site. Other mechanisms that we may detect include direct failure of activation of cytotoxic T cells or macrophages, or T cell exhaustion that could be a viable candidate for therapies intended to restore their function.

Because multiple markers can indicate different tumor clones or lineages within the same mass, and because the data has cell-level spatial resolution, we will be able to assess possible configurations of tumor tissue that impede leukocyte access and effective immune surveillance, even when the patient's immune system itself is functioning normally. For example, in preliminary analysis we have observed possible instances of an MHCI-negative tumor layer or clone that physically shields the MHCI-positive tumor cells from presumably efficacious cytotoxic T cells at the tumor interface. Finer analysis of the tissue architecture and immune infiltrate may also yield clear indicators of coordination between tumor clones with respect to other disease functions, such as inducing angiogenesis or promoting proliferation, providing some guidance in prioritizing treatment for one clone over another.

Team

Dr. Saad Nadeem and Dr. James C. Mathews work together at Memorial Sloan Kettering Cancer Center and have been collaborating for the past 8 years. Dr. Saad Nadeem is a computer scientist and applied mathematician with expertise in medical image analysis. Dr. James C. Mathews is a computational biologist and mathematician with a strong background in statistical analysis.

Considerations

We have already created an initial data model and a preliminary version of the LIVE data analysis website with some of the datasets listed above. Given our strong collaboration, technical skills, and this preliminary groundwork, we will be able to execute the proposed research project within the timeline listed above.

Supporting Documents

Provide up to 10 resources for the evaluation of your secondary research project including but not limited to:
● The persistent identifier of the dataset(s), other than GREI dataset DOIs already listed above, to be used in the proposed project (where available)
● Tools/workflows or resources to be utilized in the proposed project
● Relevant references or scientific publications that directly relate to the proposed project

Supporting Document (1)

https://github.com/nadeemlab/SPT

Supporting Document (2)

https://oncopathtk.org

Supporting Document (3)

https://adiframework.com

Non Scored Criteria

Please complete this information. It will not be scored by the evaluation panel.

Entity Participation

Participate as an independent Team (i.e., registering as a group of individuals competing together but not on behalf of an established organization, institution, or corporation)

Research Discipline (non-scored criteria)

Medical image analysis, computational biology, computer science, user interface design, medical visualization.

IDeA State (non-scored criteria)

All Team Member Information - Name, Organization, Job Title, and Email address

Team leader: Saad Nadeem, Memorial Sloan Kettering Cancer Center, Assistant Attending, nadeems@mskcc.org

Team member: James C. Mathews, Memorial Sloan Kettering Cancer Center, Senior Research Scientist, mathewj2@mskcc.org

MSI (non-scored criteria)

Participation in prior DataWorks! Prizes (non-scored criteria)

Team Point of Contact Eligibility

yes

Eligibility (non-scored criteria)

Yes, I confirm that I have read and meet the terms of eligibility for this challenge
