Most cancer patients do not benefit from immunotherapy (IO) drugs, which are groundbreaking yet expensive (more than $200K per patient per year). Several IO studies have been published across different cancer types with the aim of finding spatial biomarkers that identify patients likely or unlikely to benefit from IO. We will reanalyze these studies, using their published datasets, to find novel spatial biomarkers and to run validation queries across multiple studies.
Datasets currently held in Generalist Repository Ecosystem Initiative (GREI) repositories are published in static formats that are not "re-analysis friendly". We will re-format several (more than five) human cancer digital pathology datasets into a standardized, consistent data model. We will then set up a LIVE public-facing application layer that allows interested individuals to contribute, with proper attribution, novel spatial biomarker discoveries, spatial metrics, visualizations, and statistical validation.
We will re-analyze several datasets published on Zenodo (a GREI repository), focusing on immunotherapy outcomes in a variety of cancer sites, including breast, brain, lung, and melanoma, in which digital pathology captured multiple imaging channels using, for example, multiplexed immunofluorescence or imaging mass cytometry (IMC). The datasets include both smaller cohorts with whole-slide images (WSI) and larger cohorts with tissue microarrays (TMAs). Each study contains from a few hundred thousand to several million quantified cells and from a few dozen to a few hundred samples. We also plan to perform cross-study analyses that identify treatment-agnostic immune-related biomarkers, using both IO and non-IO GREI datasets.
We will restructure source GREI datasets into a normalized target data model that is semantically annotated and controlled by standard ontologies, and comprehensively covers all phases of the investigation process from data collection and raw measurements to analysis. This includes fine-grained measurements such as marker quantification and geometry resolved at the single-cell level. The model allows close alignment between the data format and analysis/application implementation, by providing a schema appropriate to standard database management systems.
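To make the idea of a normalized, database-ready schema concrete, the following is a minimal sketch using Python's built-in sqlite3 module. All table and column names here (study, specimen, cell, marker, marker_quantification, ontology_term, and so on) are hypothetical illustrations and are not the project's actual data model; the DOI value is a placeholder. The sketch only shows how single-cell geometry and marker quantification could sit in relational tables that a standard database management system can query directly.

```python
import sqlite3

# Hypothetical schema for illustration only; the real data model may differ.
SCHEMA = """
CREATE TABLE study (
    study_id   INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    source_doi TEXT NOT NULL
);
CREATE TABLE specimen (
    specimen_id     INTEGER PRIMARY KEY,
    study_id        INTEGER NOT NULL REFERENCES study(study_id),
    anatomical_site TEXT,
    outcome         TEXT
);
CREATE TABLE cell (
    cell_id     INTEGER PRIMARY KEY,
    specimen_id INTEGER NOT NULL REFERENCES specimen(specimen_id),
    centroid_x  REAL NOT NULL,
    centroid_y  REAL NOT NULL
);
CREATE TABLE marker (
    marker_id     INTEGER PRIMARY KEY,
    symbol        TEXT NOT NULL UNIQUE,
    ontology_term TEXT  -- would hold an ontology term identifier
);
CREATE TABLE marker_quantification (
    cell_id   INTEGER NOT NULL REFERENCES cell(cell_id),
    marker_id INTEGER NOT NULL REFERENCES marker(marker_id),
    intensity REAL NOT NULL,
    PRIMARY KEY (cell_id, marker_id)
);
"""


def build_demo_database():
    """Create an in-memory database with a tiny, made-up example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO study VALUES (1, 'Demo study', 'doi-placeholder')")
    conn.execute("INSERT INTO specimen VALUES (1, 1, 'lung', 'responder')")
    conn.executemany(
        "INSERT INTO cell VALUES (?, ?, ?, ?)",
        [(1, 1, 10.0, 12.0), (2, 1, 11.5, 12.5)],
    )
    conn.executemany(
        "INSERT INTO marker VALUES (?, ?, ?)",
        [(1, "CD8", None), (2, "PDL1", None)],
    )
    conn.executemany(
        "INSERT INTO marker_quantification VALUES (?, ?, ?)",
        [(1, 1, 0.9), (1, 2, 0.1), (2, 1, 0.05), (2, 2, 0.8)],
    )
    conn.commit()
    return conn


def mean_intensity_by_marker(conn, specimen_id):
    """Average intensity of each marker over all cells of one specimen."""
    rows = conn.execute(
        """
        SELECT m.symbol, AVG(q.intensity)
        FROM marker_quantification AS q
        JOIN marker AS m ON m.marker_id = q.marker_id
        JOIN cell AS c ON c.cell_id = q.cell_id
        WHERE c.specimen_id = ?
        GROUP BY m.symbol
        ORDER BY m.symbol
        """,
        (specimen_id,),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    demo = build_demo_database()
    print(mean_intensity_by_marker(demo, specimen_id=1))
```

Because every measurement is keyed by cell and marker, the same query shape would apply unchanged to any study loaded into these tables, which is the kind of alignment between data format and analysis code that the model aims for.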
Several metrics will be computed, including cell type "profiles" that summarize population sizes and their ratios; spatially enriched statistics assessing, for example, co-occurrence of cytotoxic T cells with PDL1-positive tumor cells or clustering of B cells in possible Tertiary Lymphoid Structures (TLS); and data-driven metrics derived from Graph Neural Networks. The data processing model will uniformly support parallel processing for efficient, on-demand computation. Exploratory data analysis in a completely no-code environment will be guided by statistical testing, and significant results (for example, a spatial metric identifying patients most likely to benefit from IO) will be accessible as a formal record that enables reproducibility.
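As a rough illustration of two of the metric families mentioned above, the following sketch computes a cell type profile (population fractions) and a simple proximity statistic: the fraction of cells of one type that have at least one cell of another type within a given radius. This is a simplified stand-in and not the project's actual metric definitions; the function names, the cell record layout (type, x, y), the type labels, and the radius are all assumptions made for the example.

```python
import math
from collections import Counter


def cell_type_profile(cells):
    """Return the fraction of cells belonging to each cell type."""
    counts = Counter(cell["type"] for cell in cells)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


def fraction_with_neighbor(cells, source_type, target_type, radius):
    """Fraction of source-type cells with a target-type cell within radius.

    Brute-force pairwise distances; fine for a demo, too slow for
    millions of cells, where a spatial index would be needed.
    """
    sources = [c for c in cells if c["type"] == source_type]
    targets = [c for c in cells if c["type"] == target_type]
    if not sources:
        return 0.0
    hits = sum(
        1
        for s in sources
        if any(
            math.dist((s["x"], s["y"]), (t["x"], t["y"])) <= radius
            for t in targets
        )
    )
    return hits / len(sources)


# Made-up coordinates for illustration (arbitrary units).
demo_cells = [
    {"type": "CD8_T", "x": 0.0, "y": 0.0},
    {"type": "CD8_T", "x": 100.0, "y": 100.0},
    {"type": "PDL1_tumor", "x": 3.0, "y": 4.0},
    {"type": "PDL1_tumor", "x": 50.0, "y": 50.0},
]

if __name__ == "__main__":
    print(cell_type_profile(demo_cells))
    print(fraction_with_neighbor(demo_cells, "CD8_T", "PDL1_tumor", radius=10.0))
```

Since each sample can be scored independently of the others, a per-sample function of this kind is the sort of unit that could be distributed across workers for on-demand computation.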
Timeline:
Data curated in the standardized data model: 1 month (Jan 2025)
LIVE data analysis website: 2 months (Feb-Mar 2025)
Secondary analysis findings: 2 months (Apr-May 2025)
Our project is designed from inception to support the recording and sharing of specific analysis findings using resolvable URLs as identifiers. Though we will not maintain a registry of persistent identifiers, we will maintain exact references to source datasets, using DOIs, and to the versioned source code of the application components that are used in the (live) reproduction of specific findings.
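One way such a finding record could be represented is sketched below. This is only an illustration under stated assumptions: the Finding class, its fields, the content-derived identifier scheme, and the base URL https://example.org/findings/ are all hypothetical and are not taken from the project. The sketch shows how a record that pins the source dataset DOI, the code version, and the analysis parameters can yield a stable URL without a separate identifier registry.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical record of one analysis finding."""

    dataset_doi: str   # DOI of the source dataset (placeholder in the demo)
    code_version: str  # version tag or commit of the application code
    metric: str        # name of the computed metric
    parameters: dict   # parameters needed to recompute the metric


def finding_identifier(finding):
    """Derive a short identifier from the canonical JSON form of the record."""
    canonical = json.dumps(asdict(finding), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def finding_url(finding, base_url="https://example.org/findings/"):
    """Build a URL for the finding; base_url is a placeholder."""
    return base_url + finding_identifier(finding)


if __name__ == "__main__":
    example = Finding(
        dataset_doi="10.0000/placeholder",
        code_version="v0.1.0",
        metric="proximity",
        parameters={"source": "CD8_T", "target": "PDL1_tumor", "radius": 10},
    )
    print(finding_url(example))
```

Because the identifier is computed from the record's content, two records that agree on dataset, code version, metric, and parameters map to the same URL, while a change to any of them produces a different one.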
Findable. Our data curation process comprehensively aggregates study metadata such as publication records, ORCID researcher identifiers, ROR organization identifiers, reagent or antibody manufacturer identifiers, and ontology term identifiers for measurement apparatus or data modalities.
Accessible. Our public-facing tools will collect no telemetry and will not require authorization or authentication, except when users wish to submit results and receive attribution using their ORCID iD.
Interoperable. The detailed data model is designed for the explicit purpose of allowing "cross-cutting" analyses and integration of different data sources, and interoperability with any software component that supports the model's interface.
Reusable. For maximum replicability, we will make it easy for investigators and interested individuals to completely recreate the software stack and processing pipeline from download and preprocessing to modeling and analysis.
Our findings can help further the understanding of fundamental cell biology, including cancer biology, normal tissue biology, and general functional states in immunology. Since some of the datasets focus on longitudinal tumor evolution from the primary to the metastatic state, we will be able to provide insights into the general mechanisms behind this evolution, with the aim of preventing this transition.
We should be able to find novel target cell types (macrophages, neutrophils, fibroblasts, etc.) that identify biomarker candidates for specific therapies such as IO. This includes signs of disruption in coordination among the immune cells themselves, for example when antigen-experienced cells (that would normally provide co-stimulatory signals to effector cells) lack the appropriate functional markers or are present at the wrong site. Other mechanisms that we may detect include direct failure of activation of cytotoxic T cells or macrophages, or T cell exhaustion that could be a viable candidate for therapies intended to restore their function.
Because multiple markers can indicate different tumor clones or lineages within the same mass, and because the data have cell-level spatial resolution, we will be able to assess possible configurations of tumor tissue that impede leukocyte access and effective immune surveillance, even when the patient's immune system itself is functioning normally. For example, in preliminary analysis we have observed possible instances of an MHCI-negative tumor layer or clone that physically shields the MHCI-positive tumor cells from presumably efficacious cytotoxic T cells at the tumor interface. Finer analysis of the tissue architecture and immune infiltrate may also yield clear indicators of coordination between tumor clones with respect to other disease functions, such as inducing angiogenesis or promoting proliferation, providing some guidance in prioritizing treatment for one clone over another.
Dr. Saad Nadeem and Dr. James C. Mathews work together at Memorial Sloan Kettering Cancer Center and have been collaborating for the past 8 years. Dr. Saad Nadeem is a computer scientist and applied mathematician with expertise in medical image analysis. Dr. James C. Mathews is a computational biologist and mathematician with a strong background in statistical analysis.
We have completed preliminary work on the data model and built a preliminary version of the LIVE data analysis website that includes some of the datasets listed above. Given our strong collaboration, technical skills, and this preliminary groundwork, we will be able to successfully execute the proposed research project within the timeline listed above.