Submission

introduction
title
Uncover causal cancer markers across populations
short description
Identify causal multi-omic biomarkers for cancer risk across diverse populations by integrating multi-omics and GWAS data
Phase 1 Submission Form
Overview / Abstract

The incidence and mortality rates of many cancers vary significantly across racial and ethnic groups, yet identifying causal biomarkers remains challenging. This project aims to uncover putative causal genetic, epigenetic, and protein biomarkers for multiple cancer types by integrating the largest available multi-omics and GWAS datasets across diverse populations. By leveraging genetic instruments to estimate the genetically regulated components of gene expression, DNA methylation, and protein levels, we will mitigate common limitations of traditional observational studies. Our approach combines state-of-the-art methods, including multi-population meta-analysis, cross-cancer analysis, Mendelian Randomization, and machine learning. This comprehensive strategy will generate crucial insights into cancer etiology across populations, enable precision risk prediction, inform targeted prevention strategies, and help reduce health disparities.

Secondary Analysis: Research Aims

Cancer remains a leading cause of mortality worldwide, with incidence and mortality rates exhibiting significant variability across different racial and ethnic populations. Despite extensive research, identifying causally relevant biomarkers remains challenging, particularly for non-European populations. Traditional observational studies are often limited by selection bias and confounding.

To address these gaps, we propose a comprehensive multi-omics study to identify putative causal biomarkers for cancer risk across diverse populations. We will integrate multiple data resources, many of which are deposited in GREI repositories and were contributed by us:

  1. Gene expression prediction models: Derived from GTEx for 49 tissues, and population-specific prediction models of different blood cell types across European, African, Hispanic/Latino, and Asian populations derived from MESA and eQTLGen; 
  2. Methylation prediction models: Built using data from Illumina EPIC (~850K CpGs), 450K array (~450K CpGs), and WGBS (>25M CpGs) across four populations;
  3. Protein prediction models: Derived from INTERVAL (3K proteins), MESA (1.3K proteins), UK Biobank (2.8K proteins), ARIC (4.4K proteins), and deCODE (4.4K proteins) studies across four populations;
  4. GWAS data: Obtained from consortia (e.g., PRACTICAL, OCAC, PanScan, PanC4, and GECCO) for multiple cancer types, including prostate, ovarian, pancreatic, and colorectal cancers. Sample sizes by cancer type:
     • Prostate cancer: ~19,391 cases and 61,608 controls (African), 122,188 cases and 604,640 controls (European), 10,809 cases and 95,790 controls (Asian), and 3,931 cases and 26,405 controls (Hispanic/Latino);
     • Ovarian cancer: ~22,406 cases and 40,941 controls in OCAC, and 3,887 cases and 27,561 controls in CIMBA;
     • Pancreatic cancer: 28,000 cases and 681,691 controls (European), 7,576 cases and 63,034 controls (Asian), and 1,262 cases and 11,707 controls (African);
     • Colorectal cancer: 78,473 cases and controls (European), and 21,731 cases and 47,444 controls (Asian).

Our approach employs transcriptome-wide, methylome-wide, and proteome-wide association studies (TWAS, MWAS, and PWAS) to predict gene expression, DNA methylation, and protein levels using genetic instruments, and then to assess associations of the predicted marker levels with cancer outcomes. Because the predicted levels depend only on germline genetic variants, this strategy mitigates selection bias and unmeasured confounding. To enhance robustness and generalizability, we will implement: 1) Multi-population meta-analysis for population-specific and pan-population biomarkers; 2) Cross-cancer analysis for common biomarkers; and 3) Mendelian Randomization for causal inference.
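As a rough illustration of two of the steps named above, the following Python sketch shows an inverse-variance-weighted fixed-effect meta-analysis (for combining per-population association estimates) and a fixed-effect inverse-variance-weighted Mendelian Randomization estimate. The function names, inputs, and toy numbers are assumptions made for illustration only; they are not the project's actual pipeline, which may use different estimators (e.g., random-effects or pleiotropy-robust methods).

```python
import math
from statistics import NormalDist


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis.

    betas, ses: per-population effect estimates and standard errors
    for the same marker (hypothetical inputs).
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = 2.0 * NormalDist().cdf(-abs(z))
    return beta, se, z, p


def ivw_mr(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted MR estimate.

    beta_exposure: instrument effects on the marker (exposure).
    beta_outcome, se_outcome: instrument effects on cancer risk
    and their standard errors. Assumes independent instruments.
    Returns (causal_estimate, standard_error).
    """
    num = sum(bx * by / s ** 2
              for bx, by, s in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / s ** 2
              for bx, s in zip(beta_exposure, se_outcome))
    return num / den, math.sqrt(1.0 / den)


# Toy usage with made-up numbers (not real data).
pooled = fixed_effect_meta([0.1, 0.3], [0.1, 0.1])
causal = ivw_mr([0.1, 0.2], [0.05, 0.10], [0.01, 0.01])
print(pooled)
print(causal)
```

The fixed-effect form assumes a common true effect across populations; where effects are expected to differ, heterogeneity tests or random-effects models would be the more appropriate choice.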

Timeline:

  • Months 1-2: Data harmonization/QC
  • Months 3-4: TWAS/MWAS/PWAS analyses, meta-analyses
  • Months 5-6: Mendelian Randomization, manuscript preparation
GREI Repository Data Sets
Open Science Framework (OSF)
Zenodo (CERN and Northwestern University)
DOI (Digital Object identifier) of GREI Repository Dataset
1. Gene expression prediction models: 1) GTEx-based models for 49 tissues: 10.5281/zenodo.3842289; 2) Multi-population models for different blood cell types: 10.5281/zenodo.7909040; 3) eQTLGen-based models: 10.17605/OSF.IO/7MXSA

2. Methylation prediction models: 1) WGBS-based models: https://osf.io/gct57/files/osfstorage and 10.5281/zenodo.8400313; 2) 450K array-based models: 10.5281/zenodo.7325054

3. Protein prediction models: 1) multi-population models: 10.5281/zenodo.10963034
Outcomes and Outputs

Our project aims to generate a comprehensive catalog of putative causal genetic, epigenetic, and protein biomarkers for multiple cancer types across diverse populations. We anticipate that this research will yield significant insights into shared biological pathways across several cancer types and enable the development of improved risk prediction models that incorporate multi-omics data. Furthermore, our findings may unveil potential novel therapeutic targets for future investigation.

To maximize the impact of our research, we will disseminate our findings through multiple channels. Primary results will be published in peer-reviewed journals, with a preference for open-access options to ensure widespread accessibility. Our project is designed to adhere to FAIR principles, ensuring that our data and findings are Findable, Accessible, Interoperable, and Reusable. All relevant data, including association summary statistics and analysis scripts, will be submitted to osf.io, leveraging its robust infrastructure for long-term data preservation and sharing. To promote Reusability, we will provide detailed data dictionaries, quality metrics, and provenance information for all datasets. Given our focus on racially and ethnically diverse populations, we are also committed to the CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics) in our research.

To address replicability and reproducibility, we will publish detailed methods and analysis protocols alongside our results. All code used in our analyses will be made available on GitHub with appropriate version control, including specific commit hashes for exact reproducibility. We will utilize containerization technologies such as Docker to ensure computational environment reproducibility, providing Dockerfiles and container images. Comprehensive documentation of all data processing steps, including software versions and parameter choices, will be provided in a structured format (e.g., Jupyter notebooks) to enable other researchers to replicate our findings.
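One possible way to capture the commit hash, software versions, and parameter choices mentioned above is a small provenance record written next to each result file. The Python sketch below is an assumption for illustration only: the helper names, the package list, and the example parameter are hypothetical and do not describe the project's actual tooling.

```python
import json
import platform
import subprocess
from importlib import metadata


def current_commit():
    """Return the current git commit hash, or None if unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a repository
        return None


def package_version(name):
    """Return the installed version of a package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def provenance_record(packages, parameters):
    """Collect environment details and analysis parameters in one dict."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "git_commit": current_commit(),
        "packages": {name: package_version(name) for name in packages},
        "parameters": dict(parameters),
    }


# Hypothetical usage: package names and the threshold are placeholders.
record = provenance_record(["numpy", "pandas"], {"pvalue_threshold": 5e-8})
print(json.dumps(record, indent=2, sort_keys=True))
```

Saving such a record as JSON alongside each set of summary statistics would let a reader match a result to the exact code version and settings that produced it.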

By implementing these rigorous standards for data sharing, ethical considerations, and reproducibility, we aim to maximize the long-term impact and utility of our research findings. This approach will ensure that our results are not only widely accessible but also reusable and reproducible by the broader scientific community, thereby advancing our collective understanding of cancer biology across diverse populations and potentially informing future personalized prevention and treatment strategies.

Impact/Scientific Significance

Our research project stands to make substantial contributions to cancer genomics, epidemiology, and precision medicine, while potentially revolutionizing cancer diagnosis, treatment, and prevention strategies across diverse populations.

In the realm of scientific advancement, our study will pioneer the integration of multi-omics data (transcriptomics, epigenomics, and proteomics) with genetic information across multiple cancer types and diverse racial and ethnic groups. This comprehensive approach will help elucidate the complex interplay between genetic and environmental factors in carcinogenesis, significantly enhancing our understanding of cancer biology. The novel application of TWAS/MWAS/PWAS in a multi-population, multi-cancer context will not only advance statistical genetics methodologies but also address the critical issue of population bias in genomic studies.

The potential clinical impact of our research is profound. The identification of causal biomarkers across diverse populations could enable the development of population-specific risk prediction models with substantially improved accuracy. These enhanced models could dramatically improve cancer screening protocols, enabling earlier detection and intervention, particularly in high-risk individuals across different racial/ethnic groups. Our findings may also uncover novel therapeutic targets, potentially catalyzing the development of innovative, population-specific cancer treatments.

In the arena of prevention, our research could transform cancer risk stratification. The identification of causal, population-specific biomarkers will allow for more targeted prevention strategies, including tailored lifestyle interventions or chemoprevention approaches based on molecular mechanisms. Our cross-cancer analysis will reveal shared causal pathways, potentially identifying common prevention strategies effective across multiple cancer types. This personalized approach to cancer prevention could significantly reduce cancer incidence rates across diverse populations.

In summary, we will create a foundation for the systematic evaluation of cancer risk across diverse groups. This approach will help reduce healthcare disparities through improved biological understanding and targeted interventions, enable the development of targeted therapies, and inform population-specific prevention strategies, ultimately advancing health equity in cancer care. Furthermore, our reproducible analytical pipelines will accelerate similar studies of other diseases, multiplying the impact of our methodological innovations.

Team

Our research team comprises two highly accomplished investigators who bring complementary expertise to this project: Dr. Chong Wu, an Assistant Professor of Biostatistics at The University of Texas MD Anderson Cancer Center, and Dr. Lang Wu, an Associate Professor and Director of the Pacific Center for Genome Research at the University of Hawaii.

Dr. Chong Wu is a biostatistician with extensive experience in statistical genetics and machine learning. His innovative work in integrating multi-omics data has led to publications in prestigious journals such as Nature Communications, Annals of Statistics, and Biometrics. His expertise in developing novel statistical methods is crucial for our project's aim to identify causal biomarkers across diverse populations and cancer types.

Dr. Lang Wu is an experienced genetic and cancer epidemiologist who studies cancer health disparities. His groundbreaking omics integration work, published in high-impact journals such as Nature Genetics, demonstrates his proficiency in applying advanced statistical methods to large-scale genomic and non-genetic data.

Our team's collaboration spans over five years, resulting in over 30 co-authored publications. This long-standing partnership has fostered a synergistic relationship, seamlessly integrating our expertise in biostatistics, genetic epidemiology, and cancer research. Our complementary skills and established collaboration provide a strong foundation for executing this innovative and impactful research project.

Considerations

The success of our research project depends on:

Data Quality and Integration: High-quality, harmonized multi-omics data across diverse populations will be ensured through stringent quality control and standardized integration protocols.

Statistical Methodology: Robust statistical methods are central to our project. Our extensive expertise in statistical analysis of multi-omics data will ensure the success of our research.

Collaboration and Communication: Maintaining effective collaboration between team members is vital. We have established regular team meetings and progress reviews to facilitate seamless coordination.

Reproducibility: Ensuring reproducible findings is vital for long-term impact. We will use version control for all analyses, containerize computational environments, and provide comprehensive documentation of our methods.

By carefully addressing these considerations, we aim to maximize the reliability, impact, and translational potential of our research findings.

Supporting Documents
Provide up to 10 resources for the evaluation of your secondary research project including but not limited to:

  • The persistent identifier of the dataset(s), other than GREI dataset DOIs already listed above, to be used in the proposed project (where available)
  • Tools/workflows or resources to be utilized in the proposed project
  • Relevant references or scientific publications that directly relate to the proposed project
Non Scored Criteria
Please complete this information. It will not be scored by the evaluation panel.
Entity Participation
Participate as an independent Team (i.e., registering as a group of individuals competing together but not on behalf of an established organization, institution, or corporation)
Research Discipline (non-scored criteria)
statistical genetics, genetic and cancer epidemiology, causal inference, multi-omics, data integration
IDeA State (non-scored criteria)
No
All Team Member Information - Name, Organization, Job Title, and Email address
1. Chong Wu, Assistant Professor, The University of Texas MD Anderson Cancer Center, Email: cwu18@mdanderson.org

2. Lang Wu, Associate Professor, Director of Pacific Center for Genome Research, University of Hawaii, Email: lwu@cc.hawaii.edu
MSI (non-scored criteria)
No
Participation in prior DataWorks! Prizes (non-scored criteria)
No
Team Point of Contact Eligibility
yes
Eligibility (non-scored criteria)
Yes, I confirm that I have read and meet the terms of eligibility for this challenge