Submission

introduction

title

Enriching Stressful Life Events collection

short description

Using data from Reddit and similar platforms, we aim to identify a collection of stressful life events relevant to contemporary society.

Phase 1 Submission Form

Overview / Abstract

Scales measuring life events, such as the Holmes and Rahe Social Readjustment Rating Scale and its current adaptations, focus on widespread events and often lack the sensitivity required for contemporary individuals and population groups. Our recent review showed that research uses only commonplace events, such as getting divorced, while ignoring events crucial for immigrants (such as forced displacement) or for non-binary individuals, to name a few examples.

Stressful life events play an important role in a range of mental health issues, suicidal behaviour, and physical conditions. Our recent project demonstrated that some events, such as losing a home or getting arrested, act as significant barriers to vital cancer screenings.

Our project proposes to leverage natural language processing and datasets based on Reddit’s depression forum, and similar sources, to update collections of stressful events. The insights gained will be invaluable in enhancing the scientific understanding of stress and its impact.

Secondary Analysis: Research Aims

1. Research Aims:

• The primary aim of this project is to update the existing scales of stressful life events so that they reflect modern realities and include minority-specific events. By doing so, we hope to enhance the relevance of these scales in health research and practice.

• We will reuse datasets based on Reddit data, such as those from the depression and suicidewatch subreddits, as our main data source. These subreddits are repositories of real-world experiences related to mental health struggles, providing unfiltered access to discussions of unique life challenges, situations, and stressful events. Our preliminary analysis showed that these datasets include unique information, for example related to discrimination and bullying, that needs to be systematically collected for future use in research and clinical practice.

• The secondary aim of this project is to provide quantifiable measures of the new stressful events captured. This will enable us to understand which events are more widespread. We understand that platforms such as Reddit have a unique demographic composition that is not generalizable to the whole population, and we will state this limitation when publishing our frequency statistics. Some subreddits are dedicated to minority groups. If we identify datasets from such subreddits in GREI repositories during the project, we will reuse them, as they are an important source of information on stressful events.

2. Data Utilization:

• In compliance with the requirements of this challenge, we will use data from the Generalist Repository Ecosystem Initiative (GREI). The primary dataset for this project contains over 6000 Reddit posts from the depression and suicidewatch subreddits. We also plan to identify and reuse additional similar datasets from other sources (for example, Twitter data related to depression and suicide). An additional depression-related dataset derived from 43 subreddits is available in another GREI repository. We believe there is plenty of data in the GREI ecosystem that we can reuse for our project.

3. Methods:

• We will apply topic modeling using Latent Dirichlet Allocation (LDA) and large language models (LLMs) to identify new and relevant stressful life events, particularly those affecting minority populations. We will experiment with different values of K (the number of topics) to capture both fine-grained and high-level groups of events. We will also experiment with different LLM prompts to make sure they are sensitive to less common events.

• We will record how many times each event is mentioned to build frequency statistics for each type of event or event group.

• After making a preliminary collection of events, we will consult our mental health and trauma healthcare team to help us group and standardize events (for example, discrimination based on ethnicity or gender might be grouped in a single block "discrimination").

• The project is expected to span 4-6 months, with 1-2 months for modeling and analysis, 2 months for validation, and 1-2 months for reporting.
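
The frequency-statistics step described above can be illustrated with a minimal sketch. It assumes that the topic modeling and LLM steps, followed by clinician review, have already produced a set of event groups with trigger phrases. The lexicon, the example posts, and the function names below are hypothetical and serve only as an illustration; they are not taken from the GREI datasets or from our final method. The sketch counts the number of posts that mention each event group at least once and the share of all posts this represents.

```python
import re
from collections import Counter

# Hypothetical event groups and trigger phrases (assumption for illustration).
# In the project these would come from LDA topics and LLM output after review
# by the mental health and trauma healthcare team.
EVENT_LEXICON = {
    "discrimination": ["discriminated", "discrimination", "racism"],
    "bullying": ["bullied", "bullying"],
    "housing loss": ["evicted", "lost my home", "homeless"],
}


def events_in_post(text, lexicon=EVENT_LEXICON):
    """Return the set of event groups mentioned at least once in a post."""
    lowered = text.lower()
    found = set()
    for group, phrases in lexicon.items():
        for phrase in phrases:
            # Word boundaries avoid matches inside longer words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                found.add(group)
                break
    return found


def event_frequencies(posts, lexicon=EVENT_LEXICON):
    """Count posts mentioning each event group and compute their share."""
    counts = Counter()
    for post in posts:
        counts.update(events_in_post(post, lexicon))
    total = len(posts)
    shares = {g: (counts[g] / total if total else 0.0) for g in lexicon}
    return counts, shares


# Made-up example posts, not real data.
posts = [
    "I was bullied at school and later discriminated against at work.",
    "We got evicted last month and I feel hopeless.",
    "Bullying never stopped, even online.",
    "Nothing specific happened, I just feel low.",
]
counts, shares = event_frequencies(posts)
for group in EVENT_LEXICON:
    print(f"{group}: {counts[group]} posts ({shares[group]:.0%})")
```

Counting each event group once per post, rather than once per occurrence, keeps a single long post from dominating the statistics. Either choice is defensible and should be reported alongside the results.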

GREI Repository Data Sets

Mendeley Data
Zenodo (CERN and Northwestern University)

DOI (Digital Object Identifier) of GREI Repository Dataset

https://doi.org/10.17632/zz8j36y24f.1
https://doi.org/10.5281/zenodo.7703193

Outcomes and Outputs

Research Findings

The primary outcome will be an updated scale of stressful life events, focusing on those relevant to minority populations. The findings will highlight previously underrepresented stressors, contributing to more effective health interventions.

Dissemination

Findings will be published in open-access journals, presented at relevant academic conferences, and shared through public health forums. In addition, we will disseminate our findings on platforms like Zenodo, Figshare, and other pertinent repositories to ensure that the data and results are freely accessible to the wider scientific community.

FAIR & CARE Principles

All datasets, analyses, and findings will be assigned persistent identifiers (e.g., DOIs) to ensure long-term discoverability. We will document the sources of data clearly (i.e., GREI repository datasets) and the methodology used, ensuring that researchers and practitioners can easily locate our work. The documentation and codebase related to data processing and analysis will be shared openly on platforms like GitHub to provide full transparency and facilitate access.

Reproducibility

We will provide the code used for our secondary analysis on GitHub, ensuring that others can replicate and build upon our work.

Impact / Scientific Significance

Scientific Contribution

The project will significantly enrich our understanding of how stressful life events are evolving, particularly for minority groups. By addressing gaps in current stress scales, this research can help create a more inclusive approach to mental health support.

Impact on Diagnosis, Treatment, and Prevention

These scales will be used by our biomedical informatics team in research on stressful life events to assess how they impact a wide range of health outcomes: from cancer screening adherence to mental health problems.

We will also work with our software team to enrich our electronic health records system so that it supports the assessment of a larger range of life stressors, with particular attention to ensuring that minority stressors are adequately captured.

The updated life event scales will enable health professionals to make more accurate diagnoses, particularly in underrepresented populations.

Team

Our interdisciplinary team comprises health informatics professionals and NLP experts with extensive experience in secondary data analysis. We successfully collaborated on previous projects involving large-scale data collection, topic modeling, and LLMs.

Our team consists of

Dr. Dmitry Scherbakov (team captain) - third-year postdoc in biomedical informatics with a strong interest in mental health disparities and how to use data science methods to address them.

Dr. Paul Heider (NLP consultant) - Assistant professor of biomedical informatics, our NLP brain and expert, with industry experience and background in linguistics.

Dr. Jihad Obeid (social determinants and NLP consultant) - Associate professor of biomedical informatics, our lead on social determinants of health data availability in electronic health records.

Considerations

Key considerations for the success of this project include ensuring ethical data collection and anonymization processes, addressing potential biases in the data, and validating the relevance of identified life events through consultation with mental health professionals.

Supporting Documents

Provide up to 10 resources for the evaluation of your secondary research project including but not limited to:
● The persistent identifier of the dataset(s), other than GREI dataset DOIs already listed above, to be used in the proposed project (where available)
● Tools/workflows or resources to be utilized in the proposed project
● Relevant references or scientific publications that directly relate to the proposed project

Supporting Document (1)

https://doi.org/10.1093/jamia/ocae023

Supporting Document (2)

https://doi.org/10.2196/preprints.66230

Supporting Document (3)

https://doi.org/10.17632/zz8j36y24f.1

Supporting Document (4)

https://doi.org/10.5281/zenodo.7703193

Non Scored Criteria

Please complete this information. It will not be scored by the evaluation panel.

Entity Participation

Participate as an Entity (i.e., registering as a group of individuals competing together on behalf of a legally established organization, institution, or corporation)

Legal Entity Organization Name

Biomedical Informatics Center,
Medical University of South Carolina
22 WestEdge St., Ste. 200
Charleston, SC 29403

Research Discipline (non-scored criteria)

Mental health
Public health
Data Science
Natural Language Processing

IDeA State (non-scored criteria)

Yes

All Team Member Information - Name, Organization, Job Title, and Email address

Dr. Dmitry Scherbakov (team leader) - Biomedical Informatics Center, Medical University of South Carolina, Postdoctoral scholar, scherbak@musc.edu
Dr. Paul Heider - Biomedical Informatics Center, Medical University of South Carolina, Assistant professor, heiderp@musc.edu
Dr. Jihad Obeid - Biomedical Informatics Center, Medical University of South Carolina, Professor, jobeid@musc.edu

MSI (non-scored criteria)

Participation in prior DataWorks! Prizes (non-scored criteria)

Team Point of Contact Eligibility

yes

Eligibility (non-scored criteria)

Yes, I confirm that I have read and meet the terms of eligibility for this challenge
