Submission

introduction
title
geoPIPE: reusing open data and letting data flow
short description
We developed a pipeline for enriching open data streams with geospatial analyses and natural language processing.
Submission Form - Scored Questions
Please complete all questions related to your project. All questions in this section are scored.
Category of submission
Data reuse
Overview / Abstract

We present our work in developing an open-source tool designed to encourage the reuse of open data released to the public through open data portals.  geoPIPE (see Supporting Information #1) is our custom geospatial pipeline for enhancing open data that works by sending the data through a series of tasks to transform raw data into actionable data and to help fill the gap between open data release and data use. We focus on geospatial and natural language processing (NLP) tasks which generate variables commonly required in biomedical studies. By letting data flow through geoPIPE, a derived data set specifically designed for reuse is generated for research.

geoPIPE makes using patient location data easier for researchers interested in how location impacts public health by facilitating geocoding and linking to commonly used reference data. The coordinates can be intersected with neighborhood boundaries from the US Census Bureau and the associated social environmental determinants of health.  Our geoPIPE methods apply to location-based analyses of any biomedical concept and, as an example, we highlight our impact on opioid research. Our NLP tasks convert text fields, such as cause of death, into indicator variables that signal the presence of a desired term, such as cocaine or opioids.

geoPIPE was constructed using open data and open-source libraries and in return, we provide our code, tools, and open data back to the biomedical research community. 

Data sharing or reuse recipe title

geoPIPE: open-source software for enriching open data with geospatial analyses and natural language processing

Data Sharing or Reuse Practices

geoPIPE (Geospatial Pipeline for Enhancing Open Data for Substance Use Disorders Research) provides infrastructure for deriving important location-based data from open data sets (See Supporting Information #1 – which was presented at the 2022 AMIA Annual Symposium). This tool is designed to work with popular open data application programming interfaces (APIs), such as Socrata. We expanded overdose death records from Cook County, Illinois with geoPIPE to geocode records, calculate distances to nearby points of interest such as pharmacies, and calculate contextual information such as land use classifications. We also extracted drug names from the cause-of-death fields to add binary variables for each drug found. For example, if the cause of death mentioned 'cocaine', a binary variable for 'cocaine' was created and set to true. 
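The binary drug variables described above can be illustrated with a minimal sketch. This is not geoPIPE's actual code: the field name `cause_of_death`, the term list, the function name, and the sample record are all hypothetical and made up for illustration.

```python
import re

# Hypothetical term list for illustration; not taken from geoPIPE's configuration.
DRUG_TERMS = ["cocaine", "fentanyl", "heroin"]


def add_drug_indicators(record, text_field="cause_of_death", terms=DRUG_TERMS):
    """Return a copy of `record` with one boolean field per drug term."""
    # Treat a missing or null text field as an empty string.
    text = (record.get(text_field) or "").lower()
    enriched = dict(record)
    for term in terms:
        # Word-boundary match so a term only counts when it appears as a whole word.
        enriched[term] = re.search(rf"\b{re.escape(term)}\b", text) is not None
    return enriched


# Made-up record; the field names are assumptions, not the Cook County schema.
record = {"case_id": "A-1", "cause_of_death": "Combined cocaine and fentanyl toxicity"}
enriched = add_drug_indicators(record)
print(enriched["cocaine"], enriched["fentanyl"], enriched["heroin"])
```

The original record is left unchanged and the indicators are added to a copy, which keeps the raw open data separate from the derived fields.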

We demonstrated the power of our methods by providing hotspot maps of fatal overdoses and maps of distances between those overdoses and pharmacies (see figures in Supporting Information #1). These maps show how overdoses cluster geospatially, and these hotspots can then be tied back to location-based social determinants of health (SDOH). Pharmacies carry naloxone, a life-saving opioid overdose reversal drug; short distances between overdoses and pharmacies indicate missed opportunities for overdose intervention; long distances indicate future opportunities in community planning of resources aimed at lowering the occurrence of opioid overdose. A key impact of our tool development is that contextual factors based on location or proximity are now unlocked and available for researchers across a myriad of disciplines; in particular, environmental factors are crucial for research on substance use disorders. Most importantly, we returned our expansion of Cook County’s data back to our open repository and made the platform itself available as open source, so that any researcher can use these tools and data to expand other open data sets for the public good. Our tool updates our open data releases weekly to our GitHub repository (see Supporting Information #2).  geoPIPE makes the 'recipe' for data reuse simple: a configuration file controls what data is being pulled and what tasks should occur (geospatial, natural language processing, data linkage). In addition to Cook County, Illinois, we also provide examples in our repository of how to use geoPIPE with open data portals for Milwaukee County, Wisconsin and San Diego County, California.
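As a rough illustration of the overdose-to-pharmacy distances mentioned above, the sketch below computes the great-circle (haversine) distance from one geocoded point to the nearest of several candidate points. The haversine formula is an assumption chosen for simplicity; the source does not state which distance method geoPIPE uses, and the function names and coordinates here are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_distance_km(point, candidates):
    """Distance from `point` to the closest candidate, or None if there are none."""
    if not candidates:
        return None
    return min(haversine_km(point[0], point[1], c[0], c[1]) for c in candidates)


# Made-up coordinates for illustration only.
incident = (41.88, -87.63)
pharmacies = [(41.90, -87.65), (41.75, -87.60)]
print(round(nearest_distance_km(incident, pharmacies), 2), "km to nearest pharmacy")
```

A brute-force minimum is adequate for small point sets; a spatial index would be the usual choice at larger scale.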

We believe that the future of biomedical research depends on open data adhering to the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles. Our tools are available online for anyone to download and use. Documentation is provided online; our open data contributions have data dictionaries. We also acknowledge the importance of the CARE (Collective benefit, Authority to control, Responsibility, Ethics) Principles for Indigenous Data Governance. Because important location-based SDOH information is often missing from open data, geocoding helps unlock location-based research. Note that this is not necessarily a failure of the maintainers of open data; the data was released for transparency in good faith that it would be useful to society; our tool closes the information gap between raw data and data typically needed by biomedical researchers. Our goal is to support health equity research by providing important contextual SDOH to all researchers, which especially benefits those without the capacity or time to process raw data with geospatial or natural language processing libraries. 

We believe that open-source software is the key to reproducible and equitable science, where tools are publicly available and do not act as a barrier in adoption of research ideas. Commercial products naturally serve their role in the technology and science domains, although cost can limit who is capable of participating and prevent equal access to research. Our software facilitates data reuse by enriching existing open data sets with sharable information that biomedical researchers typically need for research studies (such as SDOH). Our software is layered on top of existing open-source software, and in return, we also release our work as open-source so others may benefit or join us in collaborative efforts.

Impact

The components of our pipeline use only open-source software and libraries. This minimizes the cost of working with location data and helps ensure equal opportunity. Commercial products can perform some of the tasks discussed earlier, such as geocoding, but they are costly and require specific training and expertise to use successfully.

Our work has a catalytic impact on biomedical research at large; in particular, we promote the exploration and adoption of open data practices and open-source software for those interested in identifying and understanding neighborhood-level contextual factors associated with patient location. Geocoding is not new, but it can be labor intensive, and our tools make the process of geocoding and obtaining results easier at scale. The pipeline discussed was developed to act as local infrastructure to support geospatial biomedical science and to contribute data to our local research team and the public. 

We generate results that other research studies may use. We provide our results as open data to match the sentiment of the original open data; this saves others time and effort. Because our open data sets are updated weekly, we can continuously push new results to the public. Our tools promote data reuse by enriching open data sets with useful variables needed for research; we unlock data hiding in its raw form by processing it with geospatial and natural language processing libraries. This pushes the data into research arenas capable of using it off-the-shelf for the greater good of society. Geospatial analyses of patient addresses unlock pivotal contextual SDOH data and open the door for testing regional differences observable in research studies.

We encourage other researchers to explore and adopt open data practices. Our collaborators at the Medical Examiner’s Office of Cook County, home of Chicago, were pioneers in releasing data records as open data. In early 2022, we began discussions with other medical examiner offices to collaborate and share data; having Cook County as a case study helps in assuring people that data can be shared safely without issue. The most compelling part of our data sharing journey is that our contributions allow the processing of address data from open data portals and the conversion into important SDOH data to give a full 360-degree view of a person’s living environment; the added benefit is that the resulting SDOH data are less sensitive than exact addresses and easily shared.

How to learn from this project

Our tool depends purely upon open-source software that is freely available and usable by anyone. geoPIPE tools use publicly available open data sets and our code is published and available online (see Supporting Information 1 and 2).  Because our tools are all open-source, replication is as easy as downloading the tool’s code and running it. Additionally, our results are posted weekly into our GitHub repository as an open data release; researchers may download and use these files directly without needing to replicate any of the processing that generated the data’s additional fields.

We encourage others to use open-source software, packages, or code and to make their own usage and development open-source as well. We also advocate for people to adopt open data practices when possible. Our geoPIPE tool encourages open data as an output that can comply with FAIR standards, where documentation, data, and code are all available in the same repository. Our Cook County expansions via geoPIPE (discussed in question 1 above) are updated weekly with full transparency about where the Cook County data originated and where the code required to generate our contributed fields is located. We have since added examples of using other open data sets with our geoPIPE tool that provide the same calculated fields in different contexts; our hope is that interested researchers in the biomedical community will be able to substitute the data set of their interest to take advantage of our open-source tools and utilize our output for their specific research tasks or questions. Our success with the Cook County Medical Examiner’s Office has enabled us to begin conversations with other counties with open data portals that would benefit from our tools; some of these conversations have led to collaborative opportunities and will lead to future funded research projects. In addition to Cook County, we have also posted results for the San Diego Medical Examiner’s Office in California and the Milwaukee County Medical Examiner’s Office in Wisconsin.

We will continue to work on open-source geospatial tools to advance both the importance and ease of sharing location-based data for research.

Adoption of practice by peers

If selected as an awardee, we would promote our team’s success and our solution via social media and actively recruit collaborators interested in applying geoPIPE to new data sets. We would participate in DataWorks! events to promote our tool and to demonstrate how customizable our solution can be when adopting a new data set.  We would be thrilled to find strategic partners among public health offices that wish to explore open data.

We are working on a manuscript that details key considerations when adopting a new data set and processing it with geoPIPE. The fundamental issue is that there may be variation in what is included in the open data set.  We countered this limitation by having geoPIPE run on configuration files that are easily adjustable to a new environment and a new data set.  If a new data set does not have address data, the configuration file will allow the geospatial analysis components to be disabled. Similarly, the natural language processing components may be disabled if the source data does not have unstructured text fields. Furthermore, the specific topics to be searched within the unstructured text fields are highly configurable.  A simple JSON file holds entries corresponding to a list of lexicons for specific topics; for example, a stimulant category may contain 'cocaine', 'amphetamine', and various other drugs.  The NLP component will attempt to detect misspellings in the data, such as 'cociane' instead of 'cocaine', using string similarity metrics, and will report matches within a configurable threshold of scores. We made several examples of how to configure this feature publicly available on our GitHub site.  We advocate that peers review and become familiar with the data they wish to process with geoPIPE, as familiarity with the data will make customizing the configuration a simple exercise.
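The configuration-driven behavior described above might look roughly like the following sketch. Every key name in the JSON (`enable_geospatial`, `enable_nlp`, `similarity_threshold`, `lexicons`) is a hypothetical stand-in rather than geoPIPE's actual schema, and Python's standard `difflib.SequenceMatcher` is used here as one possible string similarity metric; the source does not say which metric geoPIPE uses.

```python
import json
import re
from difflib import SequenceMatcher

# Hypothetical configuration; the key names are assumptions, not geoPIPE's real schema.
CONFIG_JSON = """
{
  "enable_geospatial": false,
  "enable_nlp": true,
  "similarity_threshold": 0.8,
  "lexicons": {
    "stimulant": ["cocaine", "amphetamine"],
    "opioid": ["fentanyl", "heroin"]
  }
}
"""


def match_categories(text, lexicons, threshold):
    """Flag each category whose best token-to-term similarity meets the threshold."""
    tokens = re.findall(r"[a-z]+", text.lower())
    flags = {}
    for category, terms in lexicons.items():
        best = 0.0
        for term in terms:
            for token in tokens:
                # ratio() is 1.0 for identical strings and lower for near-misses.
                best = max(best, SequenceMatcher(None, token, term).ratio())
        flags[category] = best >= threshold
    return flags


config = json.loads(CONFIG_JSON)
flags = {}
if config["enable_nlp"]:  # the NLP step is skipped entirely when disabled
    flags = match_categories(
        "Acute cociane toxicity", config["lexicons"], config["similarity_threshold"]
    )
print(flags)
```

Lowering the threshold catches more misspellings at the cost of more false matches, which is one reason reviewing the source text fields before tuning the configuration is worthwhile.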

We wish to show the research potential of open data and want to be advocates for other jurisdictions releasing open data. Chicago, San Diego, and Milwaukee are great examples of open data working at very large scales, and we want other cities to follow in their footsteps.  We would love to use the momentum of becoming an awardee to approach leaders in our state to adopt open and transparent data practices.

Team

Our team consists of research collaborators from the Institute for Pharmaceutical Outcomes and Policy (IPOP) at the University of Kentucky (UK).  Our team has previously collaborated on several open-source geospatial initiatives and believes that open data and transparency are the future of biomedical research. 

Dr. Daniel Harris is an early-stage investigator, a computer scientist, the Director of Clinical Research Analytics for IPOP, and the Director of Data Management and Security for the Kentucky Injury Prevention and Research Center. Dr. Harris was born in rural KY and is interested in researching the impact of location on health and equitable access to resources. Nicholas Anthony is a data scientist and software developer from Florida who works remotely for IPOP.  Dr. Chris Delcher is an epidemiologist and the Director of IPOP; he oversees the institute and its research direction.  

IPOP assists in operating a HIPAA-compliant data center on campus, which manages several commercial data sets and data from our university’s health system.  The mode of data sharing is driven by the sensitivity of the data. Sharing HIPAA-protected data sets with external partners requires legal agreements, while the other less sensitive derived data sets that we maintain are contributed to open repositories.  We firmly believe that open data is important to fairness and equity in any research domain; we have invested and volunteered countless hours in developing open-source tools to make working with open data easier.

Prior to our work, our healthcare system had no geospatial capabilities due to privacy concerns; our results are added to our data warehouse and made available to anyone on campus or to our research partners. 

Key principles
Open data; transparency; open-source; replicability
Video: How to learn from this practice
Research disciplines
biomedical informatics; substance use disorders; GIS
Supporting Documentation
Include links to relevant and publicly accessible website page(s), up to three relevant publications that resulted from using your data sharing or reuse recipe or that were integral to the development of these practices, and/or up to three relevant resources.
Team information - Not Scored
Please respond to these questions related to team participation in the challenge. Your responses to questions in this section will not be scored by judges.
Entity Participation
My team is participating as an independent team
IDeA State Status (not scored)
n/a, I am not participating as part of an entity
Minority Serving Institution (not scored)
n/a, I am not participating as part of an entity
Participation in 2022 DataWorks! Prize
yes, our team captain or majority of our team participated in the 2022 DataWorks! Prize
2022 DataWorks! Prize Team name
POP-CATS
Eligibility Requirements
yes, I have read and understand the eligibility requirements