FAIR annotated dataset of stroke MRIs and metadata
short description
We created and shared a dataset of 2,888 acute stroke MRIs with lesion annotations and descriptions, linked to the patients' clinical profiles.
Submission Details
Please complete these prompts for your round one submission.
Submission Category
Data sharing
Abstract / Overview

We included participants based on the records of our certified Comprehensive Stroke Center (CSC). The CSC provided us with the standardized demographic and basic clinical profiles of all patients admitted with a diagnosis of acute stroke. The brain MRIs performed at admission were transferred to a repository under the University firewall. The images were de-identified and defaced, and each received an 8-digit identifier linked to the metadata. The stroke core was annotated, a structured radiological description was recorded, and the images were organized according to Brain Imaging Data Structure (BIDS) guidelines. The dataset is publicly shared in a professional repository (ICPSR) and can be accessed for research under a Data Use Agreement.
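To illustrate the naming scheme described above, here is a minimal Python sketch mapping a de-identified 8-digit identifier to a BIDS-style path. The identifier value and the session/modality choices (ses-01, dwi) are illustrative assumptions; only the sub-/ses- entity pattern comes from the BIDS specification.

```python
# Sketch: build a BIDS-style NIfTI path from a de-identified 8-digit ID.
# The ID and the ses-01/dwi choices are illustrative, not the dataset's
# actual values; the sub-/ses- naming pattern follows the BIDS spec.
from pathlib import Path

def bids_path(subject_id: str, session: str, suffix: str) -> Path:
    """Return a path like sub-00001234/ses-01/dwi/sub-00001234_ses-01_dwi.nii.gz"""
    if not (subject_id.isdigit() and len(subject_id) == 8):
        raise ValueError("expected an 8-digit de-identified subject ID")
    sub, ses = f"sub-{subject_id}", f"ses-{session}"
    return Path(sub, ses, suffix, f"{sub}_{ses}_{suffix}.nii.gz")

print(bids_path("00001234", "01", "dwi").as_posix())
# → sub-00001234/ses-01/dwi/sub-00001234_ses-01_dwi.nii.gz
```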


We were able to overcome the challenges of organizing and sharing a clinical dataset through the joint effort of three groups with complementary expertise: 1) The Stroke Center (represented by Dr. Johnson, CSC co-director) regularly records the admission and discharge basic clinical profiles of all patients admitted, as part of its participation in the national “Get with the Guidelines” (GWTG) stroke program. This provided a centralized record of all potential participants and eliminated the burden of reviewing medical records. 2) The Radiology Department (Dr. Faria, team leader) and the University IT provided the infrastructure to automatically transfer, anonymize, and archive the large image sample behind a secure firewall. Dr. Faria's extensive experience in imaging analysis and in sharing both data and image processing tools, and her collaboration with the Biomedical Engineering and Neurology Departments, allowed her to execute and oversee the practical plan to create the dataset. 3) The JHU Libraries Data Services (Dr. Fearon and Dr. Lawson) helped implement the innovative workflow for certifying data privacy, mediated by the JH Medicine Data Trust Committee, providing expertise for disclosure risk review. They also supported the data deposit at ICPSR.

Potential Impact

Clinical data and MRIs were obtained retrospectively from patients admitted between 2009 and 2019 to the Johns Hopkins CSC. The goal is to provide high-quality, large-scale, human-supervised knowledge to feed artificial intelligence (AI) models and to enable further development of tools that automate tasks currently relying on human labor, such as lesion segmentation, labeling, and calculation of disease-relevant scores. The goal is also to provide a valuable training and testing resource for translational research relating lesion features to risk factors, brain functions, and patient outcomes.

We adopted all the practices recently recommended by the medical (including NIH) and informatics communities. For example, the metadata structure follows medical lexicons such as RadLex and UMLS, to facilitate text indexation and Natural Language Processing. The data organization, in the BIDS recommended format, is compatible with, or can be easily converted to, newly developed semantic standards. This provides the critical capability to generate "computable data objects" that can be readily used by the AI community and are organized in a user-friendly way that improves access for non-expert data analysts. It also makes it easy to integrate with other ongoing open science efforts, application program interfaces, and modules for quality control and harmonization. These capabilities enable the use of this resource for discovery, for data synthesis and augmentation, and for reproducibility and replication tests.

In addition to these practices, we recommend searching for nationally scoped projects (like the CSCs) that collect standardized records following community consensus. This avoids the burden and complexity of mining local free-text sources and medical records. We also recommend partnering with librarians and data services, and ensuring the data is deposited in an appropriate repository. This has the dual benefit of offsetting the workload and cost of data maintenance and hosting, which is often underestimated by researchers, while also providing mechanisms for data usage tracking and reporting: invaluable feedback for the dataset developers.

Our story demonstrates how available information and services can be reused to generate new data valuable to diverse communities, and that, despite technical and regulatory hurdles, it is possible to share clinical data under FAIR principles, opening endless opportunities for clinical, translational, and biomedical research.


All the standards employed to create this dataset followed community consensus. Our data source is part of the GWTG program, which includes over 2,000 hospitals that have entered more than 5 million patient records into the national data source. Therefore, the workflow for data inclusion and metadata collection is highly feasible for other institutions. The metadata is organized in a widely used, open format (.tsv) and is accompanied by a dictionary; readme and change files explain details and updates.
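As a small illustration of how such .tsv metadata can be consumed with standard tooling, the sketch below parses a participants-style table with Python's built-in csv module. The column names and values are invented for illustration and are not the dataset's actual schema.

```python
# Sketch: parse a BIDS-style participants .tsv with the stdlib csv module.
# Column names (age, nihss_admission) and values are illustrative only.
import csv
import io

TSV = "participant_id\tage\tnihss_admission\nsub-00001234\t67\t12\n"

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
print(rows[0]["participant_id"], rows[0]["nihss_admission"])
# → sub-00001234 12
```

Because the format is plain tab-separated text with a header row, the same file loads directly into R, pandas, or a spreadsheet, which is part of what makes the choice of .tsv friendly to non-expert analysts.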

Regarding the images, the adoption of BIDS guidelines guarantees a comprehensive data structure and naming scheme, compatible with further steps of image analysis. All the technical procedures (e.g., image de-identification, defacing, conversion to the recommended format, and annotation) were performed with free, publicly available software (e.g., dcm2niix, fsl_deface, MRIcron, ROIEditor) on computers with regular configurations, and are therefore broadly replicable. A manuscript linked to the database (referenced below) describes in detail each step of the data organization, post-processing, and annotation. Several tools to automate image analysis and facilitate clinical research derived from this dataset (e.g., the "ADS" and "Arterial Atlas" referenced below) are publicly shared and extensively documented on platforms such as NITRC, GitHub, and Zenodo. Therefore, not only the procedures to create the database but also the derived tools are highly replicable.
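A hedged sketch of how the conversion and defacing steps might be chained with the free tools named above: the code only assembles the command lines without running them. The dcm2niix flags shown are common ones, and the fsl_deface call signature and all paths are assumptions to verify against each tool's documentation and installed version.

```python
# Sketch: assemble (but do not execute) command lines for DICOM→NIfTI
# conversion with dcm2niix and defacing with fsl_deface. All paths are
# placeholders; flags should be checked against the installed versions.
def build_commands(dicom_dir: str, out_dir: str, subject: str) -> list:
    nifti = f"{out_dir}/{subject}.nii.gz"
    defaced = f"{out_dir}/{subject}_defaced.nii.gz"
    convert = ["dcm2niix",
               "-z", "y",        # gzip-compress the NIfTI output
               "-f", subject,    # output filename template
               "-o", out_dir,    # output directory
               dicom_dir]        # input DICOM folder
    deface = ["fsl_deface", nifti, defaced]  # assumed input/output order
    return [convert, deface]

for cmd in build_commands("dicom/raw", "nifti", "sub-00001234"):
    print(" ".join(cmd))
```

In practice each list could be passed to `subprocess.run`, which keeps the pipeline scriptable on an ordinary workstation, consistent with the "computers with regular configuration" point above.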

Potential for Community Engagement and Outreach

Medicine has seen a shift from anecdotal experience to objective, evidence-based practice supported by large amounts of data. In addition, AI development depends on the availability of high-quality, large-scale, human-supervised datasets to generate and test meaningful and reproducible models. The implementation of this project, and the opportunities it revealed, strengthened the collaboration among our clinical, translational, and bioengineering communities. This project also helped improve the workflow for privacy certification of shared patient data. Our compliance approach may be valuable to other institutions coping with NIH's expanded data sharing policy.

Sharing your data is the most effective way to prove its value and to inspire others to do the same. It also connects you to the community, increases the visibility of your work, and improves your quality indices, such as citations. We also believe that large datasets such as the one presented here offer limitless opportunities that we, or any other group alone, cannot completely explore. Sharing data is therefore one of the major (if not the major) contributions to scientific advancement; it is a scientific and ethical responsibility.

Supporting Information (Optional)
Include links to relevant and publicly accessible website page(s), up to three relevant publications, and/or up to five relevant resources.
Supporting Documentation 01
Supporting Documentation 02
Supporting Documentation 03
