StrokeFAIR: a public dataset and analytical tools
short description
We share FAIR images, metadata, and analytical tools for acute brain stroke, democratizing avenues to perform reproducible, reliable research.
Submission Form - Scored Questions
Please complete all questions related to your project. All questions in this section are scored.
Category of submission
Data reuse
Overview / Abstract

Extracting meaningful and reproducible models of brain function from stroke images, for both clinical and research purposes, is a daunting task, severely hindered by the great variability in lesion frequency and patterns. Large datasets are therefore imperative, as are fully automated image post-processing tools to analyze them. We created and shared public resources that consist of 1) a dataset of 2,888 multimodal clinical MRIs of patients with acute and early subacute stroke, with manual lesion segmentation and metadata, and 2) user-friendly tools for processing and inspecting this dataset (and other stroke data), accessible to non-experts. The dataset provides high-quality, large-scale, human-supervised knowledge to feed artificial intelligence models and enable further development of tools to automate several tasks that currently rely on human labor, such as lesion segmentation, labeling, calculation of disease-relevant scores, and lesion-based studies relating function to lesion frequency maps. The tools for data processing and analysis improve the compliance of the dataset with FAIR principles, enable federated analysis, and lower the barrier for non-expert usage and for producing reliable, reproducible, collaborative research.

Data sharing or reuse recipe title

We share a dataset of 2,888 patients with acute and early subacute stroke. It includes demographic information, basic clinical profile, diverse protocols and MRI modalities, all in a structured format (e.g., BIDS), following community consensus. We also created and share analytical image tools that improve the compliance of the dataset with FAIR principles, enabling federated analysis and secondary data sharing, and democratizing access for researchers of diverse backgrounds and assets.

Data Sharing or Reuse Practices

The resource consists of multimodal clinical MRIs of patients admitted with acute or early subacute stroke at our Comprehensive Stroke Center, CSC, between 2009 and 2019. The CSC provided us with the standardized de-identified patient records, comprising demographic information, basic clinical profile, including National Institutes of Health Stroke Scale (NIHSS) scores, hospitalization duration, biometric screening at hospital admission and discharge, and associated health conditions. The brain MRIs performed at admission were transferred to a repository under the University firewall. The images were de-identified and defaced and received an 8-digit identifier linked to the metadata. The stroke core was annotated, and a structured radiological description was recorded. To expand access to the dataset, in addition to the images in native space, we offer the images mapped to common coordinates and intensity normalized, since these are common steps required by most image processing pipelines. The data format and organization follow the Brain Imaging Data Structure, BIDS, guidelines (Gorgolewski, Sci. Data 2016, 3, 1–9), facilitating navigation and sharing. This is all aligned with the FAIR principles of interoperability and machine readability.
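As an illustration of the organization described above, the following minimal sketch builds BIDS-style relative paths from an 8-digit identifier. The mapping of image types to folders and suffixes and the function name are assumptions made for this example; the layout actually distributed with the dataset may differ.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from image type to BIDS data-type folder and suffix.
# The modalities and names actually distributed with the dataset may differ.
BIDS_LAYOUT = {
    "dwi": ("dwi", "dwi"),
    "flair": ("anat", "FLAIR"),
    "t1w": ("anat", "T1w"),
}


def bids_image_path(subject_id: str, image_type: str) -> PurePosixPath:
    """Return a BIDS-style relative path for one image of one subject."""
    if not (subject_id.isdigit() and len(subject_id) == 8):
        raise ValueError("expected an 8-digit identifier")
    folder, suffix = BIDS_LAYOUT[image_type]
    label = f"sub-{subject_id}"
    return PurePosixPath(label) / folder / f"{label}_{suffix}.nii.gz"


# Prints: sub-00001234/anat/sub-00001234_FLAIR.nii.gz
print(bids_image_path("00001234", "flair"))
```

Because every path is derived from the identifier alone, the same identifier can be used to look up the corresponding row in the metadata table.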

All the standards employed to create this dataset followed community consensus. Our data source is part of the “Get with the Guidelines”, GWTG, stroke program that includes over 2,000 hospitals that have entered more than 5 million patient records in the national data source. Therefore, it is highly feasible for other institutions to reuse the workflow for data inclusion and metadata collection to build datasets interoperable with ours. The metadata is organized in a widely used, free format (.tsv, a non-proprietary alternative to Excel spreadsheets) and is accompanied by a dictionary; readme and change files explain details and updates. All the technical procedures (e.g., image de-identification, defacing, conversion to recommended format, and annotation) were performed with free, publicly available software (e.g., dcm2niix, fsl_deface, MRICron, ROIEditor), on computers with standard configurations. Therefore, the whole process is readily replicable and constitutes a new and freely available workflow for the creation of datasets according to FAIR principles that can be widely used by others. A manuscript linked to the database (Scientific Data, 2023. DOI 10.1038/s41597-023-02457-9) describes in detail each step of the data organization, post-processing and annotation.
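To show how a .tsv table and its dictionary can be used together, here is a minimal sketch that reads a tab-separated metadata table and checks that every column is documented. The column names, values, and dictionary entries are invented for illustration and are not taken from the released files; the only convention assumed is the BIDS practice of marking missing values as `n/a`.

```python
import csv
import io

# Hypothetical excerpt of a metadata table; columns and values are illustrative.
METADATA_TSV = (
    "participant_id\tage\tsex\tnihss_admission\n"
    "sub-00001234\t67\tF\t5\n"
    "sub-00005678\t54\tM\tn/a\n"
)

# Hypothetical data dictionary: one description per column.
DICTIONARY = {
    "participant_id": "8-digit identifier linking images and metadata",
    "age": "Age at admission, in years",
    "sex": "Sex recorded at admission",
    "nihss_admission": "NIHSS score at admission",
}


def read_metadata(text):
    """Parse a TSV string into a list of dicts, mapping 'n/a' to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {key: (None if value == "n/a" else value) for key, value in row.items()}
        for row in reader
    ]


def undocumented_columns(rows, dictionary):
    """Return the column names that have no entry in the dictionary."""
    columns = rows[0].keys() if rows else []
    return sorted(name for name in columns if name not in dictionary)


rows = read_metadata(METADATA_TSV)
print(len(rows), "records;", "undocumented:", undocumented_columns(rows, DICTIONARY))
```

A check of this kind is one simple way for an institution reusing the workflow to confirm that its own table stays consistent with its dictionary.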

Because these data are organized under waiver of patient consent, they cannot be released in open-source repositories. We partnered with our data librarians to find a repository for restricted data that would offer free access, imposing minimal yet ethically responsible rules for data access. We chose ICPSR, a professional repository with more than 60 years of experience in data archiving and organization, recommended by the new NIH guidelines for data sharing. This has the dual benefit of offsetting the workload and cost of data maintenance and hosting, which is often underestimated by researchers, while also providing mechanisms for data usage tracking and reporting, invaluable feedback for the dataset developers. The data are directly downloadable from the ICPSR website, under a Data Use Agreement (DUA) for IRB-approved researchers. There are no requirements in the DUA other than to cite the resource in future research products.

We also created and publicly share resources to improve the compliance of the dataset with FAIR guidelines and to reflect the best of the Transparency, Responsibility, User focus, Sustainability, and Technology (TRUST) principles. These are free, easy-to-use tools for visualizing, adjusting, preprocessing, and inspecting this dataset and other stroke data. The most representative are the “Acute-stroke Detection and Segmentation” (ADS) tool and the 3D digital brain atlases of vascular territories (listed in the supporting documentation). All these tools were created with public coding platforms and resources (mostly in Python) and are available on public sites such as the Neuroimaging Tools and Resources Collaboratory (NITRC) and GitHub. These tools transform the data into readily usable “data computable objects”, reducing the technical barriers and democratizing the use of the resource. Therefore, users with a purely clinical background, or with purely neuroimaging expertise, can understand and use the data, even if unfamiliar with all aspects of the other area, producing reliable and reproducible research. This also removes HIPAA restrictions, making the post-processed data openly sharable and facilitating collaboration in a “federated” approach.


This resource enables clinical researchers to advance in clinical modeling and prediction. The data organization, in the BIDS-recommended format, is compatible with, or can be easily converted to, newly developed semantic standards, such as the NeuroImaging Data Model (NIDM). This provides the critical capability to generate "computable data objects" that can be readily used by the AI community and are organized in a user-friendly way to improve access for non-expert data analysts. It also makes it easy to integrate with other ongoing open science efforts, analytical pipelines (such as in brainlife), application program interfaces (APIs), modules for quality control and harmonization, and indexing and management engines. These capabilities enable the use of this resource not only for discovery, but also for data synthesis and augmentation, and to aid reproducibility and replication studies.

Specifically, the dataset presented here could be used to train, test, or apply transfer learning to algorithms for lesion segmentation, providing highly important metrics for acute treatment, such as the volume of the ischemic core and perfusion deficits. For example, we developed a public, user-friendly tool for ischemic lesion segmentation and quantification and created the first public digital atlas of brain arterial territories. This dataset and derived tools have also been used to examine the overlap of the lesion with specific brain structures, enabling lesion-symptom mapping and the automated calculation of relevant clinical scores, some of them crucial to decisions about acute treatment. They also enabled us to develop image retrieval engines and to generate automated radiological reports. We also used the dataset and tools as training and testing resources to study anatomic-functional relations, to explore bias in clinical measures, to study population trends, to test hypotheses developed in smaller external datasets, and to develop prognostic markers.
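The two quantities mentioned above, lesion volume and the overlap of a lesion with labeled brain regions, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the method used by our tools: the mask and atlas are given as flat sequences of voxel values already aligned in the same space, label 0 is treated as background, and the function names are hypothetical. In practice the arrays would be loaded from NIfTI files with an imaging library.

```python
from collections import Counter


def lesion_volume_ml(mask, voxel_size_mm):
    """Volume of a binary lesion mask in milliliters.

    mask: flat iterable of 0/1 voxel values.
    voxel_size_mm: (dx, dy, dz) voxel dimensions in millimeters.
    """
    dx, dy, dz = voxel_size_mm
    n_voxels = sum(1 for value in mask if value)
    return n_voxels * dx * dy * dz / 1000.0  # 1 mL = 1000 mm^3


def lesion_by_region(mask, atlas):
    """Count lesion voxels per atlas label, ignoring background (label 0)."""
    return Counter(
        label for value, label in zip(mask, atlas) if value and label != 0
    )


# Toy example: six voxels of 2 x 2 x 2 mm, two hypothetical atlas regions.
mask = [0, 1, 1, 1, 0, 1]
atlas = [1, 1, 2, 2, 2, 0]
print(lesion_volume_ml(mask, (2, 2, 2)))   # 4 lesion voxels * 8 mm^3 = 0.032 mL
print(dict(lesion_by_region(mask, atlas)))  # {1: 1, 2: 2}
```

Summary measures of this kind contain no images, which is what makes them convenient to exchange between sites in a federated analysis.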

This data sharing also enabled various collaboration efforts, both at our university and at national and international levels, that were never possible before. At our University, we collaborate with the School of Biomedical Engineering (BME), the Department of Applied Mathematics, the School of Engineering, the School of Computational Science, the School of Statistics and Public Health, and several departments within the School of Medicine. Externally, we collaborate on theses (e.g., students in San Jose University, CA) and on clinical and engineering studies and projects (e.g., with the University of South Carolina, to create open-source and free solutions to problems unique to stroke research; with the Mayo Clinic, for lesion-mapping studies; with the University of Sussex, UK, for AI modeling; and with the University of Campinas, Brazil, for analysis of clinical markers). Our dataset-related tools are also being applied to process and analyze other datasets (e.g., the Aphasia Research Cohort (ARC) repository). Thanks to the 2022 DataWorks! Challenge, we now collaborate with the BrainChart group, which we met at the winners' seminar, to apply the best of our expertise to both the BrainChart and StrokeFAIR projects. This sharing also promoted partnerships with the private sector to accelerate the development of reproducible and accessible tools and models to analyze clinical stroke images.

At large, the sharing of this dataset and associated tools addresses an important equity problem. It allows underfunded institutions to perform similar research, to analyze similar metrics in their population, or to test their own hypotheses, reducing the bias in medical research and in associated AI methods. It lowers the barriers for data usage, ensuring access not only for the savviest of scientists and clinicians, but also for creative researchers with a clinical background, who may find the terminology and tools used for imaging analysis daunting, and for neuroimaging experts and computer scientists, who may be unfamiliar with brain injury. Finally, it empowers reusable clinical neuroimaging datasets, as we provide a robust yet simple and extendable framework for analyzing clinical data and, potentially, for users to share their data. In summary, it shifts the current research paradigms for clinical neuroimaging data by enabling sharing, reproducibility, and reuse.

How to learn from this project

The most important lesson we learned, which we deeply hope to share with others, is that it is possible, feasible, and extremely valuable to share clinical data. Even when the data are collected under waiver of patient consent, they can still be shared with minimal restrictions that do not limit access. For cases in which the restrictions are still prohibitive, we developed user-friendly tools to implement a federated approach that enables collaborators to apply similar pipelines to their own data.

The methods for the creation of this dataset are largely reproducible and can be completely replicated, and hopefully optimized, by others. It is highly feasible for other institutions to reuse the workflow for data inclusion and for metadata extraction and organization. All the procedures used for data anonymization, organization, and processing are public. The tools we developed and provide, also public, improve the compliance with FAIR principles and allow researchers of diverse expertise to analyze and share their own clinical data. The workflow is easily extensible to other data modalities, disease models, and organs. For example, using the same workflow, we are currently organizing a Computed Tomography (CT) dataset. Our results from the initial explorations of this data, as well as prospective research, can be replicated by others applying different approaches to the same underlying data, and new data can be used to validate them using the reproducible software we developed for the original dataset. In the near future, we hope our resource and tools will be used for radiomics applications and educational purposes.
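For readers who want to script the conversion and defacing steps with the public software named earlier (dcm2niix, fsl_deface), the sketch below assembles the command lines without running them. The specific options shown (compressed output, an anonymized BIDS sidecar, a file name built from the subject label and protocol name) and all paths are assumptions chosen for illustration; they are not a record of the settings used to build our dataset.

```python
import shlex


def conversion_command(dicom_dir, out_dir, subject_id):
    """Build a dcm2niix call: gzip output, anonymized BIDS sidecar (assumed options)."""
    return [
        "dcm2niix",
        "-z", "y",                      # write compressed .nii.gz files
        "-b", "y",                      # write a BIDS JSON sidecar
        "-ba", "y",                     # anonymize the sidecar
        "-f", f"sub-{subject_id}_%p",   # name files by subject label and protocol
        "-o", out_dir,
        dicom_dir,
    ]


def deface_command(in_image, out_image):
    """Build an fsl_deface call that writes a defaced copy of one image."""
    return ["fsl_deface", in_image, out_image]


# Hypothetical paths; print the commands instead of executing them.
for command in (
    conversion_command("dicom/00001234", "nifti/sub-00001234", "00001234"),
    deface_command("sub-00001234_T1w.nii.gz", "sub-00001234_T1w_defaced.nii.gz"),
):
    print(shlex.join(command))
```

Keeping the commands as argument lists makes it straightforward to pass them to `subprocess.run` once the paths and options have been reviewed for the local environment.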

This project also made us aware of the crucial importance of strengthening the collaboration among clinical, translational, and bioengineering teams. It also revealed the need to partner with experts in bioethical regulations and data services. For example, our librarians guided the archiving of the data in an established repository in the public health community, accessible to researchers worldwide, which removed the workload and cost of data maintenance and hosting from us, the developers. This project also helped improve the workflow for privacy certification for shared patient data. This project's Data Management Plan (DMP) won the 2023 DataWorks! DMP Challenge Prize. Our compliance approach may be valuable to other institutions coping with NIH's expanded data sharing policy.

Adoption of practice by peers

Despite their recent public release, our dataset and tools have already been very well received by the research community. The analytical tools total more than 2,100 downloads. The dataset has almost 200 downloads only 3 months after the successful implementation of the delivery model, even before the key milestone publication (Scientific Data, 2023. DOI 10.1038/s41597-023-02457-9). Both the dataset and analytical tools have been accessed by researchers all around the world, on all continents. Many of these researchers maintain their own private datasets. There is no question of their willingness to become contributors in addition to users. Among many other aspects, external data contributions will reduce the bias in the data and related tools by including under-represented demographics and populations. We hope the DataWorks! Challenge will help us to disseminate our plans and ideas among peers, and to succeed in applications to fund our efforts to expand the dataset and open it to users’ inputs. The DataWorks! Challenge will also be a channel to disseminate training: we are creating a comprehensive suite of video tutorials that walk researchers through the processes of lesion demarcation, BIDS repository creation, image preprocessing, and statistical analysis. All these efforts will be boosted by the visibility of the DataWorks! Challenge.

Our first recommendation for anyone interested in sharing (or shared) data is to collaborate, particularly with people of different expertise, background, and interest. Partner with basic, clinical, engineering, and regulatory branches of your institution and externally. Diversity in both the research team and the population represented is not only desirable but essential. The second recommendation is to focus on problems of universal value that can only be solved by studying large amounts of data. Third, look for simple but universal and comprehensive solutions. Adopt well-established conventions as much as possible, be systematic and clear, and look for existing metadata archives, even if decentralized, that can help you with data mining and standardization. Finally, be ethically and scientifically honest and pragmatic: it would be both naive and arrogant to think that any one group alone is able to explore all the prospects within the wealth of clinical data collected daily. Therefore, sharing data and tools for reproducible research is likely our major contribution to the advancement of science.


Lesion-based studies, some of them more than a century old, constitute the foundation of neuroscience. However, biomedicine has seen a shift from anecdotal experiences to objective, data-supported evidence based on large amounts of data. In addition, AI technical developments still depend on the availability of high-quality, large-scale data. As a group primarily focused on mapping functional-anatomical relationships, as well as on developing methods for patient stratification and personalized medicine, it was clear to us that public repositories and centralized collections, combined with initiatives to establish semantic and analytical consensus, are likely to represent the core structure for future neuroscience research. Because a large, comprehensive public dataset of stroke images and metadata did not yet exist, nor did methods for our group and others to share resources and analyze such data reproducibly, we teamed up with people of diverse expertise to create them:

1) Clinical, here represented by the Johns Hopkins Comprehensive Stroke Center, CSC (Dr. Leigh and Ms. Johnson, CSC co-director), which regularly records the admission and discharge basic clinical profiles of all patients admitted, as part of its participation in the national “Get with the Guidelines” (GWTG) stroke program. This provided a centralized record of all potential participants and eliminated the burden of data mining in medical records and metadata standardization.

2) Bioengineering, here represented by the Radiology Department (Dr. Faria, team leader) and the informatics team (IT), who offered the conditions to automatically transfer, anonymize, and archive the large image sample under a secure firewall. Our group has extensive experience in imaging analysis and in sharing both data and image processing tools. Our collaboration with the School of Biomedical Engineering (BME) and the Neurology Department allowed us to execute and oversee the practical plan to create the dataset and related tools.

3) Data Services, here represented by the Johns Hopkins Sheridan Libraries (Dr. Fearon and Dr. Lawson), who helped implement the innovative workflow for certifying data privacy, mediated by the Medicine Data Trust Committee, providing expertise for disclosure risk review. They also advised on and supported the data deposit at ICPSR. This project's Data Management Plan (DMP) won the 2023 DataWorks! DMP Challenge Prize.

Key principles
Collaborative partnership with the community;
Adoption of standards for image and metadata;
Established on comprehensive and accessible approaches;
Embraces diversity in population and technical aspects;
Democratizes data, analytical tools, and expands usability.
Video: How to learn from this practice
Research disciplines
Neuroscience, Medical Sciences, Bioengineering, Computer Science, Health Care
Supporting Documentation
Include links to relevant and publicly accessible website page(s), up to three relevant works that resulted from using your data sharing or reuse recipe or which were integral to the development of these practices publications, and/or up to three relevant resources.
Supporting documentation 2
Supporting documentation 4 (optional)
Supporting documentation 5 (optional)
Team information - Not Scored
Please respond to these questions related to team participation in the challenge. Your responses to questions in this section will not be scored by judges.
Entity Participation
My team is participating as an independent team
IDeA State Status (not scored)
n/a, I am not participating as part of an entity
Minority Serving Institution (not scored)
n/a, I am not participating as part of an entity
Participation in 2022 DataWorks! Prize
yes, our team captain or majority of our team participated in the 2022 DataWorks! Prize
2022 DataWorks! Prize Team name
Eligibility Requirements
yes, I have read and understand the eligibility requirements

comments (public)