Submission

OpSci Commons

introduction

title

Free and Permanent Archival of Massive Datasets

short description

OpSci Commons is a decentralized protocol for automated archival and peer-to-peer sharing of massive FAIR-compliant datasets.

Submission Details

Please complete these prompts for your round one submission.

Submission Category

Data sharing

Abstract / Overview

OpSci Commons is a data repository providing free archival, indexing, search, and discovery for massive FAIR-compliant neuroimaging datasets. Commons is a permanent home for open access datasets such as NeuroVault and ABIDE. We built this service to give the neuroscience community a free, streamlined way to archive and share large datasets. Commons is a decentralized application (dApp) that executes open source workflows on a distributed network of nodes, which ensures tamper-free consensus, content-addressed data, verified deterministic results, and no single point of failure. Since launch, some archived datasets have disappeared from their original indices, leaving the copies on OpSci Commons as the only ones available.

Team

The OpSci Commons team came together during the 2020 Brainhack, where we tackled the problem of peer-to-peer sharing of neuroimaging datasets using IPFS. Our team is composed of data engineers and neuroscientists. We demonstrated how existing neuroinformatics tools can integrate smoothly with peer-to-peer file sharing protocols as an alternative to centralized web storage platforms. The team is diverse, with individual contributors from India, Indonesia, Bulgaria, Romania, Germany, the United Kingdom, Canada, and the United States.

From the start, our team opened contribution to anyone on the internet. We created a Decentralized Autonomous Organization, OpSci DAO, to coordinate these distributed contributors. The DAO model provides tools for governance, establishes transparency, and incentivizes work in the form of "permissionless" bounties. In many ways, we were the very first Open Data Cooperative, where anyone in the world can contribute to Open Science. Our early model was successful, and it was one of the first of what are now many, more sophisticated Science DAOs. Key to this model is embracing Open Source community practices and using smart contracts to mitigate disputes, reduce reliance on trust, and automate workflows.

Potential Impact

Our team came together to solve a data sharing problem between institutions across continents in December 2020. This work continues to the present day. The goals of the OpSci Commons project are to:
1) automate the archival of 500TB of open access neuroimaging data,
2) provide free permanent storage for verified neuroscientists, and
3) create automated indexing services that map IPFS content identifiers across scientific services (DOI), allowing work published in decentralized science (DeSci) communities to be shared across existing dissemination networks.
As part of our solution, we incentivize FAIR practices by allowing only users with verified ORCID accounts to upload to the data commons for free. Furthermore, datasets must be formatted in a community standard and pass automated validation. Lastly, we have designed our solutions with backward compatibility in mind. Our indexing microservices allow events that occur in the DeSci ecosystem to be published and indexed by existing services for search and discovery, such as DOI, Figshare, or Dryad.
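The upload gate described above can be sketched as follows. This is a minimal illustration, not the Commons implementation: the function names `orcid_is_well_formed`, `looks_like_bids`, and `may_upload` are hypothetical. The ORCID check only confirms that the identifier is well formed (format plus the ISO 7064 MOD 11-2 check character); it does not prove account ownership, which would require an authenticated sign-in. The BIDS check only looks for a `dataset_description.json` carrying the `Name` and `BIDSVersion` fields and is a stand-in for a full validator run.

```python
import json
import re
import tempfile
from pathlib import Path

# ORCID iDs are 16 characters in four hyphen-separated groups; the last may be "X".
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_is_well_formed(orcid: str) -> bool:
    """Check the ORCID iD format and its ISO 7064 MOD 11-2 check character."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:15]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[15] == expected


def looks_like_bids(dataset_dir: Path) -> bool:
    """Hypothetical stand-in for a full BIDS validation (assumption, not the real validator)."""
    description = dataset_dir / "dataset_description.json"
    if not description.is_file():
        return False
    try:
        fields = json.loads(description.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(fields, dict) and "Name" in fields and "BIDSVersion" in fields


def may_upload(orcid: str, dataset_dir) -> bool:
    """Allow an upload only when both the identity check and the format check pass."""
    return orcid_is_well_formed(orcid) and looks_like_bids(Path(dataset_dir))
```

Calling `may_upload("0000-0002-1825-0097", "path/to/dataset")` returns `True` only if the directory contains a readable `dataset_description.json` with both fields; the identifier used here is ORCID's published sample iD.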
From our experience coordinating a DeSci community, we recommend that all researchers 1) create an ORCID iD to track their impact, 2) use versioning tools like DataLad/git-annex for content-addressed data, and 3) adopt community standards for data sharing. Further, we recommend the adoption of Science DAOs for community-driven data governance and coordination.
Our team was the first to demonstrate the large-scale archival of massive datasets on distributed storage networks, providing this service for free. Not only did our solution provide a reliable decentralized archival mechanism, we also demonstrated how to improve knowledge artifact indexing using content identifiers (CIDs) to tackle challenges in data fidelity, provenance, and discoverability. Most significantly, we introduced novel mechanisms for incentivizing FAIR behavior, such as making interoperability and verified accounts a requirement for using community services. Notably, our team did not have the support of an institution or organization. This work was done primarily by an international group of independent scientists, engineers, and students who coordinated using the DAO framework. Lastly, data that we have archived may have disappeared from existing indices but will remain available through our data sharing solution, increasing data extancy.

Replicability

OpSci Commons is architected around principles for replicability. We use the Brain Imaging Data Structure (BIDS) specification, a standard for sharing interoperable MRI, MEG, ECoG, and EEG data in the neuroscience community. The BIDS standard is significant because it means developers can write workflows for data they haven't seen yet and feel confident the code will execute deterministically. Furthermore, applications that run on BIDS datasets can generate proof-of-compute based on a cryptographic digest of the results (i.e., a content identifier, or CID). This means researchers can simply compare the CIDs of their results to determine whether their pipelines generated the same result. Replicability standards are enforced on our platform by allowing researchers to upload datasets only if their submission passes an automated validity check for BIDS compliance. We have expanded on the BIDS spec to include additional metadata fields for accessing data by CID and for assigning authors decentralized identifiers (DIDs) linked to their ORCID. In the future, we expect to expand the number of standards we support, including the Neurodata Without Borders and DANDI dataset structures.
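The CID comparison described above can be illustrated with a short sketch. It is an assumption-laden example rather than our pipeline code: the helper names `raw_block_cid` and `same_result` are hypothetical, and the function builds a CIDv1 for a single raw block hashed with SHA-256. Larger files are chunked by IPFS and receive a different root CID, so a real workflow would ask an IPFS node for the identifier instead of computing it by hand.

```python
import base64
import hashlib


def raw_block_cid(data: bytes) -> str:
    """Build a CIDv1 string (raw codec, sha2-256) for a single block of bytes."""
    digest = hashlib.sha256(data).digest()
    # CIDv1 layout: version 0x01, codec 0x55 (raw),
    # then the multihash: 0x12 (sha2-256), 0x20 (32-byte digest length), digest.
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase prefix "b" marks lowercase base32 without padding.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


def same_result(output_a: bytes, output_b: bytes) -> bool:
    """Two pipeline outputs replicate each other when their CIDs are identical."""
    return raw_block_cid(output_a) == raw_block_cid(output_b)
```

Two researchers who run the same deterministic workflow on the same dataset can exchange only the resulting identifier strings; a mismatch signals that the outputs differ by at least one byte.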
Our architecture is completely open source and available for anyone to fork and run as their own community archival repository. This approach can be replicated by following the documentation on our GitHub.

Potential for Community Engagement and Outreach

OpSci is a pioneer in designing decentralized autonomous organizations (DAOs) for science. DAOs make it possible for anyone, anywhere, to get involved by completing bounties and tasks and by solving outstanding problems in a science community. Think of Science DAOs as digital community spaces for coordination and collaboration around research problems. Despite having no direct institutional or academic support, our core group of independent researchers, developers, and open science activists has come together with more than 1,000 others to build community-owned decentralized protocols that advance science through increased inclusion, access to resources, and transparent governance. The DAO model has potentially significant consequences for how we, as a globally distributed community of scientists, can coordinate around challenging problems. The most important message we have for other researchers is that impactful science is a collaborative art and that significant discoveries are most likely to occur in communities where access to, and governance over, resources is open.

Supporting Information (Optional)

Include links to relevant and publicly accessible website page(s), up to three relevant publications, and/or up to five relevant resources.

Supporting Documentation 01

https://pulse.opsci.io/rich-in-data-poor-in-wisdom-science-needs-a-decentralized-data-commons-98c7ffdb56a1

Supporting Documentation 02

https://www.youtube.com/watch?v=wlmzDswLlsA

Supporting Documentation 03

https://commons.opsci.io

Supporting Documentation 04

https://verse.opsci.io

Supporting Documentation 05

https://ipfs.io/ipfs/bafkreieuf5boryzq423q65xcpsumhw3us4kob4uwpvy6yjb6hsdqkk4hb4
