VEuPathDB is dedicated to the integration of diverse Omics data, advancing scientific discovery. Our core goal is to enable the global research community to effectively reuse, interrogate and collaborate on data long after the initial studies are complete. To this end, we partner with researchers to identify & prioritize new datasets, datatypes and features. We use semi-automated workflows for ingestion, cleaning, processing, integration, analysis, and validation of data. A sophisticated Search Strategy system allows scientists to conduct in silico experiments that can be as simple as reinterrogating an individual dataset, or involve complex multi-step strategies combining results from different datatypes across many species.
VEuPathDB includes ~60 staff scientists, programmers, software developers, bioinformaticians, systems & database admins, data type specialists, ontologists, and outreach specialists at multiple institutions, in multiple countries. Team members are diverse in their career stage, path, gender, and ethnic backgrounds, but work together on a daily basis as a single team, to ensure efficient operations and overall quality and growth aligned with the needs of the research communities we serve. The DataWorks team representing this project includes:
The Eukaryotic Pathogen, Vector and Host Informatics Resource (VEuPathDB.org) is an NIAID-funded bioinformatics resource center (BRC), with additional support from other sources. VEuPathDB includes multiple taxon-specific databases (PlasmoDB, FungiDB, VectorBase, etc) constructed using a common infrastructure to provide scientists with customized access to their organisms of interest. Data are integrated in partnership with the scientific community, providing data producers with the opportunity to QC results prior to public release, positioning VEuPathDB as an honest broker, and providing the global research community with reliable, effective access to high quality datasets.
VEuPathDB integrates diverse datatypes from eukaryotic pathogens, along with their vectors & hosts, using standardized analysis pipelines. We empower users to ask their own questions, using intuitive and configurable searches and a graphical Search Strategy system, enabling scientists to effectively interrogate the underlying data, combining results from diverse datatypes, and leveraging orthology to make cross-species inferences. This project has been successfully funded since 1998, and as a production resource since the inception of the BRC program in 2004, enabling expansion from supporting a handful of organisms and datatypes to its current iteration supporting hundreds of diverse eukaryotic pathogens (protists and fungi), arthropod vectors of disease, and host species. The resource incorporates many datatypes, including some that lack traditional repositories, such as genome-wide phenotypic data.
A critical aspect of effective data reuse is to go beyond simply replicating the main results from original publications to enable scientists to effectively search the data and make new discoveries. This philosophy has been adopted by the scientific communities we support. Extensive outreach activities and reuse of dataset in VEuPathDB makes it clear that effective integration for reuse dramatically increases data ‘shelf life’, stretching the dollar-value of the originally funded studies by many fold. Website access statistics show that VEuPathDB resources receive >40K unique visitors per month, from >200 countries, and cumulative citations of VEuPathDB and its component databases exceeds 26K (>3,500 Google Scholar citations in 2022 alone).
Data integration and harmonization has been a focus since the inception of this project. Integration of data sets means transforming them into an information-revealing format, placing them in a common system, and cross-referencing to relate information across datasets. Harmonization means that similar information across datasets can be related and discovered, typically using formal ontologies and controlled vocabularies. The VEuPathDB system was built to address these principles via standard terminologies to provide consistency across datasets within the BRC and across resources (e.g. other NIAID data repositories). We primarily use the Open Biomedical Ontology Foundry, providing interoperable ontologies with wide usage and coverage. Datasets are loaded using a simplified version of the ISA-Tab format (ISA-Simple), that includes usage of ontology terms to describe experimental conditions and methods, sample details, and study subject data. In addition to data capture through ISA-Simple files, ontology terms are used as part of functional annotation (e.g. Gene Ontology terms) and structural annotation (Sequence Ontology terms). All ontology terms are stored in Genomics Unified Schema (GUS) databases. VEuPathDB infrastructure also takes advantage of the powerful Ensembl bioinformatics pipelines combined with the robust and proven GUS, supporting the back-end database and the highly flexible Web Development Kit.
VEuPathDB is dedicated to understanding and meeting the needs of its diverse research communities. The outreach team is recognized as a leader (and has served as a model for others) in building effective community ties through liaising with data providers, database users, and developers to ensure that community needs are accurately identified and addressed. Our client audience is generationally diverse and globally distributed, necessitating multiple methods for information dissemination. Outreach team duties include: user interactions to maintain awareness of emerging datasets & data types; promotion of the benefits of data integration & reuse; education/promotion of VEuPathDB as a force-multiplier for research & hypothesis generation; developing & maintaining data/metadata standards supporting reuse; understanding end-user concerns and soliciting feedback to improve, including development of new features; help desk & social media presence (including a YouTube channel); workshops, webinars, and participation in scientific meetings. From interacting with our community, we have learned that data reuse must be easily achievable and sophisticated enough to meet their scientific needs.