Removing metadata barriers to promote data reuse
Short Description
We are developing methods that address unstructured and missing metadata, two barriers to discovering and reusing public omics data.
Submission Details
Please complete these prompts for your round one submission.
Submission Category
Data reuse
Abstract / Overview

In our project, we are developing new machine learning (ML) methods to assign comprehensive, standardized annotations to nearly 2 million publicly available omics samples to promote their effective reuse by the biomedical community. We are developing:

  1. Methods based on deep-learning language models to infer standardized annotations from plain text descriptions, jointly from multiple omics types;
  2. ML models to predict structured metadata from molecular omics profiles, jointly from multiple species; and
  3. Methods to integrate these text- and omics-profile-based models to annotate nearly 2 million samples and 52k datasets, and develop tools for researchers to reuse this massive resource to glean novel biology.

We are a diverse, collaborative academic research group consisting of one faculty member, one postdoc, two graduate students, one undergraduate researcher, and one programmer. We are part of two departments (Computational Math, Science and Engineering; Biochemistry and Molecular Biology) and one service unit (Data Management and Analytics) at Michigan State University. The faculty member and postdoc take the lead on defining the questions, setting long- and short-term goals, overseeing progress, and providing timely mentoring and feedback. The graduate students work with the undergrad and the postdoc to develop the computational methods, which includes conceptualization, implementation using Python/R, data download/organization/processing, running algorithms on the high-performance cluster, and summarizing and interpreting results. Some of the graduate students and the postdoc also take the lead in reviewing, organizing, documenting, and disseminating the final software and data that enable other researchers to reproduce and extend our work. The programmer takes the lead in developing an interactive web server that biologists can use to run our method and query, visualize, and export our results in a variety of convenient formats.

Potential Impact

Our project, started in 2018, has the overarching goal of assigning comprehensive, standardized annotations to millions of publicly available omics samples to enhance the ability of the whole biomedical community to discover, reuse, and interpret these published data. We have set out to achieve this goal by:

  1. Developing a method combining natural language processing (NLP) and machine learning (ML) to infer standardized sample annotations (e.g., tissue of origin, disease status, etc.) based on their plain text descriptions, and
  2. Developing an ML approach that predicts sample annotations (e.g., sex, age) based only on the molecular data recorded from the samples (e.g., genome-wide gene expression).
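To make the second idea concrete, here is a minimal, hypothetical sketch (not our actual model) of how a simple logistic regression can learn a binary sample attribute from a couple of made-up expression features; the gene values and labels below are purely illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=200):
    # Fit weights and bias by stochastic gradient descent on the log-loss.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj - lr * (p - yi) * xj for wj, xj in zip(w, xi)]
            b -= lr * (p - yi)
    return w, b

def predict_attr(w, b, x):
    # 1 if the predicted probability of the attribute is >= 0.5, else 0.
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)

# Made-up profiles: [gene_A, gene_B] expression; label 1 when gene_B is high.
X = [[0.1, 2.0], [0.2, 1.8], [2.1, 0.1], [1.9, 0.0]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
```

Our actual models operate on genome-wide profiles with tens of thousands of features and standardized ontology labels, but the core supervised-learning setup is the same.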


We have developed the first versions of the two methods and shared them with the community as an open-source repository. The repository contains the pre-trained NLP-ML models and a Python utility called Txt2Onto for annotating new samples based on their text descriptions and for training new custom text-based NLP-ML models on user-defined training data.


A compelling aspect of our sharing/reuse practices is our effort to use our pre-trained models to add tissue annotations to hundreds of thousands of existing omics samples from human and mouse. The annotations map samples to tissue and cell-type terms in the UBERON and Cell Ontology controlled vocabularies. We are now working with a software developer to build Meta2Onto, a queryable web interface where a researcher can search for a sample attribute of interest (tissue, disease, etc.) and get back a comprehensive list of public omics datasets and samples with that attribute.


A practice we are implementing — and recommend all researchers adopt — is the creation of fully reproducible open case studies that document (with data, code, description, and screencast) the application of ML-based methods to specific biomedical problems and datasets. These case studies — along with documentation of the known and potential limitations of the data, models, and predictions — are critical for showcasing these applications and for improving the transparency of, and trust in, data-driven ML research in biomedicine.


As mentioned above, we shared our methods with the community as an open-source repository centered on a Python tool for text-based sample annotation called Txt2Onto. In addition to providing open, well-documented, modular code and pre-trained ML models, we invested effort in building Txt2Onto specifically to enable any researcher to not only replicate our approach but also to easily extend it.

Researchers can use Txt2Onto to do two things:

  1. Provide descriptions of new samples as input to have Txt2Onto perform text preprocessing, create a numerical representation for each sample’s text, and run the representations through our pre-trained ML models to make predictions about each sample’s tissue of origin.
  2. Supply a new training dataset as input to have Txt2Onto train custom NLP-ML models for a new classification task (e.g., predicting sample disease annotations).
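The two modes above can be illustrated with a deliberately simplified sketch. The function names, training descriptions, and the naive Bayes classifier here are all hypothetical stand-ins, not Txt2Onto's actual API or model (Txt2Onto uses richer text representations and pre-trained NLP-ML models):

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_text_model(descriptions, labels):
    # Per-class token counts plus class priors (a tiny naive Bayes model).
    counts = {0: Counter(), 1: Counter()}
    priors = Counter(labels)
    for text, y in zip(descriptions, labels):
        counts[y].update(tokenize(text))
    return counts, priors

def annotate(model, text):
    # Return the class with the higher Laplace-smoothed log-likelihood.
    counts, priors = model
    n = sum(priors.values())
    scores = {}
    for c in (0, 1):
        denom = sum(counts[c].values()) + len(counts[c]) + 1
        score = math.log(priors[c] / n)
        for tok in tokenize(text):
            score += math.log((counts[c][tok] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)

# Mode 2: train a custom model from labeled sample descriptions
# (made-up examples: 1 = liver, 0 = blood).
descs = ["liver tissue biopsy", "adult liver sample", "primary liver tumor",
         "whole blood draw", "peripheral blood sample", "blood plasma"]
labels = [1, 1, 1, 0, 0, 0]
model = train_text_model(descs, labels)

# Mode 1: apply the trained model to annotate a new sample description.
liver_call = annotate(model, "frozen liver section")  # expected: 1 (liver)
```

The point of the sketch is the workflow: labeled descriptions in, trained classifier out, then predictions for unseen sample descriptions.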


We have provided demo scripts and detailed instructions that outline how to use and extend the tool. This is an approach that we recommend to other researchers: invest effort not just in depositing code but in demonstrating clearly how the code can be used for various applications.

Potential for Community Engagement and Outreach

Currently, nearly 2 million omics samples and >52,000 omics datasets are publicly available in repositories like ArrayExpress. The majority of these data come from the six major organisms used in biological research: human, mouse, rat, fly, worm, and fish. Data generated using the 15 most common types of high-throughput experiments account for 92–95% of all omics data.


These ~2 million samples capture large-scale cellular responses of diverse tissues and cell types in human and model organisms under thousands of different conditions, making these published omics data an invaluable resource that researchers can reuse to:

  1. Reanalyze them to generate new hypotheses to answer new questions,
  2. Check the reproducibility of original findings,
  3. Perform integrative-/meta-analysis across multiple studies,
  4. Review earlier studies for support of new data/findings, and
  5. Meet urgent scientific needs (e.g., COVID-19).


However, given how inconsistent and incomplete sample annotations currently are, biologists who wish to use this valuable resource must spend days, weeks, or even months finding and curating relevant published data, time that could otherwise be spent gleaning novel biology. We are working to remove this barrier.


Supporting Information (Optional)
Include links to relevant and publicly accessible website page(s), up to three relevant publications, and/or up to five relevant resources.
