Submission

introduction
title
Protein Designs for All (PDA)
short description
PDA: 40 years of designed proteins, uncovering trends, showcasing deep learning's influence, and opening new bioengineering doors.
Phase 1 Submission Form
Overview / Abstract

The field of protein design has evolved significantly over the last four decades, transitioning from rational design to data-driven approaches. The 2024 Nobel Prize in Chemistry, awarded to Dr. Baker for computational protein design and to Drs. Hassabis and Jumper for protein structure prediction with AlphaFold, underscores the transformative impact of protein structure prediction and design on the biological and medical sciences.

To overcome the lack of a centralized resource for designed proteins, we present the Protein Design Archive (PDA), a database and web application of structurally characterized designed proteins. Analysis of the PDA reveals growth in design complexity and biases in amino acid usage and secondary structure. The PDA aims to guide future protein design strategies by enabling data-driven insights.

The PDA is freely available at https://pragmaticproteindesign.bio.ed.ac.uk/pda/.

Secondary Analysis: Research Aims

This project utilizes two primary data sources:

The Protein Design Archive (PDA): This database contains information on de novo designed proteins. The data includes PDB codes, release dates, classifications, amino acid sequences, structural information, and calculated properties, including related proteins identified by sequence or structure similarity. The source location is the PDA website (https://pragmaticproteindesign.bio.ed.ac.uk/pda/) and GitHub repository (https://github.com/wells-wood-research/chronowska-stam-wood-2024-protein-design-archive). The PDA contains 1,472 designed protein entries as of October 2024 and it continues to grow. The data types include categorical (e.g., tags labeling entries), text (e.g., PDB codes, sequences), and numerical (e.g., release date, mass, and the bit scores and LDDT (Local Distance Difference Test) values that describe the degree of similarity between proteins).

This project analyzes designed proteins from the RCSB PDB (https://www.rcsb.org/) using bioinformatics and statistics. Data is manually curated, and analyses include comparisons of amino acid and secondary structure proportions, property calculations, and sequence/structure similarity searches.
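The amino acid comparison mentioned above can be illustrated with a minimal sketch. This is not the PDA's own analysis code: the function names are hypothetical, and the example assumes sequences are plain one-letter strings restricted to the 20 standard residues (anything else is ignored).

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_proportions(sequences):
    """Return the fraction of each standard residue across all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore non-standard symbols (e.g. X, gaps) rather than guess at them.
        counts.update(res for res in seq.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def proportion_differences(designed, reference):
    """Per-residue difference: designed proportion minus reference proportion."""
    d = amino_acid_proportions(designed)
    r = amino_acid_proportions(reference)
    return {aa: d[aa] - r[aa] for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    designed = ["AEELLKKA", "EELLKKLA"]
    reference = ["MGSTPNDV", "ACDEFGHI"]
    diffs = proportion_differences(designed, reference)
    for aa, delta in sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)[:3]:
        print(f"{aa}: {delta:+.3f}")
```

A positive difference means the residue is over-represented in the designed set relative to the reference set. Grouping the input sequences by release year would give the same comparison over time.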

 

  • Data retrieval and processing: Automated queries retrieve data from the RCSB PDB, followed by manual curation. 
  • Comparative analysis: Amino acid and secondary structure proportions are compared between designed proteins and the wider PDB over time.
  • Property analysis: DE-STRESS software calculates properties (mass, packing density, etc.) for designed and PDB proteins, and trends in these properties are analyzed.
  • Sequence/structure similarity: MMseqs2 and Foldseek are used to analyze sequence and structure similarity, comparing designed proteins to themselves and to natural proteins.
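As a rough illustration of the similarity step, the sketch below parses tab-separated search results and keeps the best-scoring hit per designed protein. It assumes the default 12-column output of the `easy-search` workflow in MMseqs2 and Foldseek (query, target, identity, ..., bit score in the last column); the file names in the example commands are hypothetical, and the PDA pipeline itself may use different options or output columns.

```python
from typing import Dict, List, NamedTuple

# Illustrative invocations only; all file and directory names are made up.
MMSEQS_CMD = ["mmseqs", "easy-search", "designs.fasta", "reference.fasta", "hits.m8", "tmp"]
FOLDSEEK_CMD = ["foldseek", "easy-search", "designs_pdb/", "reference_pdb/", "hits_3d.m8", "tmp"]


class Hit(NamedTuple):
    query: str
    target: str
    identity: float
    bit_score: float


def parse_m8(text: str) -> List[Hit]:
    """Parse tab-separated hits, assuming the default 12-column layout."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 12:
            raise ValueError(f"expected at least 12 columns, got {len(cols)}")
        # Column 3 is the sequence identity, column 12 the bit score.
        hits.append(Hit(cols[0], cols[1], float(cols[2]), float(cols[11])))
    return hits


def best_hit_per_query(hits: List[Hit], skip_self: bool = True) -> Dict[str, Hit]:
    """Keep the highest bit-score hit for each query, optionally ignoring self-hits."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if skip_self and hit.query == hit.target:
            continue
        current = best.get(hit.query)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query] = hit
    return best
```

Skipping self-hits matters when designed proteins are searched against themselves, since every entry otherwise matches itself with the top score.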

Proposed timeline:

  • Monthly, from September 2024 indefinitely: Update of the dataset by scraping data for newly released designs, curation of the dataset to contain only designed proteins.
  • Quarterly, from September 2024 to September 2027: Release new features to improve the database, including a help page, AI-powered search, a discussion forum, options for user feedback, and the ability for users to vote on data.
  • January 2025 - February 2025: Automation of the data collection, curation, scraping, and processing.
  • March 2025 - April 2025: Analysis of the PDA dataset to classify all entries based on method of development, and automation of classification in the future using AI.
  • May 2025 - June 2025: Analysis of the PDA dataset to understand protein functions, identify active sites, and assess catalytic properties.

Data standardization and merging: Data published on the PDA will be standardized using a generalized format and merged into a single repository on Zenodo. Zenodo provides a robust and reliable platform for data sharing and archiving, and it adheres to the FAIR (Findable, Accessible, Interoperable, Reusable) standards, which makes the combined dataset easier to access and analyze. The dataset published on the PDA is available for download in CSV and JSON file formats.
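To make the dual CSV/JSON export concrete, here is a minimal sketch of writing the same records in both formats with the Python standard library. The field names and the example record are hypothetical placeholders, not the PDA's actual schema, and joining list-valued tags with ";" is one possible convention for a flat CSV.

```python
import csv
import io
import json

# Hypothetical field names; the real PDA export may use different columns.
FIELDS = ["pdb_code", "release_date", "sequence", "tags"]


def to_json(records):
    """Serialize records as JSON, keeping list-valued fields as lists."""
    return json.dumps(records, indent=2, sort_keys=True)


def to_csv(records):
    """Serialize records as CSV, flattening the tag list into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        # Keep only the known fields so every row has the same columns.
        row = {field: rec.get(field, "") for field in FIELDS}
        row["tags"] = ";".join(rec.get("tags", []))
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up placeholder entry, not a real PDB code.
    records = [
        {
            "pdb_code": "0XXX",
            "release_date": "2024-10-01",
            "sequence": "MKELLK",
            "tags": ["alpha", "designed"],
        }
    ]
    print(to_csv(records))
    print(to_json(records))
```

Using ISO 8601 dates and a fixed column order keeps the two exports consistent with each other and easy to merge with other resources.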

GREI Repository Data Sets
Zenodo (CERN and Northwestern University)
DOI (Digital Object identifier) of GREI Repository Dataset
https://zenodo.org/records/13928951
Outcomes and Outputs

We will continue to expand the PDA database with newly designed proteins that are structurally characterised in labs around the globe.

 

  • Comprehensive Protein Design Database: The primary outcome is the Protein Design Archive (PDA) itself. This database offers a centralized, curated repository of de novo designed proteins, capturing a wealth of information about each design.
  • Insights into Design Trends: The project aims to uncover trends and biases in protein design over time. 
  • Evaluation of Design Novelty: The project seeks to assess the novelty of designed proteins by comparing them to natural proteins.
  • Community Resource: The PDA serves as a valuable resource for the protein design and structural biology communities.

How will we share and disseminate the data?

  • Open-Access Database: The PDA is freely accessible online (https://pragmaticproteindesign.bio.ed.ac.uk/pda/) without any registration requirements. This ensures broad dissemination and encourages use by the research community.
  • GitHub Repository: All data and code associated with the project are available on GitHub (https://github.com/wells-wood-research/prot-des-timeline). This promotes transparency and allows researchers to reproduce the analysis and build upon the project's findings.
  • Publications and Presentations: The project team will disseminate the research findings through peer-reviewed publications and presentations at scientific conferences. This will ensure wider visibility and contribute to the scientific literature.
  • Zenodo Repository: The curated dataset will be made available on Zenodo, a general-purpose open-access repository. This will further enhance data accessibility and long-term preservation.

FAIR and CARE Principles

  • Findable: The PDA is easily findable through its dedicated website and the associated GitHub repository. The use of standardized identifiers (PDB codes) and rich metadata further enhances findability.
  • Accessible: The database is open-access, requiring no registration for users to browse and download data. The data is available in multiple formats (e.g., CSV, JSON).
  • Interoperable: The use of standardized data formats and the integration with the PDB and Zenodo promote interoperability, allowing the data to be easily combined with other resources.
  • Reusable: The provision of comprehensive metadata, clear data descriptions, and open-source code facilitates data reuse and supports future research in protein design.
  • CARE Principles: While not explicitly addressed in the text, the project aligns with the CARE principles for Indigenous data governance. 

Replicability and Reproducibility

  • Detailed Methodology: The text provides a comprehensive description of the data collection, curation, and analysis methods. This level of detail enables other researchers to replicate the analysis and verify the findings.
  • Open-Source Code: All code used for data processing and analysis is publicly available on GitHub. This allows for scrutiny and reuse of the code, ensuring reproducibility.
Impact / Scientific Significance

The following is a breakdown of the potential impact of the Protein Design Archive (PDA) project on human health and its contributions to relevant scientific disciplines:

Contributions to Scientific Disciplines

  • Protein Design: The PDA directly contributes to the field of protein design by offering a centralized resource for researchers to explore existing designs, analyze trends, and identify areas for future innovation. This can accelerate the development of new protein design methods and applications.
  • Structural Biology: The PDA provides valuable insights into the relationship between protein sequence, structure, and function. By analyzing the characteristics of designed proteins, researchers can gain a deeper understanding of protein folding principles and the determinants of protein stability.
  • Bioinformatics: The project showcases the power of bioinformatics approaches in curating, analyzing, and visualizing large-scale protein structure data. The methods and tools developed for the PDA can be applied to other areas of bioinformatics research.
  • Data Science: The PDA contributes to the field of data science by demonstrating the importance of data sharing, standardization, and FAIR principles in facilitating scientific discovery and collaboration.

Impact on Diagnosis, Treatment, and/or Prevention

While the PDA is primarily a research tool, it has the potential to indirectly impact human health in several ways:

  • Drug Discovery: The PDA can aid in the design of new proteins with therapeutic applications, such as:
    • Targeted therapies: Designing proteins that can specifically bind to and neutralize disease-causing agents (e.g., viruses, bacteria, toxins).
    • Drug delivery: Creating proteins that can efficiently deliver drugs to specific cells or tissues.
    • Biologics: Developing protein-based drugs with improved efficacy and safety profiles.
  • Diagnostics: The PDA can support the development of protein-based diagnostic tools, such as biosensors for detecting disease biomarkers or pathogens.
  • Vaccine Development: The design of novel proteins can aid in the development of more effective and safer vaccines.
  • Biomaterials: The PDA can contribute to the design of protein-based biomaterials for applications in tissue engineering, regenerative medicine, and drug delivery.

By facilitating protein design research and promoting data sharing, the PDA has the potential to accelerate the development of new technologies with significant implications for human health. It's important to note that the PDA's impact on diagnosis, treatment, and prevention will be realized indirectly through the research it enables and the innovations it inspires.

 

Team

This project is a collaboration between UK and USA/Italian researchers with expertise in protein design, structural biology, and bioinformatics. The team includes:

  • Marta Chronowska: Doctoral student at The University of Edinburgh focused on advancing protein design using computational approaches. She contributes to data analysis and web app development for the PDA.
  • Michael J. Stam, PhD: Postdoctoral researcher at the University of Edinburgh specializing in machine learning for protein design. His background includes applying machine learning in finance to analyze customer behavior.
  • Dek Woolfson, PhD: Professor at the University of Bristol, directing the Bristol BioDesign Institute with a focus on protein design and synthetic biology. He is a recipient of the Royal Society of Chemistry Interdisciplinary Prize and a Humboldt Research Award.
  • Luigi F. Di Costanzo, PhD: Structural chemist with over 20 years of experience in structural biology of designed proteins and a decade of contributions to biocuration for wwPDB.
  • Christopher W. Wood, PhD: Senior biotech lecturer at the University of Edinburgh, developing software to simplify protein design using bioinformatics and machine learning.
Considerations

The success of the Protein Design Archive (PDA) project relies on key factors:

  • Team Expertise: Our team's extensive experience in structural chemistry and protein design is essential for accurate data curation, insightful analysis, and effective communication of research findings.
  • Data Quality: Maintaining high standards for data curation involves careful selection of entries and accurate data extraction. We will prioritize enriching the data and keeping the database up-to-date.
  • User Experience: The PDA website will be designed for user-friendliness, with clear organization and effective search functionality. We will ensure data accessibility and actively seek community feedback.
  • Technical Infrastructure: The database will be scalable to handle increasing data and traffic.
  • Community Engagement: Collaborating with researchers and promoting the PDA will increase its visibility and adoption.
  • Sustainability: Securing funding is crucial for the long-term maintenance and development of the PDA.

 

Supporting Documents
Provide up to 10 resources for the evaluation of your secondary research project including but not limited to:
  • The persistent identifier of the dataset(s), other than GREI dataset DOIs already listed above, to be used in the proposed project (where available)
  • Tools/workflows or resources to be utilized in the proposed project
  • Relevant references or scientific publications that directly relate to the proposed project
Non Scored Criteria
Please complete this information. It will not be scored by the evaluation panel.
Entity Participation
Participate as an independent Team (i.e., registering as a group of individuals competing together but not on behalf of an established organization, institution, or corporation)
Research Discipline (non-scored criteria)
Bioinformatics, Structural Biology, Protein Design/Engineering, Computational Biology, Synthetic Biology
IDeA State (non-scored criteria)
Yes
All Team Member Information - Name, Organization, Job Title, and Email address
Marta Chronowska
Organization: University of Edinburgh
Job Title: Doctoral Student
Email: M.Chronowska@sms.ed.ac.uk
Michael J. Stam, PhD
Organization: University of Edinburgh
Job Title: Postdoctoral Research Associate
Email: michael.stam@ed.ac.uk
Dek Woolfson, PhD
Organization: University of Bristol
Job Title: Professor, Director of Bristol BioDesign Institute
Email: D.N.Woolfson@bristol.ac.uk
Luigi F. Di Costanzo, PhD
Organization: University of Naples Federico II
Job Title: Associate Professor of Chemistry
Email: luigi.dicostanzo4@unina.it
Christopher W. Wood, PhD
Organization: University of Edinburgh
Job Title: Senior Biotech Lecturer
Email: chris.wood@ed.ac.uk
MSI (non-scored criteria)
Yes
Participation in prior DataWorks! Prizes (non-scored criteria)
No
Team Point of Contact Eligibility
yes
Eligibility (non-scored criteria)
Yes, I confirm that I have read and meet the terms of eligibility for this challenge