Submission

introduction
title
ML for NMR Shift Prediction and Structure Analysis
Phase 2 Submission Questions
Please complete these questions for Phase 2 participation.
Team Name

Pengyu Hong's team

Team Member Updates (non-scored criteria)

No change in team member

GREI Repository Datasets
Dataverse
Repository dataset DOI's
Overview/Abstract

This project investigates whether incorporating 3D conformational information can improve machine learning-based NMR chemical shift predictions. During NMR measurements, molecules continuously rotate and vibrate through multiple conformational states. However, most experimental NMR databases contain only 2D structures without 3D conformational data. GEOM provides quantum-optimized conformers that address this gap. We develop a new ML model for NMR prediction that integrates the 2D topological information of molecules with the 3D geometric information of their conformers. Predictions are made for each conformer and combined via Boltzmann-weighted averaging to derive the prediction for each molecule. Results show that the proposed ensemble approach improves prediction accuracy by capturing conformational variation. For drug discovery, improved NMR predictions would accelerate molecular structure elucidation, enable faster candidate identification, and reduce reliance on expensive NMR experiments.

Secondary Analysis

This study integrates data from two primary sources. The GEOM dataset, in the GREI repository, provides approximately 300,000 molecules with quantum-mechanically optimized 3D conformer ensembles, where each molecule contains multiple Boltzmann-weighted conformations representing thermally accessible geometries. The second dataset contains molecules with NMR annotations and comes from our team's published paper, "TransPeakNet for solvent-aware 2D NMR prediction via multi-task pre-training and unsupervised learning" (https://doi.org/10.1038/s42004-025-01455-9). For convenience, the rest of this report refers to the dataset used in that paper as the TransPeakNet dataset; it is publicly available at https://github.com/siriusxiao62/2dNMR. This experimental NMR database comprises 33,000 molecules with SMILES representations and atom-level chemical shift measurements across three spectroscopic modalities: carbon-13 NMR, proton NMR, and two-dimensional Heteronuclear Single Quantum Coherence (HSQC).

The motivation for incorporating the GEOM data stems from a fundamental limitation of our experimental NMR database: it contains only SMILES string representations and measured chemical shifts, lacking three-dimensional conformer information. While computational tools like RDKit can calculate 3D structures through force field minimization, the results are considerably less accurate than those optimized with quantum mechanical methods. GEOM addresses this critical gap by providing DFT-minimized molecular conformations with realistic bond lengths, angles, and torsional geometries that capture subtle electronic effects influencing NMR chemical shifts. Thus, GEOM serves as an essential complement that enables us to augment our purely spectroscopic dataset with the accurate three-dimensional structural information necessary to test whether geometric features improve prediction performance.

The integration of the GEOM and TransPeakNet datasets occurs in two stages: (1) self-supervised pretraining on GEOM molecules that are similarity-matched to the TransPeakNet dataset, and (2) supervised fine-tuning on directly matched molecules. Direct canonical SMILES comparison between the GEOM and TransPeakNet datasets identifies ~2,400 molecules with carbon-13 labels (among them, ~1,000 have proton NMR shifts and ~1,000 have HSQC correlations). To address this limited overlap and leverage GEOM's extensive conformer database, we designed a two-stage training strategy. The remaining molecules with NMR data present a valuable opportunity: although they lack direct matches in GEOM, we can identify chemically similar GEOM molecules for pretraining. Fingerprint-based Tanimoto similarity is used to search for GEOM molecules that are structurally related to molecules in the TransPeakNet dataset. This similarity-based approach identified ~5,000 GEOM molecules chemically related to our dataset, providing unlabeled 3D structures for self-supervised pretraining. The pretraining stage enables learning generalizable 3D molecular representations from unlabeled GEOM conformers without requiring NMR measurements, helps develop robust geometric feature extractors, and mitigates overfitting on the relatively small dataset of matched molecules with NMR shift annotations.
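The similarity search described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: fingerprints are represented as plain sets of "on" bit indices standing in for Morgan fingerprints, molecules are keyed by strings assumed to be canonical SMILES already, and the function names and the 0.5 threshold are hypothetical choices made for the example.

```python
from typing import Dict, FrozenSet, Set

# Stand-in for a Morgan fingerprint: the indices of its "on" bits (assumption).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_pretraining_ids(
    nmr: Dict[str, Fingerprint],
    geom: Dict[str, Fingerprint],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the report
) -> Set[str]:
    """Return GEOM keys that are not exact matches to any NMR molecule
    but are similar to at least one of them."""
    selected: Set[str] = set()
    for geom_id, geom_fp in geom.items():
        if geom_id in nmr:
            continue  # directly matched molecules are kept for fine-tuning
        if any(tanimoto(geom_fp, fp) >= threshold for fp in nmr.values()):
            selected.add(geom_id)
    return selected


# Toy data: keys play the role of canonical SMILES, values are made-up bit sets.
nmr_set = {
    "CCO": frozenset({1, 2, 3, 4}),
    "c1ccccc1": frozenset({10, 11, 12}),
}
geom_set = {
    "CCO": frozenset({1, 2, 3, 4}),             # exact match, skipped
    "CCCO": frozenset({1, 2, 3, 5}),            # Tanimoto 3/5 with "CCO"
    "CCN": frozenset({1, 2, 7, 8}),             # Tanimoto 2/6 with "CCO"
    "Cc1ccccc1": frozenset({10, 11, 12, 13}),   # Tanimoto 3/4 with benzene
}
pretrain_ids = select_pretraining_ids(nmr_set, geom_set)
```

In this toy run, only the two molecules above the cutoff that lack an exact match are selected for pretraining; the exactly matched key is left for the supervised stage.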

The second stage uses an adaptive graph neural network architecture to integrate topological and 3D geometric information of molecules for predicting their atomic NMR shifts. The architecture contains two parallel pathways for learning chemical shift patterns: (a) a 2D pathway (our TransPeakNet model) that learns to capture the characteristics of molecular topology, and (b) a 3D pathway that processes the 3D atomic coordinates provided in GEOM through distance encoding, directional geometric modeling, and rotationally equivariant layers. The 3D pathway contains three 3D GNNs: SchNet (Schütt et al., 2017 (5)), ComENet (Wang et al., 2022 (6)), and EGNN (Satorras et al., 2021 (7)). The parameters of the 3D pathway are pretrained in the first stage. The hidden representations learned by the three 3D models are fused and then adaptively mixed with those learned by the 2D model through a trainable gating mechanism that computes per-atom mixing weights, allowing the model to discover which chemical environments benefit most from explicit 3D geometric information. The role of GEOM conformers extends beyond providing 3D geometric information. Each GEOM molecule is associated with a set of optimized conformations whose Boltzmann weights represent conformational populations. During prediction, all available conformers are processed independently, and the resulting predictions are combined through weighted averaging. This ensemble approach captures the experimental reality that NMR measurements of a molecule represent population-weighted averages over interconverting conformations rather than a single fixed geometry.
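The two combination steps described above, per-atom gating and conformer averaging, can be illustrated with a small sketch. It assumes a sigmoid gate that forms a convex mix of the 2D and 3D representations; the report does not specify the exact form of the gate, and in the real model the gate value would come from a trainable layer rather than being passed in. All function names here are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_mix(h2d: Sequence[float], h3d: Sequence[float], gate_logit: float) -> List[float]:
    """Mix one atom's 2D and 3D feature vectors: g * h3d + (1 - g) * h2d,
    where g = sigmoid(gate_logit) is that atom's mixing weight (assumed form)."""
    g = sigmoid(gate_logit)
    return [g * b + (1.0 - g) * a for a, b in zip(h2d, h3d)]


def boltzmann_average(
    per_conformer_shifts: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Weighted average of per-atom shift predictions over conformers.
    per_conformer_shifts[c][i] is the predicted shift of atom i in conformer c;
    weights[c] is the Boltzmann weight of conformer c (normalized here)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("conformer weights must sum to a positive value")
    n_atoms = len(per_conformer_shifts[0])
    return [
        sum(w * shifts[i] for w, shifts in zip(weights, per_conformer_shifts)) / total
        for i in range(n_atoms)
    ]


# A gate logit of 0 gives g = 0.5, i.e. an equal mix of both pathways.
mixed = gated_mix([1.0, 3.0], [3.0, 5.0], gate_logit=0.0)

# Two conformers with made-up weights 0.75 and 0.25 and made-up shifts in ppm.
averaged = boltzmann_average([[120.0, 30.0], [124.0, 26.0]], [0.75, 0.25])
```

With these toy inputs, `mixed` is `[2.0, 4.0]` and `averaged` is `[121.0, 29.0]`: the higher-weight conformer dominates the molecule-level prediction.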

Outcomes and Outputs

Our results reveal important insights about when 3D conformational information provides value. For carbon-13 NMR prediction, incorporating GEOM's 3D conformer ensembles yields substantial improvements: (a) the 2D baseline model achieves a mean absolute error (MAE) of 2.20 ppm, (b) integrating a single 3D conformer reduces the MAE to 2.0 ppm, and (c) ensembling multiple 3D conformers further reduces the MAE to 1.8 ppm, an improvement of nearly 20% over the 2D baseline. This demonstrates that carbon chemical shifts are highly sensitive to conformational information and can benefit from the 3D geometric information captured by the combined SchNet-ComENet-EGNN module. In contrast, proton NMR and 2D HSQC predictions show only marginal improvements of approximately 5%. We specifically tested whether the SchNet, ComENet, and EGNN geometric encoders would benefit HSQC predictions, given that HSQC correlates carbon and proton shifts through scalar coupling. Our results suggest that the 2D baseline models already capture the dominant factors governing proton shifts and the through-bond coupling patterns. This differential performance provides valuable scientific insight: conformational information matters most for carbon-13, where through-space electronic effects and steric environments significantly modulate chemical shifts, while proton-based predictions depend more on local topological patterns.

Atom-level analysis of the 30 most improved predictions reveals systematic patterns in which chemical environments benefit from 3D conformational information. Approximately 80% of the most improved atoms exhibit sp2 hybridization, particularly in conjugated aromatic systems where planarity and torsional angles critically affect π-electron distribution. The remaining sp3 atoms that benefited showed high conformational flexibility or steric crowding adjacent to multiple heteroatoms. Heteroaromatic systems dominate the most improved atoms, with nitrogen and oxygen atoms in five-membered rings (thiazoles, isoxazoles, thiadiazoles) showing large improvements.

These patterns indicate that the model learns chemically meaningful distinctions: atoms in environments where through-space electronic effects, conjugation geometry, and conformational averaging dominate receive preferential weighting of 3D features, while atoms in simple aliphatic or rigid symmetric environments rely more on 2D connectivity. This atom-level analysis suggests that the improvements are physically grounded rather than incidental.

The ML-based methodology developed in this project can be applied to the prediction of other molecular properties that may benefit from conformational information. The project used established standards including canonical SMILES, Morgan fingerprints, and DFT-optimized geometries from GEOM. Primary resources were the publicly available GEOM repository and our peer-reviewed NMR database. Tools included PyTorch Geometric for graph neural networks, RDKit for molecular processing, and standard Python scientific libraries. All data sources are publicly accessible with proper citations.

Replicability is addressed through detailed documentation of data sources, explicit similarity search parameters, and standardized model architectures built on publicly available ML models (SchNet, ComENet, EGNN). Reproducibility is supported by fixed random seeds and publicly available datasets. All data processing, training, and testing code, together with evaluation reports, is available in our public GitHub repository: https://github.com/whr812756608/3D-conformers-enhanced-machine-learning-prediction-on-NMR-spectroscopy
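As an illustration of what fixing random seeds typically involves in a Python training script, a minimal sketch is given below. The helper name `set_seed` and the seed value are hypothetical and are not taken from the project repository; NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random


def set_seed(seed: int = 0) -> None:
    """Seed the random number generators commonly used in a training run."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's legacy global generator
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch's default generator
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random draws.
set_seed(42)
first_draw = random.random()
set_seed(42)
second_draw = random.random()
```

Seeding alone does not guarantee bit-identical results across hardware or library versions, so recording the software environment alongside the seed is a common companion practice.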

Impact

This research advances computational chemistry and machine learning methodologies for atomic NMR shift prediction. The demonstration that 3D conformer ensemble averaging improves NMR prediction accuracy establishes conformational dynamics as a critical factor for quantum property modeling, challenging the adequacy of the single-structure approaches prevalent in current methods. The two-stage modeling strategy maximizes utilization of GEOM's structural data while minimizing dependence on expensive labeled NMR measurements, demonstrating a practical approach for incorporating high-quality 3D coordinates into NMR shift prediction despite the limited number of molecules with both structural and spectroscopic information. This framework addresses a common bottleneck in computational chemistry: structural and property data exist in separate repositories with limited overlap. The adaptive architecture, which learns context-dependent weighting of geometric versus topological features, contributes to interpretable machine learning by revealing which molecular environments require explicit 3D information. Furthermore, the work validates GEOM (and, more generally, 3D conformer information) as a foundational resource for geometry-dependent property prediction beyond its original scope.

The impact on human health operates primarily through drug discovery and development pipelines. Improved NMR prediction accelerates molecular structure elucidation, which is critical for identifying and characterizing potential therapeutic candidates. When medicinal chemists synthesize new compounds or isolate natural products with potential bioactivity, accurate structure determination is essential before biological testing can proceed. Faster, more reliable computational NMR prediction reduces dependence on expensive and time-consuming experimental characterization, enabling higher-throughput screening of chemical libraries and more rapid iteration in lead optimization. For natural product drug discovery, where complex molecules from plants, marine organisms, or microbes often possess unique therapeutic properties, improved structure elucidation tools facilitate identification of bioactive compounds. The methodology also supports pharmaceutical quality control by enabling computational verification of synthesized compound structures. While not directly impacting patient diagnosis or treatment protocols, these contributions to the drug discovery pipeline indirectly support development of new therapeutics for various diseases by reducing time and cost barriers in the early stages of pharmaceutical research.

Considerations

The proposed project was successfully completed within the award period. All major objectives, including data integration, model development, and evaluation, were accomplished. During the implementation phase, we refined the project scope to focus more specifically on NMR chemical shift prediction rather than broader molecular property prediction. The primary constraint encountered was the limited overlap between our experimental NMR dataset and the GEOM conformer repository. Direct canonical SMILES matching identified only 2,400 molecules with carbon-13 NMR labels, 1,000 with proton labels, and 1,000 with HSQC correlations present in both datasets, a small fraction of our 33,000-molecule NMR database and GEOM's 300,000-molecule repository. This limited matched dataset poses a risk of overfitting and of underutilizing the structural data in GEOM.

We addressed this constraint through a similarity-based data expansion strategy. Using Morgan fingerprint similarity searches with the Tanimoto coefficient, we identified ~5,000 molecules in GEOM with high structural similarity to the unmatched compounds in our NMR dataset. These similar molecules, while lacking NMR data, provided rich 3D conformer geometry information suitable for self-supervised pretraining. The pretraining stage enabled learning generalizable three-dimensional molecular representations from this expanded dataset without requiring additional expensive experimental measurements.

Research Quality

Both primary data sources underwent peer review prior to their use in this study. GEOM is a publicly available repository containing quantum-mechanically optimized conformers validated through published generation protocols. Our experimental NMR database was peer-reviewed and published in Communications Chemistry (TransPeakNet, https://doi.org/10.1038/s42004-025-01455-9), ensuring data quality through journal review processes.

Several data quality barriers were encountered during implementation. Computational constraints arose from similarity searching across ~30,000 queries against 300,000+ GEOM molecules, requiring extensive fingerprint computation. Data heterogeneity presented integration challenges, as combining quantum mechanical structural data with experimental spectroscopic measurements required careful handling of different formats, coordinate systems, and quality standards across independently developed datasets.

Research Challenges (non-scored criteria)

Data format heterogeneity required integrating quantum mechanical coordinates with experimental spectroscopic measurements. We addressed this through systematic validation pipelines including geometric consistency checks and canonical atom ordering.

Computational scalability posed difficulties when searching 30,000 queries against 300,000+ GEOM molecules. Heavy-atom pre-filtering reduced the search space by 99%, enabling practical completion times.
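One way such heavy-atom pre-filtering could work is sketched below: library molecules are bucketed by heavy-atom count, and each query is compared only against molecules of similar size. This is an assumption about the general idea, not the project's implementation; the function names, the tolerance value, and the toy counts are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_buckets(heavy_atom_counts: Dict[str, int]) -> Dict[int, List[str]]:
    """Group molecule ids by their (precomputed) heavy-atom count."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for mol_id, count in heavy_atom_counts.items():
        buckets[count].append(mol_id)
    return dict(buckets)


def candidates(query_count: int, buckets: Dict[int, List[str]], tolerance: int = 2) -> List[str]:
    """Return ids whose heavy-atom count is within `tolerance` of the query's.
    Only these candidates would go on to the fingerprint comparison."""
    selected: List[str] = []
    for count in range(query_count - tolerance, query_count + tolerance + 1):
        selected.extend(buckets.get(count, []))
    return selected


# Toy library: ids mapped to made-up heavy-atom counts.
library_counts = {"mol_a": 5, "mol_b": 6, "mol_c": 12, "mol_d": 30}
library_buckets = build_buckets(library_counts)
shortlist = candidates(6, library_buckets, tolerance=1)
```

Here the query with 6 heavy atoms is compared against only `mol_a` and `mol_b`. The filter is a heuristic: it trades a small risk of missing a similar molecule of different size for a large reduction in the number of pairwise comparisons.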

Supporting Documents
Provide resources for the evaluation of your secondary research project including but not limited to (up to 10 URLs):
● Publicly available outputs of the secondary analysis including results, methods, conclusions and relevant metadata
● The persistent identifier of the datasets used and generated
● Standards, tools, and metadata associated with the implementation outputs
● Articles, preprints, or scientific publications on the project
● Other relevant related resource