ROUGH DRAFT authorea.com/93692

# Protein Structure Refinement using Quantum Mechanics-Based Chemical Shift Prediction

Abstract

Alternative title: Small changes in protein structure improves quantum mechanics-based chemical shift prediction

Can QM-based chemical shift prediction be made as accurate as empirical methods by making small changes to the protein structure?

If these changes are small enough, it may not make sense to talk if an improvement

Real question (not addressed here) if protein structure determination is more accurate with ProCS15. E.g. does one get better structures starting from 4-6 Å structures?

Comparison to CHARMM only refinement?

# Introduction

Some of us recently showed (Christensen 2013) that protein refinement using a DFT-based backbone amide proton chemical shift predictor (ProCS) yielded more accurate hydrogen-bond geometries and $$^\text{3h}$$J$$_\text{NC'}$$ coupling constants involving backbone amide groups than corresponding refinement with CamShift. Furthermore, the ProCS predictions based on the structurally refined ensemble yielded amide proton chemical shift predictions that were at least as accurate as CamShift. This suggests that the larger RMSD observed for QM-based chemical shift predictions may, at least in part, be due to relatively small errors in the protein structures used for the predictions, and not a deficiency in the underlying method. This paper tests whether this is true for other back bone atoms and C$$\beta$$ using ProCS15 (Larsen 2015). We first describe how the isotropic chemical shielding values can be related to the experimental chemical shifts as part of the simulation. Secondly, we perform Monte Carlo simulations to refine the structures of xx proteins ...

# Theory

## Overview

The protein structures are refined using Markov Chain Monte Carlo (MCMC) simulations and a hybrid energy function based on a standard force field energy $$(E_{\text{FF}})$$ augmented by an energy term $$(E_{\text{CS}})$$ that reflects the agreement between predicted and experimental chemical shifts

$\label{eqn:ehybrid} E_{\text{hybrid}} = E_{\text{FF}} + wE_{\text{CS}}$

The optimum weight $$(w)$$ of the chemical shift data is determined as part of the simulation as described below. If one assumed that the probability that the predicted chemical shifts match the experimental values $$(p_{\text{CS}})$$ follows a Gaussian distribution (Bratholm 2015), Eq \ref{eqn:ehybrid} can be rewritten as as

\begin{align} E_{\text{hybrid}} & = E_{\text{FF}} - k_{\text{B}}T \ln (p_{\text{CS}})\\ & = E_{\text{FF}} + k_{\text{B}}T \sum_j^n \left( (N_j+1)\ln(s_j) + \frac{\chi_j^2}{2s_j^2} \right) \end{align}

Here $$k_{\text{B}}$$ is Boltzmann’s constant, $$T$$ is the temperature of the simulation, $$n$$ is the number of different atom types (C$$_\alpha$$, C$$_\beta$$, H$$_\alpha$$, etc.) for which chemical shifts are available, $$N_j$$ is the number of chemical shifts of nuclei type $$j$$, and $$s_j$$ is the uncertainty (variance) in the prediction of chemical shift-type $$j$$. From this it is seen that the variances are effectively describing the weight $$w$$ of the experimental data.

Finally, $$\chi_j$$ is defined as

$\chi_j^2 = \sum_i^{N_j} \left( \delta_i - \delta_{\text{pred},i} \right)^2$

where $$\delta_{i}$$ is the experimental chemical shift for nucleus $$i$$ and $$\delta_{\text{pred},i}$$ is the corresponding predicted value.

In this study we use ProCS15 (Larsen 2015) to predict the corresponding isotropic chemical shielding value $$\sigma_i$$, which we related to $$\delta_i$$ by

$\delta_{\text{pred},i} = a\sigma_i + b$

$$a$$ and $$b$$ are determined for each atom type by minimizing the deviation from the experimental chemical shifts as part of the simulation (see next subsection)

$\boldsymbol{\delta} = {\pmb a}{\pmb \sigma} + {\pmb b}$

## Details

Following the inferential structure determination approach by Rieping, Habeck and Nilges (Rieping 2005), the hybrid energy corresponds to the joint posterior density for all unknown parameters

\begin{aligned} p\left({\pmb X}, {\pmb \theta} \mid d \right) \propto p\left(d \mid {\pmb X}, {\pmb \theta} \right) \pi \left( {\pmb X} \right) \pi \left( {\pmb \theta} \right) \\ E_{\text{hybrid}} = -k_{\text{B}}T \ln \left( p\left({\pmb X}, {\pmb \theta} \mid d \right) \right) \\ E_{\mathrm{CS}} = -k_{\text{B}}T \ln \left(p\left(d \mid {\pmb X}, {\pmb \theta} \right) \pi \left( {\pmb \theta} \right) \right)\end{aligned}

where $$d$$ is the experimental data $$(\{{\pmb \delta} \})$$, $${\pmb X}$$ the given structure and $${\pmb \theta}$$ unknown model parameters $$({\pmb a},{\pmb b},{\pmb s})$$.

Here a Bayesian linear regression model is used to describe the agreement between the prediction of chemical shieldings by ProCS15, $$\left\{ {\pmb\sigma}\right\}$$, and the chemical shifts found experimentally, $$\left\{{\pmb \delta}\right\}$$, such that $${\pmb \delta}_i = {\pmb a} \cdot {\pmb \sigma}_i + {\pmb b} + {\pmb \epsilon}_i$$ for residue $$i$$, with $${\pmb \epsilon}_i$$ being a zero centered normal error with diagonal covariance matrix $$\pmb s$$. $$\pmb s$$ has dimension $$N \times N$$ with all off diagonal elements being 0, and $$\left\{ \pmb{\sigma}\right\}$$ and $$\left\{{\pmb \delta}\right\}$$ are sets of $$N$$ dimensional vectors for all $$N$$ residues.) Thus \begin{aligned} p\left(\left\{ \pmb{\delta}\right\} \mid {\pmb a}, {\pmb b}, \left\{{\pmb \sigma}\right\},{\pmb s}\right) &\propto& \prod_{i=1}^N |{\pmb s}|^{-1/2} \exp\left( - \frac{1}{2} \left({\pmb \delta}_i - {\pmb a} \cdot {\pmb \sigma}_i - {\pmb b} \right)^{\mathrm T} {\pmb s}^{-1} \left({\pmb \delta}_i - {\pmb a} \cdot {\pmb \sigma}_i - {\pmb b} \right) \right) \\ &=& \prod_{j=1}^n\prod_{i=1}^{N_j} s_{jj}^{-1/2} \exp\left( - \frac{\left({\delta}_{ij} - a_j \cdot {\sigma}_{ij} - b_j \right)^2}{2s_{jj}} \right)\end{aligned} Noninformative priors are used for the model parameters [citation]: \begin{aligned} \pi \left( {\pmb s}, {\pmb a}, {\pmb b}\right) = \pi \left( {\pmb s} \right) \pi \left( {\pmb a}, {\pmb b} \right) = \prod_j^n {s_{jj}} ^ {-1/2} \cdot \left(1 + a_j^2 \right)^{-3/2}\end{aligned} The parameters $$\pmb a$$ and $$\pmb b$$ are marginalized out using Laplace’s Method:

\begin{aligned} p\left(\left\{ {\pmb\delta}\right\} \mid \left\{{\pmb \sigma}\right\},{\pmb s}\right) &=& \prod_j^n \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p\left(\left\{ \delta_j\right\} \mid a_j, b_j, \left\{\sigma_j\right\}, s_{jj}\right) \pi \left(a_j,b_j\right) \mathrm d\, a_j \mathrm d\,b_j \\ &\propto& \prod_{j=1}^n \frac{ \exp\left( - \frac{N_j\sum_i^{N_j}{\delta}^2_{ij}-\left(\sum_i^{N_j}\delta_{ij}\right)^2-\hat a_j \left(N_j \sum_i^{N_j}\sigma_{ij}\delta_{ij} - \sum_i^{N_j}\sigma_{ij}\sum_i^{N_j}\delta_{ij}\right)}{2N_j s_{jj}} \right) \sqrt{s_{jj}}} {\left(1+\hat a_j^2\right)^{3/2}\sqrt{N_j\sum_i^{N_j}\sigma_{ij}^2-\left(\sum_i^{N_j}\sigma_{ij}\right)^2}}\end{aligned}

with $$\hat a_j$$ being the maximum likelihood estimate for the regression slope $\hat a_j = \frac {N_j\sum_i^{N_j} \sigma_{ij}\delta_{ij} - \sum_i^{N_j}\sigma_{ij} \sum_i^{N_j}\delta_{ij}} {N_j \sum_i^{N_j}\sigma_{ij}^2 - \left(\sum_i^{N_j}\sigma_{ij}\right)^2}$