Introduction

A large proportion of organic molecules relevant to medicine and biotechnology contain one or more ionizable groups, which means that fundamental physical and chemical properties (e.g. the charge of the molecule) depend on the pH of the surroundings via the corresponding pKa values of the molecules. As drug- and material design increasingly is being done through high throughput screens, fast - yet accurate - computational pKa prediction methods are becoming crucial to the design process.

There are several empirical pKa prediction tools (e.g. ACD pKa DB, Chemaxon, and Epik) that offer predictions in less than a second and can be used by non-experts. These methods are generally quite accurate but often fail for classes of molecules that are not found in the underlying database. \citet{Settimo_2013} have recently shown that the empirical methods are particularly prone to failure for amines, which represent a large fraction of drugs currently on the market or in development. These underlying databases are not public and it is therefore difficult to anticipate when empirical methods will fail. Furthermore, the user is generally not able to augment the databases for cases where they are found to fail rendering the empirical methods essentially useless for certain molecular design projects.

pKa values can be predicted with significantly less empiricism using electronic structure theory (QM) \cite{Ho_2014}. The accuracy of these QM-based predictions appear to rival that of the empirical approaches, but a direct comparison on a common set of molecules has not appeared in the literature and most QM-based pKa prediction studies have focused on relatively small sets of simple benchmark molecules. One notable exception to the latter statement is the study by \citet{Eckert_2005} who computed the pKa by

\[\mathrm{pK_a}=c_1\frac{\Delta G^\circ}{RT\ln (10)} + c_2 + (N_C - 1)\]

where \(\Delta G^\circ\) denotes the change in standard free energy for the reaction

\[\mathrm{ BH^+ + H_{2}O \rightleftharpoons B + H_{3}O^+ }\]

and is approximated as the sum of the electronic and solvation free energy. \(N_C\) - 1 is an empirical correction accounting for the observation that the the method systematically underestimated the pKa of secondary (\(N_C\) = 2) and tertiary (\(N_C\) = 3) amines by ca 1 and 2 pH units, respectively. Using this approach the pKa values of 58 drug-like molecules containing one or more ionizable N atoms could be reproduced with a root mean square deviation (RMSD) of 0.7 pH units. However, the method relies on conformer search at the BP/TZVP level of theory which is computationally too expensive for routine use in screening and design.

Semiempirical QM (SQM) methods are many orders of magnitude faster than conventional QM but their application to small molecule pKa prediction has been very limited and have focused mainly indirect prediction using atomic charges \cite{Stewart_2008,Ugur_2014}. The most likely reason for this is that SQM methods give significantly worse pKa predictions if used with an arbitrary reference molecule such as H2O (Rxn 1). However, we \cite{Li_2004} and others \cite{Li_1997,Govender_2010,Sastre_2012} have shown that a judicious choice of reference molecule is a very effective way of reducing the error in pKa predictions. Here we show that this approach is the key to predict accurate pKa values using our PM6-D3H+ SQM method \cite{Kromann_2014} combined with the SMD solvation method \cite{Marenich_2009}.