
Deep Imputation on Large-Scale Drug Discovery Data
  • Benedict Irwin,
  • Thomas Whitehead,
  • Scott Rowland,
  • Samar Mahmoud,
  • Gareth Conduit,
  • Matthew Segall
Benedict Irwin
Optibrium Ltd

Corresponding Author

Thomas Whitehead
Intellegens Ltd
Scott Rowland
Takeda Oncology
Samar Mahmoud
Optibrium Ltd
Gareth Conduit
Intellegens Ltd
Matthew Segall
Optibrium Ltd


More accurate predictions of the biological properties of chemical compounds would guide the selection and design of new compounds in drug discovery and help to address the enormous cost and low success rate of pharmaceutical R&D. However, this domain presents a significant challenge for AI methods because compound data are sparse and results from biological experiments are inherently noisy. In this paper, we demonstrate how data imputation using deep learning provides substantial improvements over the quantitative structure-activity relationship (QSAR) machine learning models that are widely applied in drug discovery. We present the largest successful application to date of deep-learning imputation, on datasets comparable in size to the corporate data repository of a pharmaceutical company (678,994 compounds by 1,166 endpoints). We demonstrate this improvement in three areas of practical application linked to distinct use cases: (i) target activity data compiled from a range of drug discovery projects; (ii) a high-value, heterogeneous dataset covering complex absorption, distribution, metabolism and elimination properties; and (iii) high-throughput screening data, which test the algorithm's limits on early-stage, noisy and very sparse data. The deep learning imputation method achieves median coefficients of determination, R², of 0.69, 0.36 and 0.43 respectively across these applications, an unambiguous improvement over random forest QSAR methods, which achieve median R² values of 0.28, 0.19 and 0.23 respectively. We also demonstrate that robust estimates of the uncertainties in the predicted values correlate strongly with the accuracy of the predictions, enabling greater confidence in decision-making based on the imputed values.
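As an illustration of the headline metric, the following minimal sketch computes a coefficient of determination per endpoint on a sparse compounds-by-endpoints matrix and then takes the median across endpoints. It is not the authors' evaluation code: the function names, the use of NaN to mark unmeasured values, and the toy numbers are all assumptions made for this example, and the abstract does not state the exact evaluation protocol.

```python
import numpy as np

def r2_per_endpoint(y_true, y_pred):
    """R^2 for each endpoint (column), using only rows where a measurement exists.

    Missing measurements in y_true are assumed to be encoded as NaN.
    """
    scores = []
    for j in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, j])
        if mask.sum() < 2:
            scores.append(np.nan)  # too few measurements to score this endpoint
            continue
        t = y_true[mask, j]
        p = y_pred[mask, j]
        ss_res = np.sum((t - p) ** 2)          # residual sum of squares
        ss_tot = np.sum((t - t.mean()) ** 2)   # total sum of squares
        scores.append(1.0 - ss_res / ss_tot if ss_tot > 0 else np.nan)
    return np.array(scores)

def median_r2(y_true, y_pred):
    """Median R^2 across endpoints, skipping endpoints that could not be scored."""
    return float(np.nanmedian(r2_per_endpoint(y_true, y_pred)))

# Toy data (hypothetical): 4 compounds by 2 endpoints, NaN marks an unmeasured value.
y_true = np.array([[1.0, np.nan],
                   [2.0, 1.0],
                   [3.0, 2.0],
                   [np.nan, 3.0]])
y_pred = np.array([[1.0, 0.0],
                   [2.0, 1.0],
                   [3.0, 2.0],
                   [0.0, 4.0]])

print(r2_per_endpoint(y_true, y_pred))
print(median_r2(y_true, y_pred))
```

Scoring each endpoint separately and reporting the median keeps a few very well or very poorly predicted endpoints from dominating the summary, which matters when endpoints differ widely in sparsity and noise.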
11 Jan 2021: Submitted to Applied AI Letters
15 Jan 2021: Submission Checks Completed
15 Jan 2021: Assigned to Editor
20 Jan 2021: Reviewer(s) Assigned
04 Mar 2021: Review(s) Completed, Editorial Evaluation Pending
19 Mar 2021: Editorial Decision: Revise Major
27 Apr 2021: 1st Revision Received
28 Apr 2021: Submission Checks Completed
28 Apr 2021: Assigned to Editor
04 May 2021: Reviewer(s) Assigned
19 May 2021: Review(s) Completed, Editorial Evaluation Pending
19 May 2021: Editorial Decision: Accept