Saturday, March 23, 2024

Book Report: Machine Learning for Drug Discovery

Drug Discovery is a field where biochemists (and more recently computer scientists) turn ideas into potential medications. I first came across a few applications in this area when checking out how to build Graph Neural Networks (GNNs) as part of auditing the CS224W: Machine Learning with Graphs course from Stanford, some learnings of which I recycled into my Deep Learning with Graphs tutorial at ODSC 2021. Of course, drug discovery is much more than just GNNs; I mention this only because it happened to be my entry point into this fascinating world. However, I will hasten to add that despite having made an entrance, I am still parked pretty solidly close to the entrance (or exit, depending on your point of view).

But I am always looking to learn more about stuff I find interesting, so when I was offered a chance to review Dr Noah Flynn's Machine Learning for Drug Discovery published by Manning, I jumped on it. The book is currently in MEAP (Manning Early Access Program), so only 5 chapters are available right now, but once the book is completed there are going to be 15 chapters in all. The intended audience of the book, as the title suggests, is computational biochemists, i.e. the ones who attempt to solve Drug Discovery problems using Machine Learning. There are two main ways to become a computational biochemist -- either you are a biochemist and you learn the ML, or you are an ML person and you learn the biochemistry. The book is aimed at both categories of readers.

As someone in the latter category, I had to spend much more time on the biochemistry aspects. I suspect that most readers of this review would also fall into this category. For them, I would say that while the ML part is sophisticated enough to solve the problem at hand, it consists of methods and practices that should be familiar to most ML people already. The most useful things that I think you would get out of this book are as follows:

  • Framing the Drug Discovery problem as an ML problem
  • Preprocessing and Encoding inputs
  • Getting data to train your ML model

For the first one, you either need to have a biochemistry background yourself, or you need to pair with someone who does. I suppose you could get by with a life sciences or chemistry background as well, or acquire enough biochemistry knowledge over time in this field, and this book may even take you part of the way there, but be aware that the learning curve is steep.

For the second and the third items, I thought the book was super useful. Most chapters are built as case studies around a Drug Discovery problem, so as you go through the chapters, you will learn about the sites to acquire your datasets from, and the techniques to preprocess the data from these sites into a form suitable for consumption by your ML model. At least the first 5 chapters deal with fairly simple ML models, which may or may not be familiar to you depending on your industry, so you might also learn a few things about evaluating or tuning these models that you didn't know before (I did).

The first chapter introduces the reader to the domain and talks about the need for computational approaches to Drug Discovery. It introduces the terminology and the RDKit software library, an open-source cheminformatics toolkit that provides implementations of many common operations needed for computational Drug Discovery (sort of like a specialized supplement to Scikit-Learn for general ML). It also covers high level rules of thumb for detecting drug compounds, such as Lipinski's rule of 5. It then covers some common use cases in Drug Discovery, ranging from Virtual Screening to Generative and Synthetic Chemistry. It also covers some popular (and public) repositories for Chemistry data, such as ChEMBL, PubChem, Protein Data Bank (PDB), etc.
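To make the rule of 5 concrete, here is a minimal sketch of the check in plain Python. It is not code from the book. The `Descriptors` container and both function names are my own, and the descriptor values are assumed to be precomputed (in practice they would come from a toolkit such as RDKit). The thresholds are the usual ones (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors), and allowing one violation is a common convention rather than a hard law.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container for this sketch)."""
    mol_weight: float       # molecular weight in daltons
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int      # number of hydrogen bond donors
    h_bond_acceptors: int   # number of hydrogen bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four criteria the molecule violates."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Treat a molecule as drug-like if it has at most `max_violations` violations."""
    return lipinski_violations(d) <= max_violations


# Illustrative, approximate values only: a small aspirin-like molecule
# and a large, greasy molecule that breaks every criterion.
small = Descriptors(mol_weight=180.2, log_p=1.2, h_bond_donors=1, h_bond_acceptors=4)
large = Descriptors(mol_weight=1200.0, log_p=7.5, h_bond_donors=6, h_bond_acceptors=12)

print(lipinski_violations(small), passes_rule_of_five(small))
print(lipinski_violations(large), passes_rule_of_five(large))
```

A filter like this is cheap, so it is typically run early to shrink the candidate set before any model is involved.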

The second chapter demonstrates Ligand based Screening, where you already have a reference molecule with some of the desired properties, and you want to search the chemical space for molecules similar to that one, with the objective of finding more drugs like the one you started with. The case study here is to identify potential anti-malarial compounds. The dataset for this comes packaged with RDKit itself as Structure-Data Files (SDF), which describe each molecule using a SMILES (Simplified Molecular Input Line Entry System) string. The chapter walks us through converting the SMILES to MOL format, then using RDKit to extract specialized chemical features from the MOL and SMILES, and preprocessing to filter out uninteresting molecules based on rule based thresholds such as bio-availability and molecular weight, structure based thresholds such as toxicity, and specific substructural patterns (similar to subgraph motifs). It then uses RDKit to generate Morgan fingerprints out of the remaining molecules (MOL). Morgan (and other) fingerprints are similar to embeddings in NLP, except that they encode structural information through a more deterministic process, and are hence more explainable than embeddings. Finally, these fingerprints are compared with the reference molecule using Tanimoto similarity, and the nearest neighbors are found.
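The last step is simple enough to sketch without RDKit. The snippet below is my own illustration, not the book's code. It represents each fingerprint as the set of its "on" bit positions, which is an assumption made for readability (RDKit uses its own bit vector types), and the molecule names and bit positions are made up. Tanimoto similarity is the number of shared on-bits divided by the number of bits set in either fingerprint.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit positions."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both are empty
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_neighbors(
    reference: Set[int], library: Dict[str, Set[int]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k library molecules most similar to the reference."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy fingerprints: hypothetical names and bit positions.
reference_fp = {1, 5, 9, 12}
library_fps = {
    "mol_a": {1, 5, 9, 12},   # identical to the reference
    "mol_b": {1, 5, 20},      # shares two bits with the reference
    "mol_c": {30, 31},        # shares nothing
}

for name, score in nearest_neighbors(reference_fp, library_fps, k=2):
    print(name, round(score, 3))
```

Because every bit in a Morgan fingerprint corresponds to a substructure, a high score here means "shares many substructures", which is what makes the ranking easy to explain.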

Chapter 3 continues with the problem of Ligand based screening, but tries to predict cardiotoxicity of the anti-malarial compounds found in the previous chapter using a linear model. This is done indirectly by predicting whether the compound blocks the hERG (human Ether-à-go-go-Related Gene) potassium channel: if it does, it is considered cardiotoxic, and vice versa. A linear model (Scikit-Learn SGDClassifier) is trained using the hERG dataset from the Therapeutic Data Commons (TDC). The chapter shows some Exploratory Data Analysis (EDA) on the data, using standard preprocessing as described in the previous chapter. An additional step here is to standardize the data for classification. The author provides the biochemistry reasoning behind this step, but uses the implementation already provided by RDKit. Finally, Morgan fingerprints are used to train the SGD Classifier. Because the elements of Morgan fingerprints have meaning, the weights of the resulting SGD model can be used to determine feature importances. There is also some discussion here of cross validation, L1/L2 regularization, removing collinearity, adding interaction terms and hyperparameter sweeps.

Chapter 4 explores building a linear regression model to predict solubility, i.e. how much of the drug would be absorbed by the system. The dataset used to train the regressor is the AqSolDB, also from TDC. This chapter introduces the idea of scaffold splitting, a technique common with biochemical datasets that preserves the structural / chemical similarity within each split. It also briefly describes outlier removal at the extremes, which requires chemistry knowledge. The RDKit library is used to extract features from the dataset, and the model is trained to minimize the Mean Squared Error loss. The chapter also introduces RANSAC (RANdom SAmple Consensus), a technique that makes models more robust to outliers. On the ML side, there is some discussion on the bias-variance tradeoff and Learning / Validation curves.
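For ML readers who have only used random splits, the following sketch shows the idea behind scaffold splitting. This is my own simplified version, not the book's implementation. It assumes each molecule already comes with a scaffold string (in practice these might be computed with something like RDKit's Murcko scaffold utilities), and the function name and the greedy "largest group first" assignment are choices made for illustration. The key property is that all molecules sharing a scaffold land in the same split, so the test set contains structures the model has not seen.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def scaffold_split(
    scaffolds: Sequence[str], train_fraction: float = 0.8
) -> Tuple[List[int], List[int]]:
    """Split molecule indices so that no scaffold appears in both splits."""
    # Group molecule indices by their scaffold string.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    train_cutoff = train_fraction * len(scaffolds)
    train, test = [], []
    for _, members in ordered:
        # Keep whole groups together: a group goes to train only if it fits.
        if len(train) + len(members) <= train_cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Toy example with hypothetical scaffold strings for six molecules.
toy_scaffolds = [
    "c1ccccc1", "c1ccccc1", "c1ccccc1",  # three share one scaffold
    "c1ccncc1", "c1ccncc1",              # two share another
    "C1CCCCC1",                          # one singleton
]
train_idx, test_idx = scaffold_split(toy_scaffolds, train_fraction=0.8)
print("train:", train_idx)
print("test:", test_idx)
```

Note that the resulting split sizes only approximate the requested fraction, since groups are never broken up.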

The fifth and last chapter of the MEAP (at the time of writing this review) deals with predicting how well the body will metabolize the drug. Typically, drugs are broken down by enzymes in the liver, a large proportion of which are collectively known as the Cytochrome P450 superfamily. As before, metabolism is predicted indirectly by whether the drug inhibits Cytochrome P450 -- if it does, then it will not get metabolized easily, and vice versa. The dataset used to train the model is the CYP3A4 dataset, also from TDC. Data is prepared using the same (by now) standard pipeline, and the classifier is trained to make binary predictions of whether the input inhibits Cytochrome P450 or not. The chapter discusses the utility of Reliability Plots in Performance Evaluation and Platt scaling for calibrating probabilities. It also talks about approaches for dealing with class imbalance, such as Data Augmentation and Class Weights. Various models are trained and evaluated, and their important features identified and visualized with RDKit Similarity Map. The chapter ends with a short discussion on Multi-label classification.
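A reliability plot is easy to compute by hand, which helps in understanding what it shows. The sketch below is my own illustration in plain Python with made-up labels and probabilities, not the book's code. It groups predictions into equal-width probability bins and, for each non-empty bin, compares the mean predicted probability against the observed fraction of positives. For a well calibrated classifier the two numbers are close in every bin. The function name and the choice of equal-width bins are assumptions.

```python
from typing import List, Sequence, Tuple


def reliability_bins(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    prob_sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins

    for label, prob in zip(y_true, y_prob):
        # Equal-width bins over [0, 1]; a probability of 1.0 goes in the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        prob_sums[idx] += prob
        positives[idx] += label
        counts[idx] += 1

    rows = []
    for prob_sum, pos, count in zip(prob_sums, positives, counts):
        if count > 0:
            rows.append((prob_sum / count, pos / count, count))
    return rows


# Made-up labels (1 = inhibitor) and predicted probabilities.
labels = [0, 0, 1, 1, 1, 0]
probs = [0.1, 0.1, 0.9, 0.9, 0.5, 0.5]

for mean_prob, observed_rate, count in reliability_bins(labels, probs, n_bins=5):
    print(f"predicted={mean_prob:.2f} observed={observed_rate:.2f} n={count}")
```

Plotting the first column against the second gives the reliability plot. Calibration methods such as Platt scaling aim to pull those points toward the diagonal.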

The pandemic and the rapid discovery of the COVID vaccine gave a lot of us (at least those of us that were watching) a ringside view into the fascinating world of drug discovery. This book provides yet another peek into this world, with its carefully crafted case studies and examples. Overall, I think you will learn a lot about drug discovery if you go through this book, both on the biochemistry side and the ML side. There are exercises at the end of each chapter; doing these would help you get more familiar with RDKit and hopefully more effective at computational drug discovery.