Applying AI and HPC to Drug Design: A Survey of R&D Work
The motivation for this report
In 2018, while at Supercomputing in Dallas, I had two key encounters that have profoundly influenced my work ever since (a third followed at ISC 2019):
FIRST. At Nvidia, I saw a presentation about ANI-1, which approximates molecular energies using deep learning. This is a precondition for accurately calculating dynamical and chemical properties. The result is a dramatic reduction in time to solution, while achieving the same or better accuracy than the exact numerical method. Five (yes, 5) orders of magnitude faster time-to-solution was demonstrated on a 54-atom molecule compared with Density Functional Theory (DFT). The neural network approximates DFT output data. But molecules vary in size while the input to the network must be of constant size, so the authors extended the Behler-Parrinello symmetry functions and created special vectors that describe the input to the network.
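To make the fixed-size-input trick concrete, here is a minimal NumPy sketch of a Behler-Parrinello-style radial symmetry function; the cutoff radius and Gaussian parameters are illustrative choices of mine, not the values from the ANI-1 paper:

```python
# A minimal sketch of a radial symmetry function: it maps a molecule of ANY
# size to one fixed-size descriptor vector per atom, which is what lets a
# neural network with a constant input width approximate DFT energies.
import numpy as np

def cutoff(r, r_c=6.0):
    """Smooth cosine cutoff: 1 at r=0, 0 at and beyond r_c (Angstrom)."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def radial_descriptor(coords, eta=4.0, shifts=np.linspace(0.5, 5.5, 16)):
    """Return an (n_atoms, len(shifts)) array: one fixed-size vector per atom."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1e6)              # push self-distance beyond cutoff
    fc = cutoff(dist)                        # (n, n) cutoff weights
    # Gaussian shells centered at each shift R_s, summed over neighbors j
    g = np.exp(-eta * (dist[:, :, None] - shifts[None, None, :]) ** 2)
    return (g * fc[:, :, None]).sum(axis=1)  # (n, 16) regardless of n

# Toy example: a water-like geometry, 3 atoms -> 3 descriptors of length 16
coords = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
print(radial_descriptor(coords).shape)       # (3, 16)
```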
SECOND. At the Department of Energy booth, I saw results of a joint project between the DOE and the National Cancer Institute on using deep learning to enable RAS-RAF protein calculations. This protein is implicated in a significant number of cancer types, and it interacts with the cell membrane. One way to explain the mechanism, or signaling pathway, is through an atomic-level HPC calculation of the entire cell system. This is, however, not feasible even on exascale computers. Because of the widely different space and time scales between cell and atom, deep learning autoencoders were used to isolate patches of interest on which an atomic-scale calculation was then applied.
THIRD. At ISC 2019 I saw an application of variational autoencoders: Generative modeling of protein folding transitions with recurrent auto-encoder. The authors analyze 40 million atoms' worth of data in a lower-dimensional space, and then extend it in the time domain without changing the simulation software. The output data from the molecular dynamics (MD) simulation is the input to a Convolutional Variational Autoencoder. The network learns the encoding by minimizing the pixel-by-pixel distance in the simulated contact matrix, and it captures the ensemble state of the simulation by minimizing the Kullback-Leibler divergence. Next, the simulation is extended in the time domain by extrapolating the state update and evolving the system in the feature space using a regressor function. Positions and velocities are calculated in the decoder, which also calculates the error relative to the MD simulation. The AI model infers the solution several orders of magnitude faster than Molecular Dynamics calculations.
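For readers who want to see the two loss terms side by side, here is a minimal PyTorch sketch of a convolutional VAE on contact matrices; the tiny network and the 32x32 map size are my illustrative assumptions, not the paper's architecture:

```python
# A minimal sketch of the two loss terms described above: a pixel-by-pixel
# reconstruction term on the contact matrix and a Kullback-Leibler term
# that regularizes the latent space toward a standard normal prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContactMapVAE(nn.Module):
    def __init__(self, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1),
                                 nn.ReLU(), nn.Flatten())
        self.to_mu = nn.Linear(16 * 16 * 16, latent_dim)
        self.to_logvar = nn.Linear(16 * 16 * 16, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 32 * 32), nn.Sigmoid())

    def forward(self, x):                     # x: (batch, 1, 32, 32) contact maps
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(z).view(-1, 1, 32, 32), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # pixel-by-pixel distance between simulated and reconstructed contact maps
    rec = F.binary_cross_entropy(recon, x, reduction="sum")
    # KL divergence to a standard normal prior, in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = ContactMapVAE()
x = torch.rand(4, 1, 32, 32)                  # toy batch of contact maps
recon, mu, logvar = model(x)
print(vae_loss(recon, x, mu, logvar).item())
```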
In all three cases, deep learning is essential to enable computations that would otherwise be unfeasible. So I asked myself what comes next (and what might be in it for me):
- Can AI also be used for drug design?
- How does AI work with quantum computing?
- Can AI become a valid approximation of Density Functional Theory, or of ab initio algorithms?
- Who are the key players, including startups?
- What is the analysts' viewpoint?
The R&D efforts to deal with the pandemic are one answer to question #1 above. In 2020, there was a joint effort by the DOE, the US national labs, cloud providers, and several universities to tackle the virus challenge from multiple perspectives. What they share with cancer research is the need for better, faster drug design.
But, before I start: why would AI be any good at drug design? It turns out this is a complex task that even human experts do not do well. Unlike computer vision, it is not a task that humans can already do and machines merely do better. The answer may lie in how the complex task is split into smaller, more manageable ones. Some are examined in the next paragraphs, and a hybrid approach including AI, HPC, and potentially QC promises to be the right Ansatz.
In what follows, I summarize R&D work published between 2020 and now at the GPU Technology Conference and at the Supercomputing and International Supercomputing conferences.
Non-goals
The following topics will not be covered: omics, protein engineering, medical imaging, laboratory automation and robotics, virtual screening, and virus containment and mitigation measures. Omics, protein engineering, and automation may be discussed in the next document of this series, as they are related to the subject matter of this survey.
Bridging the time-space scale
Doing atomic-level accurate calculations over an entire cell, or even over the coronavirus spike protein, is not feasible even on the largest supercomputers. Fortunately, AI comes to the rescue: only the trajectories or regions of interest are sampled for detailed numerical analysis.
- The RAS-RAF protein calculations were expanded upon by Lawrence Livermore National Lab in 2020:
Machine Learning Driven Importance Sampling Approach for Multiscale Simulations.
This is a new automated way to couple two scales through a machine-learning-driven adaptive sampling approach that can focus on a user-defined hypothesis. The authors demonstrate the technique on simulations of the interactions of RAS and RAF proteins with plasma membranes in the context of cancer-signaling mechanisms. The sampling framework is capable of producing massively parallel multiscale simulations that scale to Sierra or Summit, and it was extended to support three scales: continuum, coarse-grained, and atomistic simulations of RAS and RAF proteins on plasma membranes.
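To make the adaptive sampling idea concrete, here is a heavily simplified sketch of my own (not LLNL's actual framework): frames of the cheap coarse-grained simulation are embedded in a learned latent space, and the least-explored ones are selected for promotion to expensive atomistic MD.

```python
# A minimal sketch of ML-driven importance sampling across scales: an ML
# model embeds coarse-grained frames into a latent space, and the frames
# least similar to those already simulated at high resolution are selected
# for promotion to atomistic MD.
import numpy as np

def select_frames_for_atomistic(latent, n_select=4):
    """Greedy farthest-point selection in latent space: maximizes coverage."""
    chosen = [int(np.argmax(np.linalg.norm(latent - latent.mean(0), axis=1)))]
    while len(chosen) < n_select:
        d = np.min(np.linalg.norm(latent[:, None] - latent[chosen][None],
                                  axis=-1), axis=1)
        chosen.append(int(np.argmax(d)))      # most novel frame so far
    return chosen

# Toy stand-in: 1000 coarse-grained frames embedded in a 16-D latent space
# (in the real workflow the embedding comes from a trained network).
latent = np.random.randn(1000, 16)
print(select_frames_for_atomistic(latent))    # indices to re-run atomistically
```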
Deep Learning Based Prediction of the Temporal Behavior of RAS Protein
About 30% of human cancers are caused by interactions of a mutated RAS protein with the downstream RAF near cell membranes. The DOE Cancer Pilot 2 campaign conducted numerous molecular dynamics (MD) simulations modeling RAS proteins in contact with the lipid bilayer of a cell membrane.
Transformer models such as BERT prove useful here: with deep learning, one can extract temporal correlations from the MD calculation results.
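As a hedged illustration of what such a model might look like, here is a small PyTorch transformer that reads a window of per-frame MD features and predicts the protein state after the window; the feature size, window length, and two-state output head are my assumptions, not the Pilot 2 setup:

```python
# A sketch of a transformer encoder over MD trajectory windows: it captures
# temporal correlations across frames and predicts a discrete protein state.
import torch
import torch.nn as nn

class MDTransformer(nn.Module):
    def __init__(self, n_features=64, d_model=128, n_states=2):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_states)  # e.g. a RAS orientation state

    def forward(self, frames):                    # (batch, time, n_features)
        h = self.encoder(self.embed(frames))
        return self.head(h[:, -1])                # predict state after the window

model = MDTransformer()
window = torch.randn(8, 50, 64)                   # 8 trajectories, 50 frames each
print(model(window).shape)                        # torch.Size([8, 2])
```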
- The spike protein dynamics of the coronavirus were investigated using a hybrid AI-HPC model:
AI-Driven Multiscale Simulations Illuminate Mechanisms of SARS-CoV-2 Spike Dynamics
The authors develop a generalizable AI-driven workflow that leverages heterogeneous HPC resources to explore the time-dependent dynamics of molecular systems. The goal is an accurate calculation of the time-dependent movement of the SARS-CoV-2 spike protein as it binds to the human cell receptor: this is the main viral infection machinery. The study applies weighted-ensemble Molecular Dynamics to a 300-million-atom system, which leads to enhanced trajectory sampling.
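The weighted-ensemble idea deserves a small sketch: walkers carry statistical weights, walkers reaching under-explored bins of a progress coordinate are split, and over-represented ones are merged, so rare events get sampled without biasing the statistics. The bin layout and walker counts below are toy values of mine, not those of the study.

```python
# A minimal sketch of one weighted-ensemble resampling step. Splitting halves
# a walker's weight across its clones and merging pools weights, so the total
# statistical weight is conserved exactly.
import numpy as np

rng = np.random.default_rng(0)

def resample(positions, weights, n_bins=5, target_per_bin=2):
    bins = np.minimum((positions * n_bins).astype(int), n_bins - 1)
    new_pos, new_w = [], []
    for b in range(n_bins):
        idx = np.where(bins == b)[0]
        if len(idx) == 0:
            continue
        if len(idx) < target_per_bin:                 # split: clone rare walkers
            for i in idx:
                for _ in range(target_per_bin):
                    new_pos.append(positions[i])
                    new_w.append(weights[i] / target_per_bin)
        else:                                         # merge: keep a weighted sample
            keep = rng.choice(idx, target_per_bin, replace=False,
                              p=weights[idx] / weights[idx].sum())
            w_each = weights[idx].sum() / target_per_bin
            for i in keep:
                new_pos.append(positions[i])
                new_w.append(w_each)
    return np.array(new_pos), np.array(new_w)

pos = rng.random(16)              # progress coordinate in [0, 1) per walker
w = np.full(16, 1.0 / 16)         # equal initial weights, total weight 1
pos, w = resample(pos, w)
print(len(pos), w.sum())          # walker count changes; weight is conserved
```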
Designing a small molecule with antiviral properties
The authors used a VAE model with a discriminator network to generate synthetic molecules from a linear representation of the molecule in the form of a SMILES string.
The novel character-based Wasserstein autoencoder (cWAE) biases the compound search space based on a corpus of training data.
The cWAE proves more efficient than a Junction Tree VAE, which generates a tree of molecular fragments connected into a molecular graph.
The system is an accurate generative model with low compound reconstruction error, and it improves model training time from a day to 23 minutes on a supercomputer with 97% efficiency. This makes directed searches feasible for new compounds that better meet the design criteria.
This is an outstanding example of how AI can make the difference in a design space of O(10^60) possible molecules that can be built out of just a few atoms. This approach is called de novo design, and it differs from virtual screening, which performs lead optimization on known compounds. Virtual screening has complexity O(10^9), so it covers just a tiny part of the whole universe of possible compounds.
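To make the SMILES-based input concrete, here is a minimal sketch of the character-level encoding such generative models typically start from; the vocabulary and padding length are my illustrative choices:

```python
# A minimal sketch of the input side of a character VAE/WAE: SMILES strings
# are treated as character sequences and one-hot encoded to a fixed-size
# tensor, which the model compresses into a continuous latent space that can
# then be searched for new compounds.
import numpy as np

VOCAB = list("CNOScno()=#123456[]@+-H ")       # ' ' is used as padding
CHAR_TO_IDX = {c: i for i, c in enumerate(VOCAB)}

def encode_smiles(smiles, max_len=32):
    """One-hot encode a SMILES string into a (max_len, len(VOCAB)) matrix."""
    one_hot = np.zeros((max_len, len(VOCAB)), dtype=np.float32)
    for pos, char in enumerate(smiles.ljust(max_len)[:max_len]):
        one_hot[pos, CHAR_TO_IDX[char]] = 1.0
    return one_hot

x = encode_smiles("CC(=O)Oc1ccccc1C(=O)O")     # aspirin as a SMILES string
print(x.shape)                                 # (32, 24): fixed size, any molecule
```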
Dealing with molecules as if they were natural language text
Language models based on Bidirectional Encoder Representations from Transformers (BERT) are pre-trained on a large corpus of data. They can then be fine-tuned for a specific task. Moreover, generative models like the Generative Pre-trained Transformer (GPT-3) have been developed. The next development is Large Language Models such as the Nvidia Megatron model with 0.5 trillion parameters, which supports learning from fewer examples, a technique called few-shot learning. These models are also successfully used for biomedical use cases. In either case, the heavy lifting is the data engineering and HPC challenge of pre-training models of such a magnitude.
At GTC 2021, AstraZeneca presented this whitepaper: Machine Learning in drug discovery - how can Cambridge-1 help
They use unsupervised pre-training on a molecular database of 1.5 billion molecules: the model, MegaMolBART, is based on BART, which combines BERT and GPT and builds on autoencoders, and it learns from molecules expressed in the Simplified Molecular Input Line Entry System (SMILES), a chemical language. They use masking and data augmentation for pre-training. This is one of the most interesting capabilities: adapting a technology that was developed for natural languages. In addition, Nvidia Megatron was used to improve model performance and reduce time-to-solution on the Nvidia Cambridge-1 supercomputer.
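Here is a hedged sketch of what the masking step in such pre-training looks like: characters of a SMILES string are hidden behind a mask token and the model must reconstruct the original, just as BERT and BART do with words. The mask rate and mask token are my assumptions, not MegaMolBART's actual configuration.

```python
# A minimal sketch of denoising pre-training data preparation on SMILES:
# the model sees the corrupted string and is trained to restore the target.
import random

random.seed(7)

def mask_smiles(smiles, mask_token="<m>", rate=0.15):
    """Return (corrupted_input, original_target) for denoising pre-training."""
    corrupted = [mask_token if random.random() < rate else ch for ch in smiles]
    return "".join(corrupted), smiles

inp, target = mask_smiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
print(inp)      # corrupted copy with some characters replaced by <m>
print(target)   # the original string the model must reconstruct
```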
On time-dependent series, transformers show a clear performance improvement over LSTM networks. Just as a deep learning model learns the basics of a language by being trained on a large corpus of data such as Wikipedia, it can also learn the chemical language. After training, the system can produce chemically viable compounds.
Once the model is trained, it can be used for different tasks: synthesis prediction, retrosynthesis, molecular optimization, and property prediction. One can set the required drug characteristics as input, e.g. non-toxicity, and let the model work out a novel compound that fulfills those characteristics.
Nvidia CLARA and the Inception program
A survey on AI R&D work for health-science cannot ignore Nvidia and the GPU Technology Conference, or GTC.
GPU computing is a must, and yet the hardware is just one layer of the solution stack; on top of it sit the CUDA libraries and the application SDKs. Nvidia CLARA, the product for biosciences, is an application framework that comes with a set of libraries and tools for bioscience tasks, making optimal use of GPU computing to deal with the most time-consuming and challenging computations.
This slide deck, presented at GTC 2020, provides a great overview of genomics and drug design topics, including a Bayesian network that will be presented in the next report.
Nvidia Inception is intended to support startups, and there are many interesting companies in the bioscience space. Some are clustered around Cambridge-1 in the UK:
- Alchemab Therapeutics Ltd: using NLP and LLMs for neurodegenerative and cancer research
- InstaDeep Ltd: contributing to vaccine development
- Peptone: investigating disordered proteins
- Relation Therapeutics: doing R&D on functional genomics
Other interesting companies are:
- Atomwise: identifying the right molecule to inhibit a given protein. Here is a great interview with the founder.
The Quantum Computing (QC) holy grail: doing ab initio quantum chemistry
In 2019, at a biotech conference in Poland, I met Dr Aleksandra Kubica, CEO of the AI startup diCella, who did her PhD research on DFT and ab initio algorithms. We discussed ML approximations; however, her expert preference is for ab initio algorithms. Hence I am encouraged to go deeper into QC as the technology evolves.
At the International Supercomputing Conference 2022, there was a presentation by Prof. Coveney on hybrid computing methods, showing QC applied to a specific part of the molecular calculation that scales factorially. This is referred to as projection-based embedding. The benefit of QC is that one can model large molecules that are not tractable even on supercomputers.
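Schematically, such an embedding computes the total energy in a subtractive form; the expression below is my paraphrase of the standard wavefunction-in-DFT shape, not the exact formula from the talk. Subsystem A, the chemically active region, gets the expensive (possibly quantum-computed) treatment inside an embedding potential, while the environment B stays at the cheaper DFT level:

```latex
% Sketch of the general embedding-energy shape: the whole system A+B is
% computed at the cheap level, subsystem A is computed at the expensive
% level in an embedding potential v_emb, and the cheap-level double
% counting of A is subtracted out.
E_{\mathrm{total}} \;\approx\;
  E_{\mathrm{DFT}}[A+B] \;+\; E_{\mathrm{WF}}[A;\, v_{\mathrm{emb}}]
  \;-\; E_{\mathrm{DFT}}[A]
```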
QC is still far from production grade, but the case for QC in ML and molecular chemistry is huge: for example, AWS makes a quantum chemistry platform available within Amazon Braket. And so I am eager to learn more as it evolves. I find this Microsoft presentation on QC of great value for learning.
McKinsey and Pfizer
A Pfizer executive said: 'In discovery, I expect AI to play a part in a substantial number of new molecules. As quantum computing is adopted more widely, we’ll see discovery and development happen at a speed we can’t yet imagine.'
Isn’t it cool?
Summary
This report is not an exhaustive one. Some areas, like omics, protein engineering, Bayesian networks, and automation, are not covered and may be in the next document. Graph Neural Networks have not been addressed, although they are used to represent molecules. ADMET has only been alluded to when I spoke about enforcing drug characteristics in the AI model, but you can find more detail in the AstraZeneca slides.
I chose a few whitepapers to build the case for AI in drug design. Two approaches to drug design have been reviewed: (i) one based on Variational Autoencoders with a generative network, and (ii) one based on BART and Nvidia Megatron. New discoveries and improvements arrive at a rapid pace. Some might say we are at an inflection point, characterized by several technologies coming of age. AI is today capable of approximating ab initio algorithms and of dramatically accelerating the development pipeline. Drug design is expensive and error-prone. Quantum Computing might be the right tool to move to the next level, and quantum chemistry is one of the sweet spots of QC. However, at the time of writing, hardware issues prevent its full-fledged application. Therefore, fast and accurate approximations using Deep Learning and HPC, in the cloud or on supercomputers, are the way to go.
Topic: AI in Bio