Investments in artificial intelligence (AI) for drug discovery are surging. Big pharma companies are throwing big bucks at AI. Sanofi signed a $300 million deal with the Scottish AI startup Exscientia, and GSK did the same for $42 million. Also, the Silicon Valley VC firm Andreessen Horowitz launched a new $450 million bio investment fund, with one focus area being applications of AI to drug discovery.
In this craze, lots of pharma and biotech decision-makers wonder whether they should jump on the bandwagon, or wait and see.
In this post, I present the results of an independent critical analysis of several research publications in the field of AI, which suggests that AI researchers tend to overhype their achievements. This practice seems to be widespread, in my opinion. For illustration purposes, I picked research from one "big pharma" company, AstraZeneca; two academic labs, at Harvard and Stanford; and one privately held company, Insilico Medicine. Some of their research results are publicly available and suitable for evaluation.
In an innovative environment like drug discovery, access to quality innovations is key to success, since first movers enjoy a huge competitive advantage. However, it is hard to assess quality in a field as complex as data science. A good compromise is to employ independent third-party counter-expertise along the way -- to perform technical due diligence.
One such resource is Startcrowd -- an online network of AI experts and enthusiasts, well-positioned to deliver independent expertise. We tap into an emerging talent pool, educated through online courses. This keeps Startcrowd far from the conflicts of interest typical of industrial environments.
This counter-expertise is a way to prevent new disappointments with computer-aided methods. Senior executives probably remember the epic failure of rational drug design in the 1980s. At that time, big pharma companies like Merck were promising the next industrial revolution. It didn’t happen the way they anticipated.
I am optimistic that things can be different today. Not so much because of novel AI technology, but because outsourcing business models can now be improved: finding and involving stronger subject-matter experts to evaluate research is now easier in the era of online education and social networks.
Now let’s get into the technical part, with some examples where I provide subject-matter critiques of AI research, suggesting that it might be overhyped.
In this paper, AstraZeneca researchers (together with others) want to generate novel molecules using recurrent neural networks. This question is important because a creative AI should bring more diversity to the lead generation process, and ultimately, AI could substantially improve the work of human medicinal chemists.
This paper caught my attention because of the large part devoted to the evaluation of the model. It gives the appearance of depth. Various metrics are introduced, based on Tanimoto similarity and on Levenshtein distance. They provide an impressive number of visualizations, using histograms, violin plots and t-SNE.
However, all their measurements are made between AI-generated molecules on the one hand and natural molecules on the other. They always ‘omitted’ to measure the distance of AI-generated molecules from each other. This omission creates an illusion of diversity: a large distance between AI-generated and natural molecules suggests that the AI got creative and explored new directions in chemical space. Graphically, it leads us to think that we have something like this:
Real diversity: AI-generated molecules in blue, natural molecules in red.
However, if the distances between AI-generated molecules remain small, we have fallen into the trivial situation where the model generates a stream of molecules all located in the same place. No diversity is generated, and we are actually left in a situation like this:
No diversity: there is still a large distance between the AI-generated molecules (blue) and the natural molecules (red).
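To make the distinction concrete, here is a minimal sketch of the two measurements. It is my own illustration, not code from the paper: fingerprints are modeled as plain sets of "on" bits, the toy data is made up, and the function names are hypothetical. In practice one would compute fingerprints with a cheminformatics toolkit.

```python
from itertools import combinations, product
from statistics import mean


def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance (1 - similarity) between two fingerprints,
    each given as a set of 'on' bit indices."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fp_a & fp_b) / len(union)


def external_diversity(generated, reference):
    """Mean distance between generated and reference molecules.
    This is the kind of quantity the papers report."""
    return mean(tanimoto_distance(g, r) for g, r in product(generated, reference))


def internal_diversity(generated):
    """Mean pairwise distance among the generated molecules themselves.
    Needs at least two molecules. This is the missing measurement."""
    return mean(tanimoto_distance(a, b) for a, b in combinations(generated, 2))


# Made-up toy fingerprints, for illustration only.
natural = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({10, 11, 12})]
collapsed = [frozenset({20, 21, 22}), frozenset({20, 21, 22}), frozenset({20, 21, 23})]

print("external:", external_diversity(collapsed, natural))
print("internal:", internal_diversity(collapsed))
```

On this toy data the generated set is as far as possible from the natural molecules (external diversity of 1.0), while its members are nearly identical to each other (internal diversity of about 0.33). Reporting only the first number hides the collapse.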
Simply put, this AstraZeneca paper misses the elephant in the room. While these new molecules might be interesting to some extent for research purposes, this is certainly not a breakthrough achievement. The problem is still not addressed in more recent papers by other AstraZeneca researchers (here and here).
For a more technical discussion, see pages 6–7 of my paper here.
A lab at Harvard University
A research team at Harvard noticed this diversity issue. By visually looking at the output of the AI, they felt something was going wrong. They tried to do something about it, and they proposed the ORGAN model, here and here.
Their idea is to bring more chemical diversity, and chemical realism, by correcting the generator with a second neural network, called a discriminator. It penalizes the generator if the molecules look too unnatural. This idea is drawn from the literature on Generative Adversarial Networks (GANs), a hyped topic within the AI community.
This idea is interesting, but their execution is far from perfect. They conclude that their ORGAN is better, but this claim is only based on their personal visual observation, without any quantitative support (see page 3 of their paper). Their quantitative experiments don’t support their conclusion.
This was to be expected, because, like the researchers at AstraZeneca, they only compare AI-generated molecules with natural ones, and they never compare AI-generated molecules with each other.
Moreover, the way they train their model is problematic. This can be seen by looking at the log file of their training. Their discriminator always highly penalizes the generator. They have a perfect discriminator problem, which essentially nullifies any practical benefit of using a GAN.
This perfect discriminator problem might have pre-existed in the SeqGAN paper, on which ORGAN is built. This is unknown, however, because the SeqGAN team didn’t make their training log file public, and reproduction of the results is therefore problematic.
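As a rough illustration of what to look for in a training log, here is a minimal sketch of a check for a saturated discriminator. It is not taken from the ORGAN code base: the function name, the threshold, the window size and the accuracy values are all my own assumptions, and the numbers below are invented, not read from their log file.

```python
def looks_like_perfect_discriminator(accuracies, threshold=0.99, min_epochs=5):
    """Return True if the discriminator's classification accuracy stayed
    at or above `threshold` for each of the last `min_epochs` epochs.

    `accuracies` is a list of per-epoch accuracies in [0, 1]. The
    threshold and window are arbitrary, illustrative defaults.
    """
    if len(accuracies) < min_epochs:
        return False  # not enough history to judge
    return all(acc >= threshold for acc in accuracies[-min_epochs:])


# Invented example values, for illustration only.
healthy_run = [0.55, 0.60, 0.70, 0.65, 0.60, 0.62]
saturated_run = [0.80, 0.995, 1.0, 1.0, 0.999, 1.0]

print(looks_like_perfect_discriminator(healthy_run))
print(looks_like_perfect_discriminator(saturated_run))
```

When the discriminator separates real from generated samples almost perfectly, every generated molecule gets roughly the same very low score, so the generator receives almost no information about which of its outputs are better than others.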
A more technical discussion is available in my paper, pages 5–6. I tweeted my paper to Alan Aspuru-Guzik, the ORGAN team leader. He answered:
We will check out and give you comments over email or respond as adecuare. Thanks for letting us know.— Alan Aspuru-Guzik (@A_Aspuru_Guzik) August 31, 2017
The response is still pending, and hopefully will clarify the situation.
A lab at Stanford University
Stanford has a big lab dedicated to AI and deep learning for chemistry. It is led by Vijay Pande, who is also a startup investor at the Andreessen Horowitz fund. Their flagship project is MoleculeNet, a ‘benchmark specially designed for testing machine learning methods of molecular properties’. The project comes with numerous chemical compounds, lots of graphics, and lots of deep learning models. In particular, much space is devoted to graph convolutions and other chemistry-specific neural networks developed by this Stanford team.
However, there’s an elephant in this room too: the Pande team did not publish any results of plugging their data into a character-level Convolutional Neural Network. Char-CNN is a classical model used in AI for text processing, and it is much simpler than their chemistry-specific models. To use char-CNN, it suffices to represent molecules as SMILES text strings.
Why didn’t they do it? On page 17 of their paper, we can read:
“Recent work has demonstrated the ability to learn useful representations from SMILES strings using more sophisticated methods, so it may be feasible to use SMILES strings for further learning tasks in the near future.”
However, they use char-CNN in another paper. The reasons they did not publish results of any comparison with char-CNN in the above case are unclear.
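To show how little preprocessing a character-level model needs, here is a minimal sketch of turning SMILES strings into fixed-length integer sequences, which is the usual input to a char-CNN. This is my own illustration with hypothetical function names, not code from MoleculeNet or DeepChem. Note that plain character splitting breaks two-letter atoms such as Cl and Br into two tokens; I accept that here for simplicity.

```python
def build_vocabulary(smiles_list):
    """Map every character seen in the SMILES strings to an integer id.
    Id 0 is reserved for padding."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {ch: index + 1 for index, ch in enumerate(chars)}


def encode(smiles, vocab, max_len):
    """Encode one SMILES string as a list of `max_len` integer ids,
    truncating long strings and right-padding short ones with 0."""
    ids = [vocab[ch] for ch in smiles[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Ethanol, benzene and acetic acid as SMILES strings.
molecules = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocabulary(molecules)
encoded = [encode(smiles, vocab, max_len=8) for smiles in molecules]

for smiles, ids in zip(molecules, encoded):
    print(smiles, ids)
```

The resulting integer sequences can be fed to an embedding layer followed by one-dimensional convolutions in plain TensorFlow or PyTorch, with no chemistry-specific featurization.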
It should be noted that MoleculeNet is closely related to the DeepChem package, which implements all the MoleculeNet models. DeepChem is open-source and Stanford-led. If char-CNNs are better than graph-CNNs, then practitioners don’t need to adopt DeepChem. They can simply stay with plain TensorFlow or PyTorch.
I tried to use DeepChem in my project, until I realized that I couldn’t mix DeepChem models and non-DeepChem models. That would have been useful for adversarial training, with a DeepChem discriminator and a non-DeepChem generator. Instead, I got completely locked into DeepChem code, which was puzzling. In order to get out of this trap, and to make DeepChem really open, I had to dig into complex code (my cleaned-up fork of DeepChem is here). That would be much more difficult to do for a more mature project. So my impression was one of a technology lock-in.
Insilico Medicine is a pioneer in generative models among AI startups. In this paper they propose ‘DruGAN, an advanced generative adversarial autoencoder model’. In my opinion, this statement is overstated.
The model suffers from the same flaws as the other generative models, which may limit its innovative potential when applied to drug discovery projects.
It does not seem to be advanced with respect to Variational Auto-Encoders (VAE) either. In the paper, the authors claim that DruGAN is better than VAE, but on GitHub, one DruGAN author acknowledges the opposite:
Actually, we didn’t tune VAE network as much as AAE, so it isn’t very fair comparison. I mean that one can introduce upgrades for VAE and outperform our AAE.
All-in-all, it seems that DruGAN is only advanced with respect to the results of their previous paper, published 8 months earlier. The improvement of the system versus the general state-of-the-art is not obvious.
In conclusion, many researchers in AI for drug discovery tend, in my opinion, to overhype their results. I believe that the actual value that AI-driven companies claim to offer their pharma and biotech customers needs to be assessed case by case by external experts. Such external expertise should be an important component of outsourcing business models for the biopharmaceutical industry.