Understanding the Innovations From LSTM, to an Encoder/Decoder Model, to Transformers; and Their Impact on Health Science

by Joseph Pareti, Biopharma Insights

Disclaimer: All opinions expressed by Contributors are their own and do not represent those of their employers, or BiopharmaTrend.com.
Contributors are fully responsible for assuring they own any required copyright for any content they submit to BiopharmaTrend.com. This website and its owners shall not be liable for information and content submitted for publication by Contributors, nor for its accuracy.

Topics: Emerging Technologies   

AstraZeneca has achieved outstanding results in drug design using large language models applied to SMILES representation of molecules; but what are the steps to understand how this is possible?  

In this report, I describe my work on an LSTM-based encoder/decoder model and on transformers. I would like to show that these technologies are related, that they can be learnt starting with a simple case, and that they are relevant not only for NLP but also for health science.

On a simple sequence-to-sequence problem, an encoder/decoder model outperforms a plain LSTM model, and a further improvement comes from teacher forcing at the cost of added complexity. Next, one needs to understand embeddings: non-contextual embeddings, as in word2vec, and contextual BERT embeddings. Finally, one needs to understand the paper ‘Attention is all you need’, which introduced the transformer architecture that is state-of-the-art for NLP tasks.

The sources I use include video tutorials, experimentation with Python code on LSTM, a hands-on training course on open-source BERT-based models, and an LDA model.

I hope my work can be useful to others who want to gain a deeper understanding of NLP for health sciences.


Encoder/Decoder Models for a Sequence-to-Sequence Case

This exercise helps in understanding time-dependent neural networks, LSTM, and an encoder/decoder model using a sequence-to-sequence example.

The training videos and code are the intellectual property of Prof. Karakaya and are available on YouTube and as Colab notebooks: [1], [2], [3]. Details on my experimentation are in [3a].

Given a fixed-length sequence, the goal of the model is to predict the reverse sequence.
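To make the task concrete, here is a minimal sketch of how one training pair for the reversal problem could be generated and one-hot encoded. It is not the code from [1]; the function names, the sequence length and the number of symbols are assumptions chosen for illustration.

```python
import random

def make_pair(seq_len, n_symbols, rng):
    """Return (source, target) where target is the source sequence reversed."""
    source = [rng.randrange(n_symbols) for _ in range(seq_len)]
    return source, source[::-1]

def one_hot(seq, n_symbols):
    """Encode each integer token as a one-hot vector of length n_symbols."""
    return [[1 if i == tok else 0 for i in range(n_symbols)] for tok in seq]

rng = random.Random(0)        # fixed seed so that runs are repeatable
SEQ_LEN, N_SYMBOLS = 4, 10    # hypothetical sizes, chosen for illustration

source, target = make_pair(SEQ_LEN, N_SYMBOLS, rng)
x = one_hot(source, N_SYMBOLS)   # model input
y = one_hot(target, N_SYMBOLS)   # expected model output
print(source, "->", target)
```

The model never sees the reversal rule itself; it only sees many such (x, y) pairs and has to learn the mapping.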

In [1], the task is done using 2 LSTM layers connected together, and a final dense layer that predicts the highest-probability outcome using softmax. The model is implemented in TensorFlow/Keras, which makes it easy to define and modify the data flow. It is shown that the predictive performance increases as the information transfer across the 2 LSTM layers increases, while the model size is kept constant. The best results are obtained when the second LSTM layer receives the hidden states for each time step of the first layer and is initialized with the last cell state of the first layer: this is intuitive. It is less obvious that better performance is achieved without increasing the number of free parameters. The exercise shows that information transfer is important, and this is also the case for transformers.

In [2], the model implements an encoder and a decoder, where the encoder is an LSTM layer that creates a context vector. The decoder is also implemented as an LSTM layer; it is initialized with the context vector and generates one character at a time. In a loop, the decoder is supplied with a token for each successive time step, starting with a START token and continuing with the token decoded in the previous time step together with the state vector of that step, until a STOP condition is met. At each step the decoder feeds its output to the dense layer and softmax for the probability calculation.
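The decoding loop can be sketched in plain Python. This is only an illustration of the control flow, not the Keras code from [2]: `toy_decoder_step` is a hypothetical stand-in for one LSTM step plus dense layer and softmax, and the START/STOP marker strings are assumptions.

```python
START, STOP = "<s>", "</s>"   # hypothetical marker tokens

def toy_decoder_step(token, state):
    """Stand-in for one decoder step (LSTM + dense + softmax).

    Here `state` is simply the list of characters still to be emitted; a real
    decoder would carry hidden and cell state tensors instead.
    """
    if not state:
        return STOP, state
    return state[0], state[1:]

def greedy_decode(context, step_fn, max_len=20):
    """Generate tokens one at a time until STOP is produced or max_len is reached."""
    token, state, output = START, context, []
    for _ in range(max_len):
        # Feed the previously decoded token and the previous state.
        token, state = step_fn(token, state)
        if token == STOP:
            break
        output.append(token)
    return output

# Toy "encoder": the context is just the reversed input sequence.
context = list(reversed("abcd"))
print(greedy_decode(context, toy_decoder_step))
```

The `max_len` guard matters in practice: a trained decoder is not guaranteed to emit STOP, so the loop needs a second exit condition.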

The encoder/decoder model delivers better performance than the simple LSTM model in [1]. I see it as a stepping stone towards transformers, because this architecture is shared by more advanced models.

A further enhancement, described in [3], augments the encoder/decoder model with teacher forcing. During training, the decoder is provided with the correct value in lieu of the last-generated value. At inference, however, the decoder is provided with the last decoded output as the input for the current time step. This means there are 2 model configurations, one for training and one for inference, which makes the setup more complicated. Is the better performance worth the added complexity over a plain encoder/decoder model?

For additional insights on teacher forcing one can use [10].
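The difference between the two modes comes down to what the decoder receives as input at each time step. The sketch below shows only that input construction; the helper names, the START token and the example prediction are made up for illustration and are not taken from [3].

```python
START = "<s>"   # hypothetical start-of-sequence token

def teacher_forced_inputs(target):
    """Training: START followed by the true target, shifted right by one step."""
    return [START] + target[:-1]

def free_running_inputs(predictions):
    """Inference: START followed by the model's own earlier outputs."""
    return [START] + predictions[:-1]

target      = ["d", "c", "b", "a"]   # ground truth (reverse of "abcd")
predictions = ["d", "x", "b", "a"]   # made-up model output with one mistake

print(teacher_forced_inputs(target))     # third step sees the correct "c"
print(free_running_inputs(predictions))  # third step sees the wrong "x"
```

With teacher forcing, an early mistake does not contaminate the inputs of later steps during training, which is why training tends to be more stable; at inference no ground truth exists, so the model has to consume its own outputs.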


Transformers and BERT

The simplest embeddings are determined using word2vec or GloVe. These are context-independent [4] but provide a significant advantage over a one-hot-encoded word vector representation because they carry a notion of word distance in the latent space. These embeddings also have significantly fewer dimensions than a one-hot encoding, whose length equals the vocabulary size.
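A small example shows why distance in the latent space matters. The dense vectors below are made-up 2-D numbers, not real word2vec or GloVe output, and the three-word vocabulary is an assumption for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["king", "queen", "river"]

# One-hot vectors: one dimension per vocabulary word.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Made-up dense vectors (NOT real word2vec/GloVe values).
dense = {"king": [0.9, 0.1], "queen": [0.8, 0.2], "river": [0.1, 0.9]}

# Any two distinct one-hot vectors are orthogonal, so all words look equally unrelated.
print(cosine(one_hot["king"], one_hot["queen"]))

# Dense vectors can place related words closer together than unrelated ones.
print(cosine(dense["king"], dense["queen"]), cosine(dense["king"], dense["river"]))
```

In a real model the dense vectors are learned from text rather than written by hand, but the comparison works the same way.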

To make embeddings context-aware, RNNs and LSTMs have been used, but the temporal dependency among tokens makes the training serial and hence inefficient. They are also negatively affected by vanishing gradients as the sentence length increases. A reverse (right-to-left) LSTM can be used to detect context that cannot otherwise be detected, e.g. a left-to-right LSTM cannot disambiguate the word bank in a sentence like ‘the bank of the river’ because the word river occurs after the word bank.

LSTMs support context-dependent embeddings and are the precondition for further progress, which is why I focused a lot on [1], [2], [3].

Transformers are the next innovation: each word in a sentence attends to every other word. Self-attention is built from query, key, and value vectors derived from the embeddings. It contains learned 2D weight matrices that are applied to the embeddings to obtain the query and key vectors, on which a dot product is computed (slides #37-#50 of [4]). The dot product results go into softmax to obtain attention weights, which are then applied to the value vectors to produce a context-aware representation of each token. Similarly to the multiple layers in a CNN, each with its own filters, one defines multiple attention heads, each with its own parameters that detect specific facts in the input. The outputs from the attention heads are concatenated and passed through a dense layer.
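The query/key/value computation can be written out for a single attention head in a few lines of plain Python. This is a minimal sketch of scaled dot-product self-attention as defined in ‘Attention is all you need’, not code from [4]; the input vectors and the weight matrices `w_q`, `w_k`, `w_v` are hand-picked stand-ins for parameters that would normally be learned.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x has one row per token; returns (outputs, attention weights).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]                      # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]                # one row per token
    return matmul(weights, v), weights                        # weighted sum of values

# Three tokens with 2-D embeddings (toy values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for learned matrices
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[1.0, 2.0], [3.0, 4.0]]

out, weights = self_attention(x, w_q, w_k, w_v)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token draws on every token in the sentence, itself included. A multi-head layer runs several such computations with different matrices and combines the results.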

Transformers can be parallelized along the sentence because the temporal dependency of the RNN is replaced by the attention algorithm. The complexity is O(N**2), where N is the number of words in a sentence, because each word attends to every other word. Parallelization along the sentence is enabled by GPUs.
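The quadratic term can be read off the standard scaled dot-product attention formula from ‘Attention is all you need’. The symbols d_k (the dimension of the query and key vectors) and d_v (the dimension of the value vectors) are notation introduced here, not used elsewhere in this article.

```latex
\[
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
Q, K \in \mathbb{R}^{N \times d_k},\;
V \in \mathbb{R}^{N \times d_v}
\;\Longrightarrow\;
Q K^{\top} \in \mathbb{R}^{N \times N}
\]
```

The score matrix has one entry per pair of words, hence N squared entries, but all of them can be computed independently of each other, which is what makes the computation parallelizable.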
