Transformer to RNN (T2RNN)


Several research efforts have aimed to convert or compress large-scale pretrained transformer models into efficient inference models that facilitate downstream applications. This task has become important as a variety of autoregressive transformers have delivered large improvements over performance baselines across NLP applications.

Like recurrent neural networks (RNNs), such models represent the context with a fixed-size recurrent state, thereby achieving linear time and constant memory complexity in the generated sequence length. In the paper "Finetuning Pretrained Transformers into RNNs", instead of training a recurrent alternative from scratch, the authors convert a pretrained transformer into an efficient RNN with linear time and constant space complexity via a swap-then-finetune process.
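To make the "fixed-size recurrent state" idea concrete, here is a minimal numpy sketch of causal linear attention computed as an RNN. It is not the paper's implementation; the `feature_map` argument is a placeholder for whatever feature map replaces the exponential similarity. The key point is that the state `(S, z)` has a constant size, independent of how many tokens have been generated.

```python
import numpy as np

def rnn_style_attention(queries, keys, values, feature_map):
    """Causal linear attention computed recurrently: one fixed-size state,
    updated once per token, so memory stays constant in sequence length."""
    d_feat = feature_map(keys[0]).shape[0]
    d_val = values.shape[1]
    S = np.zeros((d_feat, d_val))   # running sum of outer(phi(k_i), v_i)
    z = np.zeros(d_feat)            # running sum of phi(k_i), for normalization
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        S += np.outer(phi_k, v)     # constant-size state update
        z += phi_k
        phi_q = feature_map(q)
        # equivalent to attending over all past tokens with similarity phi(q).phi(k)
        outputs.append(phi_q @ S / (phi_q @ z + 1e-6))
    return np.array(outputs)
```

At step t this produces the same output as quadratic attention with similarity phi(q_t)·phi(k_i) over positions i ≤ t, but without storing the key/value history.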

The swap-then-finetune procedure modifies the attention computation of a pretrained transformer and finetunes the model with the task objective. The researchers first replace the exponential similarity function in the attention mechanism with a single-layer MLP feature map, then finetune the MLP along with the other network parameters. The resulting model showed better results than traditional transformer architectures on some language modeling and machine translation tasks.
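The swapped similarity can be sketched as follows. This is an illustrative stand-in, not the paper's code: the feature map is a single-layer MLP phi(x) = relu(Wx + b) with assumed random initialization, and `swapped_attention_weights` is a hypothetical helper showing how attention scores become normalized dot products of features instead of softmax of exp(q·k).

```python
import numpy as np

class MLPFeatureMap:
    """Single-layer MLP feature map phi(x) = relu(W x + b).
    It stands in for the exponential similarity: instead of exp(q . k),
    the attention score becomes phi(q) . phi(k). W and b would be
    finetuned together with the rest of the network."""
    def __init__(self, d_in, d_feat, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=d_in ** -0.5, size=(d_feat, d_in))
        self.b = np.zeros(d_feat)

    def __call__(self, x):
        return np.maximum(self.W @ x + self.b, 0.0)

def swapped_attention_weights(q, keys, phi):
    """Attention weights after the swap: normalized phi(q) . phi(k_i)."""
    phi_q = phi(q)
    sims = np.array([phi_q @ phi(k) for k in keys])
    return sims / (sims.sum() + 1e-6)
```

Because the score factors into a function of q times a function of k, the sums over keys can be accumulated token by token, which is what enables the RNN-style computation above.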

I have tried to present an easy tutorial to explain it. The first part of this tutorial explains the basics of the transformer architecture [2], and the second part explains the paper "Finetuning Pretrained Transformers into RNNs" [1].

Part 1 covers the basics of the transformer architecture, which will be useful for understanding the paper "Finetuning Pretrained Transformers into RNNs".
Part 2 explains the paper "Finetuning Pretrained Transformers into RNNs".


  1. Kasai, Jungo, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A. Smith. “Finetuning Pretrained Transformers into RNNs.” arXiv preprint arXiv:2103.13076 (2021).
  2. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” arXiv preprint arXiv:1706.03762 (2017).
