Several research efforts have aimed to convert or compress large-scale pretrained transformer models into efficient inference models that facilitate downstream applications. This task has become important as a variety of autoregressive transformers have substantially improved performance baselines across NLP applications.
Similar to recurrent neural networks (RNNs), such models represent the context with a fixed-size recurrent state, thereby achieving linear time and constant memory complexity in the generated sequence length. In the paper “Finetuning Pretrained Transformers into RNNs”, instead of training a recurrent alternative from scratch, the authors convert a pretrained transformer into an efficient RNN with linear time and constant space complexity via a swap-then-finetune process.
The swap-then-finetune procedure modifies the attention computation of a pretrained transformer and finetunes the model with the task objective. The researchers first replace the exponential similarity function in the attention mechanism with a single-layer MLP feature map, then finetune the MLP and the other network parameters. The resulting model achieved results competitive with standard transformer architectures on language modeling and machine translation tasks, at a fraction of the inference cost.
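To build intuition for why replacing the exponential similarity with a feature map yields an RNN, here is a minimal NumPy sketch of causal linear attention computed recurrently. Note this is an illustration, not the paper's implementation: in place of the paper's learned single-layer MLP feature map, it uses a fixed ELU-based map (`elu_feature_map`, a stand-in popularized by earlier linear-attention work), and the function and variable names are my own.

```python
import numpy as np

def elu_feature_map(x):
    # Stand-in for T2R's learned single-layer MLP feature map:
    # ELU(x) + 1, which keeps features positive so the normalizer is valid.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def rnn_style_attention(queries, keys, values):
    """Causal linear attention computed as an RNN.

    The state S (d_k x d_v) and normalizer z (d_k) have fixed size,
    so each step costs O(d_k * d_v) time and the memory does not grow
    with the sequence length.
    """
    d_k, d_v = keys.shape[-1], values.shape[-1]
    S = np.zeros((d_k, d_v))  # running sum of outer(phi(k_s), v_s)
    z = np.zeros(d_k)         # running sum of phi(k_s)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = elu_feature_map(k)
        S += np.outer(phi_k, v)   # constant-size state update
        z += phi_k
        phi_q = elu_feature_map(q)
        # Equivalent to attention over all past steps with similarity
        # phi(q) . phi(k) instead of exp(q . k / sqrt(d)).
        outputs.append(phi_q @ S / (phi_q @ z + 1e-6))
    return np.stack(outputs)
```

Because the similarity factorizes as `phi(q) . phi(k)`, the sums over past positions can be accumulated once into `S` and `z`, which is exactly the fixed-size recurrent state mentioned above; softmax attention cannot be rewritten this way because `exp(q . k)` does not factorize.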
I have tried to present an easy tutorial to explain it. The first part of this tutorial explains the basics of the transformer architecture, and the second part explains the paper “Finetuning Pretrained Transformers into RNNs”.
- Kasai, Jungo, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A. Smith. “Finetuning Pretrained Transformers into RNNs.” arXiv preprint arXiv:2103.13076 (2021).
- Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” arXiv preprint arXiv:1706.03762 (2017).