While much effort has been devoted to optimising decoder-only transformers, thus neglecting the encoder, we believe it is essential to maintain an encoder-decoder architecture.
Indeed, this architecture offers strong performance for instruction tuning, lends itself well to distillation, and appears superior to decoder-only models when finetuned. It has also been shown that encoder-decoder models trained with masked language modelling achieve better zero-shot performance after multitask finetuning than decoder-only models.
Beyond NLP, which is the focus of this blog post, the encoder-decoder architecture is widely used in other fields such as audio and time series. The encoder part of such an architecture is also used in some diffusion models.
That's why we've decided to focus on the T5.

This article presents the optimisations we have implemented to efficiently pre-train a 147M-parameter T5 in French in a reasonable time (1,461 hours for 419B tokens) and with limited resources (a single A100, i.e. a computing budget of around 2,200 euros). To achieve this, we designed CUDA/Triton kernels to make Flash Attention compatible with T5 and to provide linear-memory inference, thus extending the context size the model can handle. All the optimisations applied are detailed on Hugging Face.
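To make the attention point concrete, here is a minimal, naive PyTorch sketch of T5-style attention (not the fused CUDA/Triton kernel described in the article, and with purely illustrative names and shapes): the additive relative position bias added to the attention scores is the term that stock Flash Attention kernels did not handle for T5, which is why a custom kernel is needed. The naive version below also materialises the full seq_len × seq_len score matrix, which is exactly what a fused kernel avoids.

```python
import torch
import torch.nn.functional as F

def t5_attention_with_bias(q, k, v, position_bias):
    # q, k, v: (batch, heads, seq_len, head_dim)
    # position_bias: (1, heads, seq_len, seq_len), T5's learned relative position bias
    scores = torch.matmul(q, k.transpose(-1, -2))  # raw scores (T5 applies no 1/sqrt(d) scaling)
    scores = scores + position_bias                # the additive bias a standard Flash Attention kernel lacks
    weights = F.softmax(scores.float(), dim=-1).type_as(scores)
    return torch.matmul(weights, v)                # (batch, heads, seq_len, head_dim)

# Toy usage with illustrative shapes
b, h, n, d = 2, 8, 128, 64
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
bias = torch.randn(1, h, n, n)
out = t5_attention_with_bias(q, k, v, bias)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```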


The pre-training code is available in our GitHub repository under the Apache-2.0 license, and the weights of the trained model on our Hugging Face account.


Read the full article on Hugging Face.