Efficient and Scalable Transfer Learning for Natural Language Processing
Author: Kevin Stefan Clark
Release: 2021
ISBN-10: OCLC:1243112741
Book excerpt: Neural networks work best when trained on large amounts of data, but most labeled datasets in natural language processing (NLP) are small. As a result, neural NLP models often overfit to idiosyncrasies and artifacts in their training data rather than learning generalizable patterns. Transfer learning offers a solution: instead of learning a single task from scratch and in isolation, the model can benefit from the wealth of text on the web or from other tasks with rich annotations. This additional data enables the training of bigger, more expressive networks. However, it also dramatically increases the computational cost of training, with recent models taking up to hundreds of GPU-years to train. To alleviate this cost, I develop transfer learning methods that learn much more efficiently than previous approaches while remaining highly scalable.

First, I present a multi-task learning algorithm based on knowledge distillation that consistently improves over single-task training, even when learning many diverse tasks. Next, I develop Cross-View Training, which revitalizes semi-supervised learning methods from the statistical era of NLP (self-training and co-training) while taking advantage of neural methods. The resulting models outperform pre-trained LSTM language models such as ELMo while training 10x faster. Lastly, I present ELECTRA, a self-supervised pre-training method for transformer networks based on energy-based models. ELECTRA learns 4x--10x faster than previous approaches such as BERT, achieving excellent performance on natural language understanding tasks both when trained at large scale and when trained on a single GPU.
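The knowledge-distillation objective behind the multi-task method mentioned above can be illustrated with a minimal sketch: the student is trained against a blend of the teacher's softened output distribution and the gold label. The function names, the temperature of 2.0, and the 0.5 mixing weight here are illustrative assumptions, not details taken from the thesis.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with optional temperature scaling.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    # Soft loss: cross-entropy between the teacher's softened
    # distribution and the student's, scaled by T^2 as is conventional.
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    soft = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    # Hard loss: standard cross-entropy against the gold label.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard
```

A higher temperature spreads the teacher's probability mass over more classes, exposing the "dark knowledge" in its relative preferences among wrong answers.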
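Cross-View Training's core idea, auxiliary prediction modules that see only restricted views of the input (for example, only the left context) and are trained to match the full model's prediction on unlabeled data, can be sketched as a simple consistency loss. The function name and setup below are illustrative, not from the thesis.

```python
import math

def cvt_consistency_loss(full_view_probs, partial_view_probs):
    # Average cross-entropy of each restricted-view module's prediction
    # against the full model's distribution, which is treated as a
    # fixed target (no gradient flows through it during training).
    loss = 0.0
    for probs in partial_view_probs:
        loss += -sum(p * math.log(q) for p, q in zip(full_view_probs, probs))
    return loss / len(partial_view_probs)
```

Minimizing this loss pushes the restricted-view modules toward the full model's outputs, which in turn improves the shared representation they are all built on, a neural analogue of self-training.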
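ELECTRA's publicly described pre-training task is replaced token detection: some input tokens are swapped for plausible alternatives, and a discriminator predicts, for every position, whether the token is original or replaced. A minimal data-construction sketch follows; it uses uniform random replacements for simplicity, whereas ELECTRA proper samples replacements from a small masked-LM generator, and `corrupt_tokens` and its parameters are hypothetical names, not the thesis's API.

```python
import random

def corrupt_tokens(tokens, vocab, replace_prob=0.15, seed=0):
    # Replace a fraction of tokens and record binary "was replaced"
    # labels: the discriminator's targets in replaced token detection.
    # (Assumption: uniform sampling stands in for ELECTRA's generator.)
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            replacement = rng.choice([v for v in vocab if v != tok])
            corrupted.append(replacement)
            labels.append(1)  # token was replaced
        else:
            corrupted.append(tok)
            labels.append(0)  # token is original
    return corrupted, labels
```

Because the discriminator receives a training signal at every position rather than only at the ~15% of masked positions, this objective extracts more learning per example, which is one intuition for the 4x--10x speedup over BERT-style masked language modeling claimed above.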