Summary of “Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping (15 Feb 2020)”


The key takeaway

The random seeds used to initialise the weights and to select the order of the training data can have a significant effect on the performance of a fine-tuned pretrained language model such as BERT. An early stopping method can be used to exploit this fact by starting many experiments with different seeds and discarding the least promising ones early on.


The low-down

The performance obtained by fine-tuning pretrained language models can vary significantly with the random seeds used to initialise the weights and to select the order of the training data.

To analyse the impact of this phenomenon, the authors fine-tuned BERT on four downstream tasks from the GLUE benchmark, varying only these two random seeds and keeping all other hyperparameters fixed.

Experiment setup:

  • Fine-tuning BERT for 3 epochs, starting from the pretrained weights with a randomly initialised final classification layer (2,048 parameters)
  • 4 different datasets: 3 smaller ones (2.5k-8.6k training samples) and a larger one (67k training samples)
  • N² fine-tuning runs per dataset, covering all combinations of N seeds for weight initialisation and N seeds for data order
    (N=25 for the 3 small datasets, N=15 for the large one); see the sketch after this list

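To make the setup concrete, here is a minimal sketch (not the authors' code) of how the two kinds of seed can be controlled independently, assuming PyTorch and the HuggingFace transformers library: one seed governs the random initialisation of the classification head, and a separate generator seed governs the shuffling that determines data order. The model name, batch size and toy dataset are illustrative, and the training loop itself is elided.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification

def fine_tune(weight_init_seed, data_order_seed, train_dataset, epochs=3):
    # Weight-initialisation seed: controls the random init of the classification head.
    torch.manual_seed(weight_init_seed)
    model = BertForSequenceClassification.from_pretrained(
        "bert-large-uncased", num_labels=2)

    # Data-order seed: a dedicated generator, so the shuffling of training
    # examples depends only on this second seed.
    g = torch.Generator()
    g.manual_seed(data_order_seed)
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True, generator=g)

    # ... standard fine-tuning loop over `loader` for `epochs` epochs goes here ...
    return model

# Toy stand-in for a GLUE training set (real inputs would be tokenised sentences).
toy_data = TensorDataset(torch.randint(0, 30522, (64, 128)),
                         torch.randint(0, 2, (64,)))

# The paper's grid: every combination of N weight-initialisation seeds and
# N data-order seeds (N=25 for the small datasets, N=15 for the large one).
N = 3  # tiny N here just to keep the example cheap
for wi_seed in range(N):
    for do_seed in range(N):
        fine_tune(wi_seed, do_seed, toy_data)
```
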
Results:

  • First of all, they found that simply trying many different random seeds yielded significant improvements in BERT's performance over previously published results, in some cases making it comparable to more recent models. This led them to argue that “model comparisons that only take into account reported performance in a benchmark can be misleading”.
  • Some seeds (for both weight initialisation and data order) are statistically more likely to diverge. In addition, some weight-initialisation seeds appear to perform well across different tasks: one seed gave the best performance on two of the datasets and the second- and third-best performance on the other two, suggesting there was something inherently “good” about that weight initialisation.

Applications:

Given limited computational resources, the authors suggest that better performance can be achieved with an early stopping method: start many fine-tuning runs with different random seeds, evaluate them frequently on the validation set, and stop the least promising trials early on. In fact, they found that performance early in training was highly correlated with performance late in training. This was particularly true on the three smaller datasets, where performance measured after less than a full epoch was often a good predictor of the final result.
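
As an illustration of this procedure, here is a sketch of one simple variant with toy stand-in functions: each seed gets its own trial, trials are trained in short chunks, and after each round the least promising half (judged by a validation score) is discarded. This is the general idea rather than the paper's exact algorithm, and the function names and toy scoring (which becomes less noisy as training progresses, mimicking the finding that early performance predicts later performance) are made up for illustration.

```python
import random

# Hidden "quality" of each seed; in reality this is unknown and only
# observable through validation performance.
true_quality = {seed: random.random() for seed in range(20)}

def train_chunk(seed, state):
    # Stand-in for a short chunk of fine-tuning: just track the seed and how
    # many chunks it has been trained for so far.
    steps = 0 if state is None else state["steps"]
    return {"seed": seed, "steps": steps + 1}

def evaluate(state):
    # Stand-in validation score: noisy early on, converging towards the seed's
    # true quality as training progresses.
    noise = random.gauss(0, 0.1 / state["steps"])
    return true_quality[state["seed"]] + noise

def early_stopping_search(seeds, rounds=3, keep_frac=0.5):
    # Start one trial per seed, then repeatedly train each survivor a bit more
    # and discard the least promising fraction based on the validation score.
    states = {seed: train_chunk(seed, None) for seed in seeds}
    for _ in range(rounds):
        scores = {seed: evaluate(state) for seed, state in states.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        survivors = ranked[: max(1, int(len(ranked) * keep_frac))]
        states = {seed: train_chunk(seed, states[seed]) for seed in survivors}
    # Return the seed of the best surviving trial.
    return max(states, key=lambda s: evaluate(states[s]))

print("most promising seed:", early_stopping_search(list(range(20))))
```

In a real setting, `train_chunk` would run a fixed number of fine-tuning steps and `evaluate` would compute validation accuracy; discarding trials after each round concentrates the compute budget on the seeds most likely to end up among the best.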