Once we have our model loaded, we need to choose the training hyperparameters for fine-tuning. The authors recommend picking from the following values (from Appendix A.3 of the BERT paper); a short sketch applying the chosen values follows the list:

  • Batch size: 16, 32 (we use 32)
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5 (we use 2e-5)
  • Number of epochs: 2, 3, 4 (we use 6, to see why the authors recommend fewer)
  • Epsilon parameter: 1e-8 (the default)

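Here is a minimal sketch of how these values could be wired up, assuming `model` is the BERT model loaded earlier (the variable name is illustrative, not from the original code):

```python
import torch
from torch.optim import AdamW

batch_size = 32       # chosen from {16, 32}
learning_rate = 2e-5  # chosen from {5e-5, 3e-5, 2e-5}
epochs = 6            # the paper recommends 2-4; we go higher to see the effect
adam_epsilon = 1e-8   # default value

# AdamW optimizer over all trainable parameters of the loaded model
optimizer = AdamW(model.parameters(), lr=learning_rate, eps=adam_epsilon)
```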
You can find the creation of the AdamW optimizer in run_glue.py here. It’s recommended to use a scheduler to decay the learning rate over training: larger steps at the beginning and smaller ones once the model has learned something, so we don’t overshoot the optimum.
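A sketch of such a schedule, using the linear warmup/decay scheduler from the transformers library and assuming `train_dataloader`, `epochs`, and `optimizer` are defined as above (names are illustrative):

```python
from transformers import get_linear_schedule_with_warmup

# Total number of optimizer steps = batches per epoch * number of epochs
total_steps = len(train_dataloader) * epochs

# Learning rate decays linearly from its initial value down to 0;
# no warmup steps are used in this sketch.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps,
)

# Inside the training loop, step the scheduler right after the optimizer:
#   optimizer.step()
#   scheduler.step()
```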