An effective eval_strategy for the Hugging Face Trainer
The only options you get for eval_strategy in the Hugging Face Trainer are:
"no"
: No evaluation is done during training."steps"
: Evaluation is done (and logged) everyeval_steps
."epoch"
: Evaluation is done at the end of each epoch.
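For reference, here is a minimal sketch of how these options map onto TrainingArguments in Python (the values and output_dir below are illustrative; note that older transformers releases spell the argument evaluation_strategy instead of eval_strategy):

```python
from transformers import TrainingArguments

# Illustrative values only; on older transformers versions the argument
# is named evaluation_strategy rather than eval_strategy.
args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",  # one of "no", "steps", "epoch"
    eval_steps=200,         # only consulted when eval_strategy="steps"
)
```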
Problem with default options
The number of rows in a dataset is usually an arbitrary number and is rarely perfectly divisible by a round number like 100 or 200. So choosing "steps" as the strategy and setting eval_steps to, say, 200 will evaluate every 200 steps, but it won't evaluate at the end of each epoch. And that's where the sweet spot lies in most training runs.
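A quick illustration with made-up numbers: with 9,500 training samples and an effective batch size of 16, one epoch is ceil(9500 / 16) = 594 optimizer steps, so with eval_steps = 200 you evaluate at steps 200 and 400 but never at the epoch boundary at step 594.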
When fine-tuning LLMs on the order of 1B, 3B, or 7B parameters, on datasets of roughly 5k-20k samples, it's common to observe that the best-performing checkpoint falls at the end of epoch 1 or 2, while by epoch 3 the model starts overfitting the dataset.
So it's very desirable to evaluate the model at the end of each epoch. But setting eval_strategy to "epoch" alone won't give you enough data points in your eval graph.
Solution
The solution is purely mathematical and leverages the fact that eval_steps can also be set to a fraction of the total training steps.
Here's the formula I recommend for calculating the optimum eval_steps, so that evaluation is frequent enough while also capturing the end of each epoch:
eval_steps = 1 / (number_of_evals_needed_per_epoch × number_of_epochs) − 0.0001
The 0.0001 is subtracted to prevent the eval step near the end of the last epoch from going over 100% because of step rounding caused by GPU distribution and batching. It might be redundant though; please let me know in the comments if it is.
Examples:
number_of_evals_needed_per_epoch = 12, number_of_epochs = 3
eval_steps = 1 / (12*3) − 0.0001 => 0.0276
number_of_evals_needed_per_epoch = 12, number_of_epochs = 2
eval_steps = 1 / (12*2) − 0.0001 => 0.0415
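If you'd rather compute this programmatically, here is a small helper that applies the formula above (the function name is my own, not part of transformers):

```python
def fractional_eval_steps(evals_per_epoch: int, num_epochs: int) -> float:
    """Return eval_steps as a fraction of total training steps."""
    return 1 / (evals_per_epoch * num_epochs) - 0.0001

print(fractional_eval_steps(12, 3))  # 0.02767... (good for 3 epochs)
print(fractional_eval_steps(12, 2))  # 0.04156... (good for 2 epochs)
```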
Recommended config for experiment
do_train: True
do_eval: True
evaluation_strategy: steps
logging_strategy: steps
save_strategy: steps
logging_steps: 1
eval_steps: 0.0415 # good for 2 epochs
save_steps: 0.0415 # good for 2 epochs
# eval_steps: 0.0276 # good for 3 epochs
# save_steps: 0.0276 # good for 3 epochs
load_best_model_at_end: True
save_total_limit: 4 # last 3 checkpoints + best checkpoint are saved
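And if you configure the Trainer directly in Python rather than through a YAML config, the equivalent would look roughly like this (a sketch; output_dir, batch size, and num_train_epochs are assumptions I've added, and older transformers versions use evaluation_strategy instead of eval_strategy):

```python
from transformers import TrainingArguments

# Rough Python equivalent of the config above; output_dir, batch size,
# and num_train_epochs are placeholder assumptions.
args = TrainingArguments(
    output_dir="out",
    do_train=True,
    do_eval=True,
    eval_strategy="steps",
    logging_strategy="steps",
    save_strategy="steps",
    logging_steps=1,
    eval_steps=0.0415,              # good for 2 epochs
    save_steps=0.0415,              # keep equal to eval_steps
    load_best_model_at_end=True,
    save_total_limit=4,             # last 3 checkpoints + best checkpoint
    num_train_epochs=2,
    per_device_train_batch_size=8,
)
```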
Hope this is helpful for anyone looking for a way to evaluate at both regular steps and epoch boundaries.