An effective eval_strategy for the Hugging Face Trainer
The only options you get for eval_strategy in the Hugging Face Trainer are:
"no"
: No evaluation is done during training."steps"
: Evaluation is done (and logged) everyeval_steps
."epoch"
: Evaluation is done at the end of each epoch.
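For reference, here is a minimal sketch of how these options map onto TrainingArguments in Python (the values and output_dir below are illustrative; note that older transformers releases spell the argument evaluation_strategy instead of eval_strategy):

```python
from transformers import TrainingArguments

# Illustrative values only; on older transformers versions the argument
# is named evaluation_strategy rather than eval_strategy.
args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",  # one of "no", "steps", "epoch"
    eval_steps=200,         # only consulted when eval_strategy="steps"
)
```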
Problem with default options
The number of rows in a dataset is usually an arbitrary number and is rarely perfectly divisible by a round number like 100 or 200. So choosing "steps" as the strategy and setting eval_steps to, say, 200 will evaluate every 200 steps, but it won't evaluate at the end of each epoch. And that's where the sweet spot lies in most training runs.
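A quick illustration with made-up numbers: with 9,500 training samples and an effective batch size of 16, one epoch is ceil(9500 / 16) = 594 optimizer steps, so with eval_steps = 200 you evaluate at steps 200 and 400 but never at the epoch boundary at step 594.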
When fine-tuning LLMs on the order of 1B, 3B, or 7B parameters, on datasets of roughly 5k-20k samples, it's common to observe that the best-performing checkpoint falls at the end of epoch 1 or 2, while by epoch 3 the model starts overfitting the dataset.
So it's very desirable to evaluate the model at the end of each epoch. But setting eval_strategy to "epoch" alone won't give you enough data points in your eval graph.
Solution
The solution is purely mathematical and leverages the fact that eval_steps can also be set to a fraction of the total training steps.
Here's the formula I recommend for calculating the optimum eval_steps, so that evaluation is frequent enough while also capturing the end of each epoch:
eval_steps = 1 / (number_of_evals_needed_per_epoch × number_of_epochs) − 0.0001
The 0.0001 is subtracted to prevent the eval step near the end of the last epoch from going over 100% because of step rounding caused by GPU distribution and batching. It might be redundant though; please let me know in the comments if it is.
Examples:
number_of_evals_needed_per_epoch = 12, number_of_epochs = 3
eval_steps = 1 / (12*3) − 0.0001 => 0.0276
number_of_evals_needed_per_epoch = 12, number_of_epochs = 2
eval_steps = 1 / (12*2) − 0.0001 => 0.0415
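If you'd rather compute this programmatically, here is a small helper that applies the formula above (the function name is my own, not part of transformers):

```python
def fractional_eval_steps(evals_per_epoch: int, num_epochs: int) -> float:
    """Return eval_steps as a fraction of total training steps."""
    return 1 / (evals_per_epoch * num_epochs) - 0.0001

print(fractional_eval_steps(12, 3))  # 0.02767... (good for 3 epochs)
print(fractional_eval_steps(12, 2))  # 0.04156... (good for 2 epochs)
```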
Recommended config for experiment
do_train: True
do_eval: True
evaluation_strategy: steps
logging_strategy: steps
save_strategy: steps
logging_steps: 1
eval_steps: 0.0415 # good for 2 epochs
save_steps: 0.0415 # good for 2 epochs
# eval_steps: 0.0276 # good for 3 epochs
# save_steps: 0.0276 # good for 3 epochs
load_best_model_at_end: True
save_total_limit: 4 # last 3 checkpoints + best checkpoint are saved
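And if you configure the Trainer directly in Python rather than through a YAML config, the equivalent would look roughly like this (a sketch; output_dir, batch size, and num_train_epochs are assumptions I've added, and older transformers versions use evaluation_strategy instead of eval_strategy):

```python
from transformers import TrainingArguments

# Rough Python equivalent of the config above; output_dir, batch size,
# and num_train_epochs are placeholder assumptions.
args = TrainingArguments(
    output_dir="out",
    do_train=True,
    do_eval=True,
    eval_strategy="steps",
    logging_strategy="steps",
    save_strategy="steps",
    logging_steps=1,
    eval_steps=0.0415,              # good for 2 epochs
    save_steps=0.0415,              # keep equal to eval_steps
    load_best_model_at_end=True,
    save_total_limit=4,             # last 3 checkpoints + best checkpoint
    num_train_epochs=2,
    per_device_train_batch_size=8,
)
```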
Hope this is helpful for anyone looking for a way to evaluate at both regular steps and epoch boundaries.