Hello support,
we are attempting to move from a single-GPU to a multi-GPU training environment.
We are fine-tuning a Citrinet-1024 model for speech recognition.
We ran a first fine-tuning session on a single-GPU machine (one V100 with 16 GB of memory); now we are moving to a new machine with 4 GPUs (four T4s with 16 GB of memory each).
That first training session used a batch_size of 16 and a learning rate of 0.025.
The script we prepared for multi-GPU fine-tuning performs the following tasks:
- loads the pre-trained model
- changes some configuration parameters, especially the learning rate (due to the increased number of GPUs; a sketch of these first two steps follows the list)
- instantiates the Trainer object:
import pytorch_lightning as pl

gpuN = 4
epochs = 300
accelerator_mode = "ddp"        # DistributedDataParallel, one process per GPU
withLogger = False              # no experiment logger
withCheckpointCallback = False  # no automatic checkpointing

trainer = pl.Trainer(gpus=gpuN, max_epochs=epochs, accelerator=accelerator_mode,
                     logger=withLogger, checkpoint_callback=withCheckpointCallback)
- starts the training
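For reference, here is a minimal sketch of the first two steps. The checkpoint path is ours, and the linear LR-scaling rule (single-GPU learning rate times number of GPUs) is our own assumption, not something NeMo prescribes:

import nemo.collections.asr as nemo_asr

# Load the pre-trained Citrinet-1024 checkpoint (the path is ours)
model = nemo_asr.models.EncDecCTCModelBPE.restore_from("citrinet_1024_finetuned.nemo")

# Scale the learning rate linearly with the number of GPUs:
# 0.025 on 1 GPU -> 0.1 on 4 GPUs (our assumption of the linear scaling rule).
# Depending on the NeMo version, calling model.setup_optimization() with an
# updated optim config may be needed instead of mutating cfg in place.
gpuN = 4
model.cfg.optim.lr = 0.025 * gpuN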
In order to benefit from the increased hardware capacity, we intended to keep the per-GPU batch size at 16, thus obtaining an effective batch size of 64 (16 x 4 GPUs), but we get an OOM error.
We tried decreasing the per-GPU batch size; the largest value that avoids the OOM error is 12.
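If 12 really is the per-GPU ceiling, one thing we are considering is gradient accumulation to recover the effective batch size of 64, using PyTorch Lightning's accumulate_grad_batches flag. A sketch (the specific values are just one combination that multiplies out to 64):

import pytorch_lightning as pl

# 8 samples per GPU x 4 GPUs x 2 accumulation steps = effective batch of 64,
# with the peak memory footprint of a per-GPU batch of only 8
# (batch_size=8 itself would be set in the model's train dataloader config)
trainer = pl.Trainer(gpus=4, max_epochs=300, accelerator="ddp",
                     accumulate_grad_batches=2, logger=False,
                     checkpoint_callback=False)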
Observing the output of the nvidia-smi command while training is running, we see that GPU 0 has more memory allocated than the other three, so maybe it is bottlenecking the others and causing the OOM.
Thu Oct 28 10:23:36 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000001:00:00.0 Off |                    0 |
| N/A   68C    P0    63W /  70W |  14623MiB / 15109MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000002:00:00.0 Off |                    0 |
| N/A   65C    P0    67W /  70W |  13232MiB / 15109MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000003:00:00.0 Off |                    0 |
| N/A   72C    P0    72W /  70W |  13226MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000004:00:00.0 Off |                    0 |
| N/A   65C    P0    69W /  70W |  13282MiB / 15109MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
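One hypothesis we want to rule out is that every DDP rank materializes the restored checkpoint on GPU 0 before Lightning moves each replica to its own device. A sketch of how we would force CPU-side loading, assuming restore_from accepts a map_location argument in our NeMo version (please correct us if this is not the right approach):

import torch
import nemo.collections.asr as nemo_asr

# Restore onto CPU so no rank creates a stray CUDA context on GPU 0;
# DDP then moves each replica to its own device at fit() time
model = nemo_asr.models.EncDecCTCModelBPE.restore_from(
    "citrinet_1024_finetuned.nemo", map_location=torch.device("cpu"))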
Is this memory imbalance expected?
Are we doing something wrong? Is there a way to distribute the load equally across all GPUs in order to maximize the benefit?
Do you have a tutorial/notebook or an article focusing on best practices for multi-GPU training?
Thank you!
Francesco