Optimize fine-tuning of a Citrinet model in a multi-GPU environment

Hello support,

we are attempting to move from a single-GPU to a multi-GPU training environment.
The training task is fine-tuning a Citrinet-1024 model for speech recognition.

We ran a first fine-tuning session on a single-GPU machine (one V100 with 16 GB of memory); now we are moving to a new machine with 4 GPUs (4 T4s with 16 GB of memory each).
The first training session used a batch_size of 16 and a learning rate of 0.025.

The script we prepared for multi-GPU fine-tuning performs the following tasks (a fuller sketch of the script follows this list):

  • loads the pre-trained model
  • changes some configuration parameters, especially the learning rate (to account for the larger number of GPUs)
  • instantiates the Trainer object:

   import pytorch_lightning as pl

   gpuN = 4
   epochs = 300
   accelerator_mode = "ddp"
   withLogger = False
   withCheckpointCallback = False

   trainer = pl.Trainer(gpus=gpuN, max_epochs=epochs, accelerator=accelerator_mode,
                        logger=withLogger, checkpoint_callback=withCheckpointCallback)
  • starts the training
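
For reference, a minimal sketch of what the whole script does, assuming the model is loaded with NeMo's EncDecCTCModelBPE.restore_from (the .nemo path and the linear learning-rate scaling are only illustrative, and the dataset setup is omitted):

   import copy
   import pytorch_lightning as pl
   import nemo.collections.asr as nemo_asr

   # 1) load the pre-trained Citrinet-1024 checkpoint (path is a placeholder)
   model = nemo_asr.models.EncDecCTCModelBPE.restore_from("citrinet_1024.nemo")

   # 2) adjust the configuration, in particular the learning rate
   #    (linear scaling with the GPU count is shown here only as an example)
   gpuN = 4
   optim_cfg = copy.deepcopy(model.cfg.optim)
   optim_cfg.lr = 0.025 * gpuN
   model.setup_optimization(optim_config=optim_cfg)

   # 3) instantiate the Trainer (PyTorch Lightning arguments as of version 1.4)
   trainer = pl.Trainer(gpus=gpuN, max_epochs=300, accelerator="ddp",
                        logger=False, checkpoint_callback=False)
   model.set_trainer(trainer)

   # 4) start the training
   trainer.fit(model)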

To benefit from the increased hardware capacity, we intended to keep the per-GPU batch size at 16, which with DDP would give an effective batch size of 64, but we get an OOM error.
We tried decreasing the per-GPU batch size; the largest value that avoids the OOM error is 12.
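
To make the numbers concrete, this is how the per-GPU batch size enters the picture (continuing the sketch above; the train_ds field name is the one in our config and may differ elsewhere):

   # per-GPU batch size is set in the train dataloader config
   train_ds_cfg = copy.deepcopy(model.cfg.train_ds)
   train_ds_cfg.batch_size = 16      # intended per-GPU value -> OOM on the T4s
   # train_ds_cfg.batch_size = 12    # largest per-GPU value that avoids the OOM
   model.setup_training_data(train_data_config=train_ds_cfg)

   # with DDP each GPU processes its own batch, so:
   #   effective_batch_size = per_gpu_batch_size * num_gpus
   #   16 * 4 = 64 (what we intended)
   #   12 * 4 = 48 (what currently fits)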

Observing the output of the nvidia-smi command while training is running, we see that GPU 0 has more memory allocated than the other three, so it may be bottlenecking the others and causing the OOM.

Thu Oct 28 10:23:36 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000001:00:00.0 Off |                    0 |
| N/A   68C    P0    63W /  70W |  14623MiB / 15109MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000002:00:00.0 Off |                    0 |
| N/A   65C    P0    67W /  70W |  13232MiB / 15109MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000003:00:00.0 Off |                    0 |
| N/A   72C    P0    72W /  70W |  13226MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000004:00:00.0 Off |                    0 |
| N/A   65C    P0    69W /  70W |  13282MiB / 15109MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
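
In case it is useful for the diagnosis, we could add a quick per-process check inside the training script to report memory from each rank (torch.cuda.memory_allocated / memory_reserved are standard PyTorch calls; this is only a sketch):

   import torch

   # each DDP process reports its own device; values in MiB
   dev = torch.cuda.current_device()
   print(f"device {dev}: "
         f"allocated={torch.cuda.memory_allocated(dev) / 2**20:.0f} MiB, "
         f"reserved={torch.cuda.memory_reserved(dev) / 2**20:.0f} MiB")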

Is our interpretation of the GPU 0 memory imbalance correct?

Are we doing something wrong? Is there a way to distribute the load equally across all GPUs so that we get the full benefit of the extra hardware?
Do you have a tutorial, notebook, or article on best practices for multi-GPU training?

Thank you!
Francesco