Multi-GPU training takes longer than single-GPU training

Hi,

I am training a Tacotron2 model on 8 GPUs, and the run takes about 5.5 days to complete. The same training takes 4.5 days on a single GPU. GPU memory is fully utilised in both cases. I also see that no data is being transferred over NVLink: every link counter in the output below reads 0 KBytes.

Are there any NCCL flags I can set to reduce the training time?

Thanks.

ubuntu@ip-172-31-2-61:~$ sudo nvidia-smi nvlink -g 0
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-82c2ec5d-7ade-3e06-acfc-cef1cb8a1e70)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 4: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 5: Rx0: 0 KBytes, Tx0: 0 KBytes
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-cefb4dce-aa97-70f8-3820-2692e3efaac7)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 4: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 5: Rx0: 0 KBytes, Tx0: 0 KBytes
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-233dbfe5-8e41-c620-b7b1-1305e2826559)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 4: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 5: Rx0: 0 KBytes, Tx0: 0 KBytes
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-e01d51d6-c34d-7af2-1084-9ee437ac4c73)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 4: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 5: Rx0: 0 KBytes, Tx0: 0 KBytes
GPU 4: Tesla V100-SXM2-16GB (UUID: GPU-397e3847-ee0c-fb9c-9ab7-8927f74229cb)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 4: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 5: Rx0: 0 KBytes, Tx0: 0 KBytes
GPU 5: Tesla V100-SXM2-16GB (UUID: GPU-a5df4422-00f9-6b2b-4451-02726949b72d)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 4: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 5: Rx0: 0 KBytes, Tx0: 0 KBytes
GPU 6: Tesla V100-SXM2-16GB (UUID: GPU-f7a532ce-a4a2-5e4d-958c-abce4b09ed9e)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 4: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 5: Rx0: 0 KBytes, Tx0: 0 KBytes
GPU 7: Tesla V100-SXM2-16GB (UUID: GPU-6618b08d-0d90-c128-4e61-f18a2714b2c3)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 2: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 3: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 4: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 5: Rx0: 0 KBytes, Tx0: 0 KBytes
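
In case it helps anyone reproduce the check, here is a minimal sketch of how the counter dump above could be summarised per GPU. It assumes the output keeps exactly the `GPU n:` / `Link n: Rx0: … KBytes, Tx0: … KBytes` layout shown above. The names `nvlink_traffic_kb` and `sample` are just illustrative. A total of zero only means the counters recorded nothing; it could also be that counting was never enabled or that no traffic occurred while sampling.

```python
import re

# Matches counter lines such as "Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes".
LINK_RE = re.compile(
    r"Link\s+(\d+):\s*Rx0:\s*(\d+)\s*KBytes,\s*Tx0:\s*(\d+)\s*KBytes"
)
# Matches the per-GPU header line, e.g. "GPU 0: Tesla V100-SXM2-16GB (...)".
GPU_RE = re.compile(r"GPU\s+(\d+):")


def nvlink_traffic_kb(text):
    """Return {gpu_index: total Rx0 + Tx0 KBytes} parsed from counter output."""
    totals = {}
    gpu = None
    for line in text.splitlines():
        line = line.strip()
        header = GPU_RE.match(line)
        if header:
            gpu = int(header.group(1))
            totals[gpu] = 0
            continue
        link = LINK_RE.match(line)
        if link and gpu is not None:
            totals[gpu] += int(link.group(2)) + int(link.group(3))
    return totals


# A shortened excerpt of the output above, used as sample input.
sample = """GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-82c2ec5d-7ade-3e06-acfc-cef1cb8a1e70)
Link 0: Rx0: 0 KBytes, Tx0: 0 KBytes
Link 1: Rx0: 0 KBytes, Tx0: 0 KBytes"""

totals = nvlink_traffic_kb(sample)
idle_gpus = [g for g, kb in totals.items() if kb == 0]
print("GPUs with no recorded NVLink traffic:", idle_gpus)
```

Feeding it the full output above (for example, saved to a text file and read back in) would list every GPU whose links recorded no traffic.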