When training a deep learning model in a parallel multi-GPU environment (e.g., a 4×A100 DGX) and profiling/visualizing the run in TensorBoard, I have found that a large part of the time is spent in kernel launch; see "kernel launch takes too much time 1" (some steps in the middle of training) and "kernel launch takes too much time 2" (the overall training).
Most of the time the GPUs are idle; see "gpus are idling".
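For reference, a minimal sketch of the kind of profiling setup assumed here, in case it matters for the answer: PyTorch's `torch.profiler` writing TensorBoard traces, which is where the kernel-launch time shows up on the timeline. The model, batch shapes, and log directory below are placeholders, not the actual training code.

```python
# Sketch of the assumed profiling setup (PyTorch + TensorBoard trace handler).
# Model, data, and "./tb_logs" are placeholders for illustration only.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

device = torch.device("cuda")
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./tb_logs"),  # placeholder log dir
    record_shapes=True,
) as prof:
    for step in range(8):
        x = torch.randn(64, 1024, device=device)
        y = torch.randint(0, 10, (64,), device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # advance the profiler schedule once per training step
```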
Could anyone advise which features or optimization options can be enabled on the CUDA platform to reduce the kernel launch overhead and improve the low GPU utilization? And is it possible that this problem is related only to some specific kernels?