What can I try when CUDA API calls on the CPU become the bottleneck?


I have a network that consists of many kernels. Although TRT fuses many of them, there are still many kernel launches. Using Nsight Systems and NVTX, I can see that the CPU has become the bottleneck: the GPU is often idle, waiting for the CPU to issue the next launch. Suggestions are appreciated.

(The network uses dynamic shapes, so CUDA Graphs may not be very helpful.)


TensorRT Version: 8.0
GPU Type: T4
Nvidia Driver Version: 460
CUDA Version: 11
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):


Unfortunately, the options are quite limited. Here are some ideas:

  • Increase the batch size (BS) to increase workload density per launch.
  • Capture one CUDA graph per possible dynamic shape (e.g., one per batch size) and select which graph to launch at runtime based on the input shape.
  • Build multiple engines, one per shape, load all of them at runtime, and run each with CUDA graphs.
  • etc.
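The graph-per-shape idea above can be sketched as a small cache keyed by batch size. This is a hypothetical helper, not a real TensorRT API: the capture callback stands in for wrapping the engine's enqueue in `cudaStreamBeginCapture` / `cudaStreamEndCapture` and instantiating with `cudaGraphInstantiate`; replaying a cached entry stands in for a single `cudaGraphLaunch` instead of many per-kernel launch calls.

```python
# Sketch of a per-shape CUDA-graph cache (hypothetical helper, not a real
# TensorRT API). One graph is captured per batch size on first use, then
# replayed, so steady-state inference costs one launch call on the CPU.

class GraphCache:
    def __init__(self, capture_fn):
        # capture_fn(batch_size) -> graph handle. In real code this would
        # set the input binding shape, warm up once, capture the enqueue
        # with cudaStreamBeginCapture/EndCapture, and cudaGraphInstantiate.
        self._capture = capture_fn
        self._graphs = {}

    def launch(self, batch_size):
        # Capture on first sight of a shape, replay thereafter.
        if batch_size not in self._graphs:
            self._graphs[batch_size] = self._capture(batch_size)
        graph = self._graphs[batch_size]
        # Real code: cudaGraphLaunch(graph, stream)
        return graph


captures = []  # records which shapes paid the one-time capture cost
cache = GraphCache(lambda bs: captures.append(bs) or f"graph_bs{bs}")

cache.launch(1)       # captures a graph for batch size 1
cache.launch(8)       # captures a graph for batch size 8
cache.launch(8)       # replays the cached graph; no new capture
print(captures)       # -> [1, 8]
```

Note that each distinct shape pays the capture cost once, so this pays off only when the set of shapes seen at runtime is small (e.g., a handful of batch sizes), which matches the per-BS suggestion above.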

Thank you.

Please check the link below, as it might answer your concerns.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.