What can I try when CUDA API calls on the CPU become the bottleneck?


I have a network that consists of many kernels. Although TRT fuses many of them, there are still many kernel launches. Using Nsight Systems and NVTX, I can see that the CPU has become the bottleneck: the GPU is often idle, waiting for the CPU to issue the next launch. Suggestions are appreciated.

(The network uses dynamic shapes, so CUDA Graphs may not be very helpful.)


TensorRT Version: 8.0
GPU Type: T4
Nvidia Driver Version: 460
CUDA Version: 11
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):


Unfortunately, the options are quite limited. Here are some ideas:

  • Increase the batch size (BS) to increase workload density per launch.
  • Capture one CUDA graph per possible dynamic shape (e.g., one per batch size) and select which graph to launch at runtime based on the input shape.
  • Build multiple engines, one per shape, load all of them at runtime, and run each with CUDA graphs.
  • etc.
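The graph-per-shape idea above can be sketched as a small cache keyed by batch size. This is a hypothetical helper, not a real TensorRT API: the capture callback stands in for wrapping the engine's enqueue in `cudaStreamBeginCapture` / `cudaStreamEndCapture` and instantiating with `cudaGraphInstantiate`; replaying a cached entry stands in for a single `cudaGraphLaunch` instead of many per-kernel launch calls.

```python
# Sketch of a per-shape CUDA-graph cache (hypothetical helper, not a real
# TensorRT API). One graph is captured per batch size on first use, then
# replayed, so steady-state inference costs one launch call on the CPU.

class GraphCache:
    def __init__(self, capture_fn):
        # capture_fn(batch_size) -> graph handle. In real code this would
        # set the input binding shape, warm up once, capture the enqueue
        # with cudaStreamBeginCapture/EndCapture, and cudaGraphInstantiate.
        self._capture = capture_fn
        self._graphs = {}

    def launch(self, batch_size):
        # Capture on first sight of a shape, replay thereafter.
        if batch_size not in self._graphs:
            self._graphs[batch_size] = self._capture(batch_size)
        graph = self._graphs[batch_size]
        # Real code: cudaGraphLaunch(graph, stream)
        return graph


captures = []  # records which shapes paid the one-time capture cost
cache = GraphCache(lambda bs: captures.append(bs) or f"graph_bs{bs}")

cache.launch(1)       # captures a graph for batch size 1
cache.launch(8)       # captures a graph for batch size 8
cache.launch(8)       # replays the cached graph; no new capture
print(captures)       # -> [1, 8]
```

Note that each distinct shape pays the capture cost once, so this pays off only when the set of shapes seen at runtime is small (e.g., a handful of batch sizes), which matches the per-BS suggestion above.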

Thank you.

Please check the link below, as it might answer your concerns.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.