What is cudaEventSynchronize waiting for?


An object detection algorithm runs in a loop on a TensorRT backend. I am sharing the profiler output. I have two questions.

1- I see some long-waiting cudaEventSynchronize calls in the Visual Profiler. What could be the root cause of these long waits, and how can I avoid them?

2- I expect each cudaEventSynchronize to end at the same point as the last compute kernel, but there is a small gap after every call. Is this expected, or is it the same faulty behavior as in (1)?

Here is the TensorRT calling code snippet:

    // Enqueue inference asynchronously on the compute stream
    bool status = mInfEnv.context.back()->enqueue(batch, mInfEnv.bindings.back()->getDeviceBuffers(), getStream(StreamType::kCOMPUTE).get(), nullptr);
    if (!status)
        LOG_S(WARNING) << "TRT inference failed";
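For context, the surrounding synchronization pattern looks roughly like this. This is a simplified sketch, not my exact code; the stream and event variable names here are placeholders:

```cpp
#include <cuda_runtime.h>

// Sketch of one loop iteration (names are illustrative placeholders).
cudaStream_t stream;  // corresponds to the kCOMPUTE stream
cudaEvent_t  done;
cudaStreamCreate(&stream);
cudaEventCreate(&done);

// Enqueue inference asynchronously, then mark its completion with an event.
context->enqueue(batch, deviceBuffers, stream, nullptr);
cudaEventRecord(done, stream);

// Block the host thread until all work enqueued before the event has
// finished. This is the call that shows up as a long wait in the profiler
// whenever the host reaches it well before the GPU work completes.
cudaEventSynchronize(done);
```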


TensorRT Version:
GPU Type:
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

out.nvvp (4.8 MB)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered