What does cudaEventSynchronize wait for?

Description

An object detection algorithm runs in a loop on a TensorRT backend. I am sharing the profiler output. I have two questions.

1- I see some long-waiting cudaEventSynchronize calls in the visual profiler. How can I avoid them? What could be the root cause of these long waits?

2- I expect the end of each cudaEventSynchronize to line up with the end of the last compute unit, but there is a small gap in each call. Is this expected, or is it the same faulty behavior seen in (1)?

Here is the TensorRT calling code snippet:

    // INFERENCE
    // Make the compute stream wait until the input transfer has completed.
    getStream(StreamType::kCOMPUTE).wait(getEvent(EventType::kEND_OF_INPUT));
    bool status = mInfEnv.context.back()->enqueue(batch, mInfEnv.bindings.back()->getDeviceBuffers(), getStream(StreamType::kCOMPUTE).get(), nullptr);
    if (!status)
    {
        LOG_S(WARNING) << "TRT inference failed";
        abort();
    }
    // Record an event at the end of the compute stream, then block the host on it.
    mEvents[static_cast<int32_t>(EventType::kEND_OF_COMPUTE)].record(getStream(StreamType::kCOMPUTE));
    getEvent(EventType::kEND_OF_COMPUTE).synchronize();
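One common way to shrink these host-side waits is to double-buffer and synchronize only on the *previous* iteration's event, so the GPU always has the next batch queued while the host consumes the last result. Below is a minimal, hedged sketch of that pattern using the raw CUDA runtime API; `enqueueInference` and `consumeOutput` are hypothetical placeholders for your TRT enqueue and output handling, not real calls from the snippet above.

```cpp
#include <cuda_runtime.h>

// Sketch: double-buffered loop that overlaps host work with GPU compute.
// Assumes two sets of device bindings, indexed 0 and 1 (an assumption,
// not something shown in the original snippet).
void pipelineLoop(int iterations)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // cudaEventDisableTiming lowers event overhead when you only need
    // synchronization, not timestamps.
    cudaEvent_t done[2];
    cudaEventCreateWithFlags(&done[0], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&done[1], cudaEventDisableTiming);

    for (int i = 0; i < iterations; ++i)
    {
        int cur = i % 2;
        // enqueueInference(cur, stream);   // hypothetical: enqueue TRT work on buffer `cur`
        cudaEventRecord(done[cur], stream);

        if (i > 0)
        {
            // Block only on the previous iteration: the current one is
            // already queued, so the GPU never sits idle during this wait.
            cudaEventSynchronize(done[(i - 1) % 2]);
            // consumeOutput((i - 1) % 2);  // hypothetical: use the finished results
        }
    }
    // Drain the final in-flight iteration.
    cudaEventSynchronize(done[(iterations - 1) % 2]);

    cudaEventDestroy(done[0]);
    cudaEventDestroy(done[1]);
    cudaStreamDestroy(stream);
}
```

On the small gap in (2): some host-side latency after the event completes is normal, since the waiting thread has to be woken (or finish a spin-poll) before cudaEventSynchronize returns. Creating the event with cudaEventBlockingSync trades CPU usage for potentially longer wakeup latency, while the default spin-wait minimizes that gap at the cost of a busy CPU thread.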

Environment

TensorRT Version:
GPU Type:
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

out.nvvp (4.8 MB)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered