What is cudaEventSynchronize waiting for?


An object detection algorithm runs in a loop on a TensorRT backend. I am sharing the profiler output. I have two questions.

1- I see some long-waiting cudaEventSynchronize calls in the Visual Profiler. What could be the root cause of these long waits, and how can I avoid them?

2- I expect each cudaEventSynchronize to end at the same point as the last compute kernel, but there is a small gap after every call. Is this expected, or is it the same faulty behavior as in (1)?

Here is the TensorRT calling code snippet:

    // Enqueue inference asynchronously on the compute stream
    bool status = mInfEnv.context.back()->enqueue(batch, mInfEnv.bindings.back()->getDeviceBuffers(), getStream(StreamType::kCOMPUTE).get(), nullptr);
    if (!status)
        LOG_S(WARNING) << "TRT inference failed";
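For context, the surrounding synchronization pattern looks roughly like this. This is a simplified sketch, not my exact code; the stream and event variable names here are placeholders:

```cpp
#include <cuda_runtime.h>

// Sketch of one loop iteration (names are illustrative placeholders).
cudaStream_t stream;  // corresponds to the kCOMPUTE stream
cudaEvent_t  done;
cudaStreamCreate(&stream);
cudaEventCreate(&done);

// Enqueue inference asynchronously, then mark its completion with an event.
context->enqueue(batch, deviceBuffers, stream, nullptr);
cudaEventRecord(done, stream);

// Block the host thread until all work enqueued before the event has
// finished. This is the call that shows up as a long wait in the profiler
// whenever the host reaches it well before the GPU work completes.
cudaEventSynchronize(done);
```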


TensorRT Version:
GPU Type:
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

out.nvvp (4.8 MB)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered