Recently I ran into an issue: when I run trtexec with the option "--streams=14", only one CUDA stream executes all the inferences!
But when I add the option "--threads=14", then 14 host threads execute the inferences in parallel...
But why? I saw that even with a single thread the enqueueV2 API is used, which should let all the inferences run in parallel!
But to actually see parallel execution I need to add this "--threads" option...
Are there some limitations of enqueueV2? Maybe it executes in a single thread in some cases?
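For reference, this is roughly the pattern I expected trtexec to implement internally. It is only a sketch of my understanding, not the actual trtexec code (the function name and the way bindings are passed are made up): a single host thread can keep several streams busy, because enqueueV2 only enqueues the work and returns immediately, but each stream must get its own execution context.

#include <vector>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Sketch only: one IExecutionContext per CUDA stream, all driven from a
// single host thread. enqueueV2() is asynchronous, so the middle loop just
// queues work; the GPU then runs the streams concurrently.
void runStreamsConcurrently(nvinfer1::ICudaEngine& engine,
                            std::vector<std::vector<void*>>& bindings, // one binding set per stream (made-up layout)
                            int nbStreams)
{
    std::vector<nvinfer1::IExecutionContext*> contexts(nbStreams);
    std::vector<cudaStream_t> streams(nbStreams);

    for (int s = 0; s < nbStreams; ++s)
    {
        contexts[s] = engine.createExecutionContext(); // per-stream context
        cudaStreamCreate(&streams[s]);
    }

    // Single host thread: each call returns as soon as the work is enqueued.
    for (int s = 0; s < nbStreams; ++s)
    {
        contexts[s]->enqueueV2(bindings[s].data(), streams[s], nullptr);
    }

    // Wait for all streams to finish, then clean up.
    for (int s = 0; s < nbStreams; ++s)
    {
        cudaStreamSynchronize(streams[s]);
        contexts[s]->destroy(); // deprecated in TensorRT 8+, use delete there
        cudaStreamDestroy(streams[s]);
    }
}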
In the trtexec sources I found this loop:

IterationStreams iStreams;
for (int s = 0; s < streams; ++s)
{
    // note: iEnv.context[offset] and iEnv.bindings[offset] do not depend on s
    Iteration* iteration = new Iteration(offset + s, inference, *iEnv.context[offset], *iEnv.bindings[offset]);
    ...
}
The same context is used for multiple streams, and according to the documentation for enqueueV2 this is undefined behavior:
"Calling enqueueV2() from the same IExecutionContext object with different CUDA streams concurrently results in undefined behavior. To perform inference concurrently in multiple streams, use one execution context per stream."
Am I right ?
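If I am right, the fix in trtexec would presumably be to index the per-stream context and bindings instead of reusing entry offset for every stream (assuming iEnv.context and iEnv.bindings really do hold one entry per stream), something like:

IterationStreams iStreams;
for (int s = 0; s < streams; ++s)
{
    // presumed fix: each stream s gets its own context and binding set
    Iteration* iteration = new Iteration(offset + s, inference, *iEnv.context[offset + s], *iEnv.bindings[offset + s]);
    ...
}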
But then why do you ship such a bad example, with undefined behavior, in TensorRT?
It is not related to a custom model; it is related to the undefined behavior in your trtexec example, as I described above...
Calling enqueueV2 from the same execution context on different streams concurrently is undefined behavior, and you have exactly that undefined behavior in your code.