[Question] trtexec understanding issue

Hi all,

Recently I ran into an issue: when I run trtexec with the option "--streams=14", only one CUDA stream executes all inferences!

But when I additionally pass the option "--threads=14", the 14 streams do execute in parallel …

But why? I saw that even in the single-thread case the enqueueV2 API is used, which is asynchronous and should let all inferences run in parallel!
Yet to actually see parallel execution I need to add this "--threads" option …

Are there some limitations of enqueueV2? Maybe it serializes execution in some cases?

It seems this happens because of the following code in trtexec:

    IterationStreams iStreams;
    for (int s = 0; s < streams; ++s)
    {
        // Note: the context index is `offset`, not `offset + s`, so every
        // stream shares the same execution context.
        Iteration* iteration = new Iteration(offset + s, inference, *iEnv.context[offset], *iEnv.bindings[offset]);
        iStreams.push_back(iteration);
    }

The same execution context is used for multiple streams, which according to the documentation for enqueueV2 is undefined behavior:

Calling enqueueV2() from the same IExecutionContext object with different CUDA streams concurrently results in undefined behavior. To perform inference concurrently in multiple streams, use one execution context per stream.

Am I right?
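For reference, here is a minimal sketch of the pattern the documentation asks for: one IExecutionContext per CUDA stream, so that concurrent enqueueV2 calls are well-defined. The `engine` reference and the per-stream device buffer sets are assumptions for illustration; error handling and buffer setup are omitted.

```cpp
#include <vector>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Sketch only: one execution context per stream, as the enqueueV2
// documentation requires. Assumes `bindings[s]` already holds the
// device buffer pointers for stream s.
void inferConcurrently(nvinfer1::ICudaEngine& engine,
                       std::vector<std::vector<void*>>& bindings,
                       int numStreams)
{
    std::vector<nvinfer1::IExecutionContext*> contexts(numStreams);
    std::vector<cudaStream_t> streams(numStreams);

    for (int s = 0; s < numStreams; ++s)
    {
        contexts[s] = engine.createExecutionContext(); // one context per stream
        cudaStreamCreate(&streams[s]);
    }

    // Each enqueueV2 call uses its own context, so the asynchronous
    // launches are well-defined and can overlap on the GPU.
    for (int s = 0; s < numStreams; ++s)
    {
        contexts[s]->enqueueV2(bindings[s].data(), streams[s], nullptr);
    }

    for (int s = 0; s < numStreams; ++s)
    {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
        contexts[s]->destroy();
    }
}
```

The extra contexts cost additional activation memory per context, which is presumably why trtexec ties them to the `--threads` path.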

But then why does TensorRT ship such a bad example with undefined behavior?

It seems I have found a similar report: multi-stream parallel execution with one GPU ERROR · Issue #846 · NVIDIA/TensorRT · GitHub

Please refer to the below link for the sample guide.

Refer to the installation steps in the link in case you are missing anything.

However, the suggested approach is to use the TRT NGC containers to avoid any system-dependency related issues.

To run the Python samples, make sure the TRT Python packages are installed when using the NGC container.

If you are trying to run a custom model, please share your model and script with us so that we can assist you better.


This is not related to a custom model; it is related to the undefined behaviour in your trtexec example, as I described above.
Calling enqueueV2 on the same execution context from different streams is undefined behaviour, and your code does exactly that.