Issue in making streams concurrent

Here are the details of the platform I am using:

  • Linux distro and version - Linux Ubuntu 18.04
  • GPU type - Jetson AGX Xavier
  • CUDA version - 10.0
  • CUDNN version - 7.3
  • TensorRT version - 5.0
  • OpenCV version - 3.3.1

Here’s what I am trying to achieve:

foo doTensorRTStuff()
{
    // build the CUDA engine: parse the network and build the model
    cudaStreamCreateWithFlags(&pStream, cudaStreamNonBlocking);
    // call context->enqueue() to perform inference on pStream
    return results;
}

int main()
{
    // read image
    std::future<foo> fut = std::async(doTensorRTStuff);
    foo bar = fut.get();
    return 0;
}

The issue here is that I don’t see the TensorRT work and the OpenCV work happening concurrently on the GPU (device) when inspected with NVIDIA Visual Profiler, even though they are launched in separate threads on the host. The streams are still serialized.

  • Is this a drawback of TensorRT?
  • Can someone shed light on what exactly is happening on the GPU?


    Tasks queued in the same CUDA stream are executed in sequence.
    Have you linked OpenCV and TensorRT to different streams?

    Another possible issue is the workload of TensorRT.
    If TensorRT already occupies all of the GPU resources, concurrent execution is impossible.

    You can check this by monitoring the GPU utilization.

    sudo ./tegrastats
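A minimal sketch of what “different streams” means here, assuming two illustrative kernels (`busyKernel` and all names are made up for the example, not taken from the original code):

```cpp
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)  // artificial work to keep SMs busy
            data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Non-blocking streams do not synchronize with the legacy default stream.
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

    // Work issued to different streams MAY overlap on the device --
    // but only if enough SMs/registers/shared memory are still free.
    busyKernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    busyKernel<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Note that in this sketch each launch already fills every SM (4096 blocks), so in the profiler the two kernels will still largely serialize; that is exactly the workload effect described above, and it is the first thing to rule out with tegrastats.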



    I have experimented with two combinations, but neither of them executed concurrently. I tried:

    • TensorRT inference + OpenCV CUDA operations in separate threads: I created a stream with cudaStreamCreateWithFlags(..., cudaStreamNonBlocking) for TensorRT and an instance of cv::cuda::Stream for the OpenCV GPU operations
    • Only OpenCV CUDA operations, launched in different threads with instances of cv::cuda::Stream so that the operations do not run in the default stream.
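For the second combination, a minimal sketch of the pattern (the sizes, iteration count, and `worker` function are illustrative, not from the original code):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/cudaarithm.hpp>
#include <thread>

// Each worker runs its GPU arithmetic on its own cv::cuda::Stream so the
// operations are not funneled through the default stream.
void worker(cv::cuda::GpuMat &a, cv::cuda::GpuMat &b, cv::cuda::GpuMat &dst) {
    cv::cuda::Stream stream;  // wraps its own cudaStream_t
    for (int i = 0; i < 100; ++i)
        cv::cuda::add(a, b, dst, cv::noArray(), -1, stream);
    stream.waitForCompletion();
}

int main() {
    cv::cuda::GpuMat a1(2048, 2048, CV_32F, cv::Scalar(1));
    cv::cuda::GpuMat b1(2048, 2048, CV_32F, cv::Scalar(2)), d1;
    cv::cuda::GpuMat a2(2048, 2048, CV_32F, cv::Scalar(3));
    cv::cuda::GpuMat b2(2048, 2048, CV_32F, cv::Scalar(4)), d2;

    std::thread t1(worker, std::ref(a1), std::ref(b1), std::ref(d1));
    std::thread t2(worker, std::ref(a2), std::ref(b2), std::ref(d2));
    t1.join();
    t2.join();
    return 0;
}
```

One thing worth checking: as far as I can tell, OpenCV's default cv::cuda::Stream constructor creates a regular (blocking) stream rather than a non-blocking one, so any work that still lands on the legacy default stream anywhere in the process will synchronize with it.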

    I do not see any sign of concurrency at all. I tested this on the Xavier initially and later on a Titan V to rule out a resource limitation, but that does not seem to be the case. The only code that shows concurrent execution is the HyperQ demo from the NVIDIA CUDA samples. How do I replicate that behaviour in my context?


    To give a further suggestion, could you share your source with us?

    Hello AastaLLL,

    The source code is proprietary, so I won’t be able to share it here. But replicating my scenario shouldn’t be difficult. Here’s how you can reproduce what I am facing:

    1. Invoke any two OpenCV GPU functions in threads
    2. Perform TensorRT inference (using the jetson-inference repo or any of the TensorRT samples)

    You can also swap the order: invoke inference in one thread and invoke the OpenCV GPU functions after that (as in the snippet above).

    These functions aren’t being executed concurrently on the device.


    Before checking this, could you share the system log from tegrastats with us?
    In most cases, TensorRT occupies nearly all of the GPU resources and may prevent other applications from executing concurrently.


    Hi AdithyaP,

    Is there any update on this issue or not a problem now?