Streaming with GPU and DLA

Hello,
I created a plan where the first few layers run on the DLA and the remaining layers run on the GPU. I wanted to know how I can use this plan to process streaming data in parallel. What I mean is this:
The first image is processed by the DLA, then by the GPU. While the first image is on the GPU, the second image is on the DLA. When the second image reaches the GPU, the third image enters the DLA, and so on. Essentially, after the first image, at any point I am using the DLA and the GPU at the same time. Like the image below:

How can I do this, and which application should I use to verify that it is running in parallel? (Note that both the input and output are images.)

I tried using nsys to profile the plan. Following the documentation (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation), I do not see the other-accelerators API trace in the output. I have added a link to the files as well.
Link to files

Thank you.

Hi,

You will need to use the TensorRT API to specify that certain layers run on the DLA:

Or use the cuDLA library:
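For the first option, a minimal sketch of per-layer DLA assignment with the TensorRT Python builder API looks like the following (this assumes a network has already been populated, e.g. by the ONNX parser; the count of ten layers mirrors the setup described in this thread):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
# ... populate `network` here, e.g. with trt.OnnxParser ...

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)  # fall back to GPU for unsupported layers
config.DLA_core = 0

# Pin the first ten layers to the DLA; the rest stay on the GPU.
for i in range(min(10, network.num_layers)):
    layer = network.get_layer(i)
    if config.can_run_on_DLA(layer):
        config.set_device_type(layer, trt.DeviceType.DLA)

plan = builder.build_serialized_network(network, config)
```

The serialized `plan` can then be written to disk and loaded for inference as usual.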

Thanks.

Hello,
Thank you for the response.
I have used the TensorRT Python API to set the device type to DLA for the first few layers and built the plan from this. What I would like to know is how to achieve the streaming step, where I am able to execute the model in parallel. My input and output are images.

Hi,

Please run the model multiple times within the same process, but with different threads and CUDA streams.
The GPU scheduler will optimize the tasks based on the available resources.
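As a minimal sketch of the pipelining pattern itself (pure Python, with `time.sleep` standing in for the DLA and GPU portions of the engine; in a real application each worker would hold its own execution context and enqueue work on its own CUDA stream):

```python
import queue
import threading
import time

def dla_stage(image):
    time.sleep(0.01)          # stand-in for the DLA portion of the engine
    return f"{image}-dla"

def gpu_stage(feature):
    time.sleep(0.01)          # stand-in for the GPU portion of the engine
    return f"{feature}-gpu"

def pipeline(images):
    """Run the two stages concurrently: while image N is on the GPU,
    image N+1 is already being processed on the DLA."""
    handoff = queue.Queue(maxsize=1)   # single-slot buffer between stages
    results = []

    def dla_worker():
        for img in images:
            handoff.put(dla_stage(img))
        handoff.put(None)              # sentinel: no more work

    t = threading.Thread(target=dla_worker)
    t.start()
    while (item := handoff.get()) is not None:
        results.append(gpu_stage(item))
    t.join()
    return results

print(pipeline(["img0", "img1", "img2"]))
# ['img0-dla-gpu', 'img1-dla-gpu', 'img2-dla-gpu']
```

With N images the pipelined version takes roughly (N + 1) stage-times instead of 2N, which is the overlap you drew in your diagram; in Nsight Systems this shows up as DLA and GPU activity on overlapping time ranges.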

Thanks.

I am not sure I understood. Do you have a sample file that I could follow to do this?

Hi,

Please find the sample below:

Thanks.

Thank you for the sample. I followed the documentation to get the profiling data along with the sample and got the following results. (I used a plan with the first ten layers executed on the DLA and the remaining on the GPU.)

  1. Without streaming: no_streaming.txt (23.4 KB)

  2. With two streams: with_streaming.txt (28.5 KB)

I have also attached the logs from execution.

I can see an improvement in throughput, but I cannot see it executing in parallel as I described above. Nsight Systems shows serial execution in both cases. Is there something I am missing?

Also, is it possible to use a folder containing all the images as input to the model when using trtexec? From what I understood from the sample, I can only use a single file that has been converted to a binary format.
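Since trtexec takes a single raw binary blob per input tensor via --loadInputs rather than a folder, one workaround is a small driver script that writes one binary file per image and invokes trtexec once per file. A stdlib-only sketch of the conversion step (this assumes preprocessing has already produced a flat list of float32 values per image; real code would first decode and resize the images, e.g. with Pillow, and `convert_folder` is a hypothetical helper name):

```python
import pathlib
import struct

def write_trtexec_input(values, out_path):
    """Serialize a flat list of float32 values as raw little-endian
    bytes, the layout expected for --loadInputs=input_name:file.bin."""
    out_path = pathlib.Path(out_path)
    out_path.write_bytes(struct.pack(f"<{len(values)}f", *values))
    return out_path

def convert_folder(preprocessed, out_dir):
    """Write one .bin file per image; `preprocessed` maps image name
    to its flat list of float32 values."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    return [
        write_trtexec_input(values, out_dir / f"{name}.bin")
        for name, values in preprocessed.items()
    ]
```

Each resulting file can then be passed to a separate trtexec run, e.g. --loadInputs=input:img0.bin, so the folder is covered by looping over the generated files in a shell script.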