I am running inference on an NVIDIA Jetson Nano, which has the Tegra X1 chip, and I am wondering whether multiple CUDA streams are supported there in the GPU hardware. For example, I want to run a heavy inference network taking 300 ms while concurrently running another TensorRT inference network that only takes 50 ms. The Nsight profiler shows that my second, concurrent network waits for the first one, which is already running, to finish. I have a separate CUDA stream per network. I also tried cudaStreamCreateWithPriority, but that did not help: the lowest priority was 0 and the highest was -1, yet the result was the same. The running network was not preempted by the newly started, higher-priority inference.
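For reference, here is roughly how I create the prioritized streams (a minimal sketch; the stream names are placeholders and error checking is omitted):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Query the priority range supported by the device.
    // On the Jetson Nano this returns 0 (lowest) and -1 (highest).
    int lowest, highest;
    cudaDeviceGetStreamPriorityRange(&lowest, &highest);
    printf("stream priorities: lowest=%d, highest=%d\n", lowest, highest);

    // One stream per network: the heavy 300 ms net on the low-priority
    // stream, the light 50 ms net on the high-priority stream.
    cudaStream_t heavyStream, lightStream;
    cudaStreamCreateWithPriority(&heavyStream, cudaStreamNonBlocking, lowest);
    cudaStreamCreateWithPriority(&lightStream, cudaStreamNonBlocking, highest);

    // ... enqueue each network's work on its own stream ...

    cudaStreamDestroy(heavyStream);
    cudaStreamDestroy(lightStream);
    return 0;
}
```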
Is what I am trying to do here possible at all on the NVIDIA Jetson Nano? How many CUDA streams are possible? I saw that the device properties report concurrentKernels = 1 (concurrent kernel execution is supported) and asyncEngineCount = 1 (a single async copy engine).
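This is how I read those properties (a straightforward device query, nothing assumed beyond device 0 being the Nano's integrated GPU):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // concurrentKernels is a boolean capability flag (1 = kernels from
    // different streams may run concurrently); asyncEngineCount is the
    // number of async copy engines available for overlapping transfers.
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    return 0;
}
```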
I found indications in the CUDA part of the forum that it can't be done. The docs say a higher-priority CUDA stream "may" preempt, but preemption isn't guaranteed, so I will have to find another solution.
I run one process; each TensorRT network has its own execution context and its own cudaStream, all memory is allocated with cudaMallocHost (pinned host memory), and inference is enqueued for each stream from a separate thread.
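In code, the setup looks roughly like this (a sketch only: engine deserialization and buffer allocation are omitted, and the function and variable names are mine):

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>
#include <thread>

// One execution context + one stream per network, enqueued from its own thread.
void runNetwork(nvinfer1::ICudaEngine* engine, void** bindings) {
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronously enqueue the inference on this network's own stream.
    context->enqueueV2(bindings, stream, nullptr);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    context->destroy();
}

void launchBoth(nvinfer1::ICudaEngine* heavyEngine, void** heavyBindings,
                nvinfer1::ICudaEngine* lightEngine, void** lightBindings) {
    // One thread per network, as in my setup.
    std::thread t1(runNetwork, heavyEngine, heavyBindings);
    std::thread t2(runNetwork, lightEngine, lightBindings);
    t1.join();
    t2.join();
}
```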
After some more testing with Nsight: the networks do run concurrently as stated in the docs, but real preemption is not possible, if I understood the other posts in the CUDA sections correctly. So as a workaround we are now aligning the inferences in time so they interfere with each other as little as possible.
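For anyone hitting the same issue, the time-alignment workaround looks roughly like this (a sketch under assumptions: the 400 ms period and the runHeavy/runLight function names are placeholders for the actual TensorRT enqueue-and-sync calls):

```cpp
#include <chrono>
#include <thread>

// Instead of relying on preemption, launch the light (50 ms) network in the
// gap after the heavy (300 ms) network completes, at a fixed phase offset.
void scheduleAligned(void (*runHeavy)(), void (*runLight)()) {
    using clock = std::chrono::steady_clock;
    const auto period = std::chrono::milliseconds(400);  // assumed frame budget

    while (true) {
        auto start = clock::now();
        runHeavy();   // blocks ~300 ms (enqueue + synchronize)
        runLight();   // runs in the remaining ~100 ms window
        std::this_thread::sleep_until(start + period);  // keep the phase fixed
    }
}
```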
Thanks for the replies!