Slow inference using CUDA and PyTorch on Jetson AGX


We’re having trouble using PyTorch + CUDA on a Jetson AGX. We have two object tracking models. Both run with the same execution time on all our other devices, but on the AGX the first model runs fast while the second is ~8x slower. Is this the same problem as in the post “Jetson AGX Xavier: slow inference using CUDA and PyTorch”? And is there any way to make it run fast?

The last recommendation from NVIDIA in the thread you are pointing to was to install the CUDA profiler and see what it diagnoses the top bottleneck to be. Did you do that and if so, what were the results?

What are “all other devices”? Are they discrete GPUs or integrated solutions like the Jetson AGX? If it is the former, that would seem to have no bearing on this case, as it would be comparing apples and oranges. If it’s the latter (i.e. other integrated platforms), that could be quite relevant and you might want to list here what integrated platforms work well for this use case.

used on all other devices → I tested on GTX 1660, GTX 1660 Super, RTX 2070, and Jetson NX. All work normally.
But when I bring my torch_jit_model to the Jetson AGX or an RTX 3060, model 2 runs inference ~8x slower than model 1.
I am trying to use Nsight Systems to check this, but I don’t have any experience with it, so I can’t understand what Nsight Systems reports.
Here are the results from Nsight Systems (running on the 3060; I believe it behaves the same as on the Jetson AGX):

The first 8 seconds are when I run model 1, and the rest of the timeline is when I run model 2.
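
To put numbers next to the Nsight Systems timeline, it may help to time each model separately with a few warm-up calls and an explicit GPU synchronization before reading the clock, since CUDA kernel launches are asynchronous. The following is only a minimal sketch: the helper names `time_inference` and `benchmark_torchscript`, the warm-up and iteration counts, and the file names and input shape in the usage note are assumptions for illustration, not something taken from this thread.

```python
import time


def time_inference(run_once, sync=None, warmup=10, iters=50):
    """Return the mean wall-clock time in milliseconds per call of run_once().

    sync is an optional callable (e.g. torch.cuda.synchronize) that blocks
    until queued GPU work has finished, so the timer covers real execution.
    """
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        run_once()
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) * 1000.0 / iters


def benchmark_torchscript(paths, example_input):
    """Time several TorchScript files on the GPU (hypothetical helper)."""
    import torch  # imported lazily so time_inference works without PyTorch

    results = {}
    for path in paths:
        model = torch.jit.load(path, map_location="cuda").eval()
        with torch.no_grad():
            results[path] = time_inference(
                lambda: model(example_input), sync=torch.cuda.synchronize
            )
    return results
```

Assuming the two models are saved as `model1.pt` and `model2.pt` and take a `1x3x224x224` input (both placeholders), something like `benchmark_torchscript(["model1.pt", "model2.pt"], torch.randn(1, 3, 224, 224, device="cuda"))` would return the mean milliseconds per call for each. Running it with a larger `warmup` value would show whether the gap for model 2 persists after the first few calls or only affects them.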