Jetson AGX Xavier: slow inference using CUDA and PyTorch


Recently I’ve encountered some weird behaviour when using CUDA on the Jetson AGX Xavier together with PyTorch.

I am trying to run inference with my segmentation model. To speed things up, I move my images to the CUDA device (using .to(torch.device("cuda"))) as early as possible and perform all operations on PyTorch tensors. Each part of the inference pipeline is optimized and runs fast in isolation. However, when I assemble all the parts and start loading both my model and the data onto the CUDA device, some of the uploads take far too long (approximately 4 seconds).

I suspect this is caused by the CUDA cache filling up and then needing to be emptied. Am I right? Can emptying the CUDA cache really take 4 seconds, or could something else be slowing the process down? If it is the cache, is there a way to speed up emptying it?


A common bottleneck is the data transfer between different stages.
It can take a long time if an intermediate tensor is large.
It may cost even more if any swap memory is used.
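
As a rough first check, each stage can be timed with explicit synchronization. CUDA calls issued from PyTorch are asynchronous, so without a synchronization point the delay may be attributed to whichever later call happens to wait for the GPU, for example an upload. The following is only a minimal sketch: `timed` and `profile_pipeline` are hypothetical helper names, and `model` and `image` stand in for your own segmentation model and input tensor.

```python
import time


def timed(label, fn, sync=None):
    """Run fn() and return (result, seconds); sync() is called before and after."""
    if sync is not None:
        sync()  # drain previously queued asynchronous work first
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait until the work launched by fn() has actually finished
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.1f} ms")
    return result, elapsed


def profile_pipeline(model, image):
    """Hypothetical pipeline: `model` is an nn.Module, `image` is a CPU tensor."""
    import torch  # imported here so the timing helper has no hard dependency

    device = torch.device("cuda")
    sync = torch.cuda.synchronize  # blocks until all queued GPU work is done
    model, _ = timed("model upload", lambda: model.to(device), sync)
    batch, _ = timed("image upload", lambda: image.to(device), sync)
    with torch.no_grad():
        output, _ = timed("forward pass", lambda: model(batch), sync)
    return output
```

If the 4 seconds move from the upload to another stage once synchronization is added, the transfer itself is probably not the slow part.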

To get more detailed information, would you mind evaluating your use case with our profiler first?



Thanks for your answer. I’ve installed NVIDIA Nsight Compute from your link and followed the documentation to connect it to the Jetson and launch the target application on it. However, after the process is launched successfully, Nsight Compute seems to get stuck searching for attachable processes. The console output looks like this:

Have I misunderstood something? Should I use a different profiling tool or change the way I launch the target process?


You will need to use the Nsight Compute version integrated in JetPack.
Please install the CUDA package for the host; you will then find Nsight Compute in /opt/nvidia/nsight-compute/2019.5.0.