Recently I've encountered some strange behaviour on the Jetson AGX Xavier when using CUDA together with PyTorch.
I am trying to run inference with my segmentation model. To speed the process up, I load my images onto the CUDA device (using `.to(torch.device("cuda"))`) as early as possible and perform all operations on PyTorch tensors. The time spent in each part of the inference pipeline has been optimized, and each part runs fast in isolation. However, when I assemble all the parts together and start loading both my model and the data onto the CUDA device, some of the uploads to the CUDA device take far too long (approximately 4 seconds).
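For reference, this is roughly how I measure the upload times (the tensor here is a hypothetical stand-in for one of my images; since CUDA calls are asynchronous, I synchronize before and after each measurement so that queued work from earlier in the pipeline is not attributed to the transfer):

```python
import time
import torch

# Hypothetical stand-in for one of my input images (batch of 1, 3 channels, 512x512).
image = torch.randn(1, 3, 512, 512)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def timed_upload(tensor, device):
    # CUDA kernels launch asynchronously: without synchronizing first, the
    # interval measured around .to() can absorb work queued earlier.
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    tensor = tensor.to(device)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return tensor, time.perf_counter() - start

uploaded, elapsed = timed_upload(image, device)
print(f"upload took {elapsed:.4f} s")
```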
My guess is that this is caused by the CUDA cache filling up, which then needs to be emptied. Am I right? Can emptying the CUDA cache really take 4 seconds, or could something else be slowing the process down? And if it is the cache, is there any way to speed up the emptying?
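What I have tried so far is releasing the cached memory explicitly between pipeline stages, along these lines (a sketch; note that `torch.cuda.empty_cache()` only returns cached blocks that are no longer referenced by any tensor, so references must be dropped first):

```python
import torch

# Only meaningful on a CUDA build of PyTorch with a device present.
if torch.cuda.is_available():
    reserved_before = torch.cuda.memory_reserved()
    # empty_cache() hands unused cached blocks back to the driver; it does
    # not free tensors that are still alive in Python.
    torch.cuda.empty_cache()
    reserved_after = torch.cuda.memory_reserved()
    print(f"reserved: {reserved_before} -> {reserved_after} bytes")
else:
    print("no CUDA device available")
```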