How to run pytorch custom inference on Jetson Nano's GPU?

I installed PyTorch on my Nano 2GB with the script provided in the “Jetson Inference” repository and tutorial. When I check in Python, torch is installed and torch.cuda.is_available() returns True.

I prepared a small benchmark to check that it works as it should: it runs a single-image inference through a pretrained AlexNet.

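import time

import torch
from torchvision import models

# Load pretrained AlexNet and switch it to inference mode
alexnet = models.alexnet(pretrained=True)
alexnet.eval()

# One random input with the shape of a single 224x224 RGB image
input_batch = torch.randn((1, 3, 224, 224))

start_time = time.time()
output_batch = alexnet(input_batch)
print(time.time() - start_time)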

On my laptop's CPU (10th-gen i5) this takes 0.04 s; on the Jetson Nano it takes more than 0.4 s.

When I moved both the model and the input batch to the GPU with .to(torch.device("cuda")), the script took 46 seconds to run inference on this one image (?!).

I don’t think this is normal. How can I use PyTorch models properly on my Nano?

I might add that the “detectnet” demo, for example, runs at over 20 FPS, so I think PyTorch, JetPack and the CUDA libraries are installed correctly.

If this was the first inference iteration of the program, it takes longer because PyTorch needs to load and initialize a bunch of CUDA libraries the first time a GPU operation is performed. Try discarding the time of the first iteration and timing a bunch of iterations after that.
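A minimal sketch of that suggestion, for illustration only: the helper names benchmark and time_alexnet are hypothetical, and the warm-up and iteration counts are arbitrary. Because CUDA kernels are launched asynchronously, the sketch also calls torch.cuda.synchronize() before reading the clock, so that the measured time covers work the GPU has actually finished.

```python
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return mean seconds per call of fn(), ignoring the first `warmup` calls.

    `sync` is an optional callable run before each clock read; on a GPU,
    pass torch.cuda.synchronize so queued kernels have finished.
    """
    for _ in range(warmup):
        fn()  # warm-up: CUDA context creation, library loading, etc.
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


def time_alexnet():
    """Hypothetical usage; needs torch and torchvision installed."""
    import torch
    from torchvision import models

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = models.alexnet(pretrained=True).eval().to(device)
    batch = torch.randn(1, 3, 224, 224, device=device)

    def step():
        with torch.no_grad():  # no autograd bookkeeping during inference
            model(batch)

    sync = torch.cuda.synchronize if device.type == "cuda" else None
    return benchmark(step, sync=sync)
```

Calling print(time_alexnet()) would then report the average time per inference once the one-time start-up cost has been excluded.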

Hello, thanks for the response.

I modified the script so that it runs several times, and the execution time decreases with each iteration, as you suggested:
- 51 s
- 1.39 s
- 0.15 s
- 0.10 s
- 0.05 s
- 0.03 s

However, I also included in the timed part of the code the line that moves the data to torch.device("cuda"), because it seems to me that I’d have to do that every time I run inference. Without the time spent moving the tensor to the CUDA device, inference takes only 0.008 s.

So I would also like to ask whether moving a tensor to the GPU with Tensor.to(torch.device("cuda")) is the recommended method on the Nano, or whether there is a faster alternative.

Unless you can modify or re-use a tensor in-place that has already been allocated on the GPU, the transfer would seem to be necessary. You could also try using PyTorch’s APIs for pinned memory: https://pytorch.org/docs/stable/notes/cuda.html#use-pinned-memory-buffers
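Both ideas can be combined roughly as follows. This is a sketch under assumptions, not a tuned recipe: the names host_buf, gpu_buf and upload are made up for illustration, the shape matches the AlexNet input used earlier in the thread, and whether it actually helps on the Nano would need to be measured. The buffers are allocated once, and each new frame is copied into them in place instead of creating a fresh GPU tensor with .to() on every iteration.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = device.type == "cuda"

# Allocate both buffers once, outside the inference loop.
host_buf = torch.empty(1, 3, 224, 224)
if use_cuda:
    host_buf = host_buf.pin_memory()  # page-locked host memory
gpu_buf = torch.empty(1, 3, 224, 224, device=device)


def upload(frame):
    """Copy `frame` (a CPU tensor) into the pre-allocated device buffer."""
    host_buf.copy_(frame)  # stage the data in the pinned buffer
    # non_blocking only has an effect for pinned host -> GPU copies
    gpu_buf.copy_(host_buf, non_blocking=use_cuda)
    return gpu_buf


frame = torch.randn(1, 3, 224, 224)
x = upload(frame)  # x is gpu_buf, refilled in place
```

With a non-blocking copy, host_buf should not be overwritten until the copy has finished, so a real loop would call torch.cuda.synchronize() (or use a CUDA event) before staging the next frame.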

In reality, PyTorch isn’t the most optimized library for realtime inferencing and as such there are faster alternatives such as TensorRT (like the jetson-inference library uses). jetson-inference is also careful to use zero-copy memory to avoid needing CPU/GPU memory transfers or the overhead of allocating memory at runtime.