I’m trying to use PyTorch on my Jetson Nano with this official Docker image: nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.10-py3
However, as soon as I move anything to GPU memory, it allocates all the RAM plus swap, making the board unusable.
To reproduce it, just run python3 -c "import torch; torch.rand(1).cuda();" from inside the container.
According to tegrastats, the memory peaks at RAM 1846/1980MB (lfb 2x512kB) SWAP 580/5086MB.
When I use pycuda directly I can allocate memory and do whatever I want, and no crazy allocations happen. I even used trtexec to convert a model and ran it with pycuda + tensorrt, again no problem. It’s only PyTorch. I also tried older images (r32.6.1-pth1.9-py3), but the problem was the same.
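For reference, this is roughly what I mean by “using pycuda directly” (just a sketch, the array size is an arbitrary placeholder):

```python
# Rough sketch of the direct pycuda usage that works fine on the Nano
# (the 1024x1024 float32 array is just a placeholder, ~4 MB)
import numpy as np
import pycuda.autoinit          # creates the CUDA context
import pycuda.driver as cuda

host_array = np.random.rand(1024, 1024).astype(np.float32)
device_mem = cuda.mem_alloc(host_array.nbytes)   # allocates only what is asked for
cuda.memcpy_htod(device_mem, host_array)         # copy to the GPU, no memory spike
```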
Any help will be much appreciated, thanks in advance!
Hi, thank you very much for your fast reply, but I don’t think this is the case, because everything works as expected when I do NOT use the GPU with PyTorch (no calls to .cuda() or .to(device='cuda')). I tried PyTorch with big models, and as long as I don’t use CUDA anywhere (so everything stays on the CPU), it just works, without this crazy memory allocation behaviour.
I tried the official docker images down to PyTorch 1.7, but as soon as I send anything to CUDA the memory blows up. I think someone from Nvidia ignored the existence of the Jetson Nano 2GB and it’s allocating 4GB by default when using CUDA.
As I mentioned in my first message, when I run a huge model on the GPU, but WITHOUT PyTorch, it works perfectly. A model converted from ONNX with trtexec and run with pycuda + tensorrt just works. Therefore it’s not my Jetson Nano’s fault :)
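For context, the TensorRT + pycuda path that runs without any memory spike looks roughly like this (just a sketch, “model.engine” is a placeholder for an engine produced by trtexec):

```python
# Rough sketch of TensorRT + pycuda inference on an engine built with
# trtexec --onnx=model.onnx --saveEngine=model.engine ("model.engine" is a placeholder)
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# One host/device buffer pair per binding, sized from the engine itself
host_bufs, dev_bufs = [], []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    size = trt.volume(engine.get_binding_shape(i))
    host_bufs.append(np.zeros(size, dtype=dtype))
    dev_bufs.append(cuda.mem_alloc(host_bufs[-1].nbytes))

cuda.memcpy_htod(dev_bufs[0], host_bufs[0])     # copy the input to the GPU
context.execute_v2([int(d) for d in dev_bufs])  # run inference
cuda.memcpy_dtoh(host_bufs[-1], dev_bufs[-1])   # copy the output back
```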
It really feels like someone at Nvidia decided the Jetson Nano 2GB should not be used anymore, but ecologically and economically that would be really absurd, so I will stick to the idea that it’s a silly bug that Nvidia should fix ASAP.
@ricardo.azambuja what it’s doing is loading the huge amount of CUDA kernel code that PyTorch has compiled (PyTorch only does this the first time you actually use the GPU). It’s not just allocating blank memory, but alas many of those PyTorch kernels go unused (so they can be paged out if you have sufficient swap). Unfortunately PyTorch doesn’t implement a way to selectively load only the kernels that are needed, and it’s not an NVIDIA bug. For deployment and optimized memory/runtime usage, it’s recommended to export models from PyTorch (typically via ONNX) and run them with TensorRT.
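For example, a minimal export sketch along those lines (the torchvision model and the input shape are just placeholders for your own model):

```python
# Minimal sketch of exporting a PyTorch model to ONNX so it can be built with
# trtexec and run with TensorRT (model and input shape are placeholders)
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # export runs on the CPU, no .cuda() needed

torch.onnx.export(
    model, dummy_input, "resnet18.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=11,
)
# resnet18.onnx can then be converted with trtexec and run through TensorRT,
# so the PyTorch CUDA kernels never have to be loaded on the Nano.
```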