Excessive RAM usage

Hello, I have a few questions regarding RAM usage. We are operating on the Jetson Xavier NX with JetPack 4.6.

We have two projects running models in PyTorch; one produces artifacts that are a dependency for the other. Both are hungry for memory, so we are in the process of working through resource management. What I am observing is that after running inference with either model, the GPU memory allocation remains the same. I.e., when I create the model and load the weights, it takes approx. 500 MB of GPU memory. When I run inference, it consumes 1.1 GB, and after inference it remains at 1.1 GB until I shut the container down. I had been assuming that I had retained references to tensors somewhere in the code that I needed to dereference, but I wrote a method to check tensors and delete the non-parameter tensors in memory, yet the GPU allocation remains. So, I have a couple of questions:

  1. Is there an explicit way to determine all tensors in memory and whether they are model parameters? I’m currently examining all objects in the garbage collector, determining whether they are tensors, then determining whether they are of type torch.nn.Parameter or torch.device, and deleting them if neither.
  2. Is there perhaps a setting similar to PYTORCH_NO_CUDA_MEMORY_CACHING that I should be defining? My understanding is that this is bad practice in production due to the additional latency. But in a resource-limited environment where multiple containers need to pass GPU resources back and forth, is there such a setting, or some other method of preventing memory allocated at inference from being retained?
  3. When running jtop, I see that the memory allocated for each process doesn’t add up to the total memory used. Below is an example of what I am seeing from jtop (I assume ~2 GB is from the OS):
    a. Baseline (nvargus-daemon & symbot_server): 30 MB mem and 300 MB GPU each – total RAM used 2.7 GB
    b. Baseline + (Project 1 model loaded: 0.7 GB CPU, 0.7 GB GPU) – total RAM used 5.1 GB
    c. Baseline + (Project 2 model loaded: 0.7 GB CPU, 0.9 GB GPU) – total RAM used 4.8 GB
    d. Baseline + (Projects 1 & 2 models loaded: 1.1 GB CPU, 1.5 GB GPU) – total RAM used 7 GB
    e. Baseline + (both loaded and running inference) – RAM overflow
  4. Running docker stats lists memory usage that differs from, and is often smaller than, the memory listed in jtop. jtop agrees with tegrastats.
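Regarding item 1, the gc-based scan described above can be written compactly. Below is a minimal sketch (assuming PyTorch is installed; `live_tensors` and the sample model are illustrative names, not part of any API):

```python
import gc
import torch

def live_tensors():
    """Yield (tensor, is_parameter) for every tensor the GC can see."""
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                yield obj, isinstance(obj, torch.nn.Parameter)
        except ReferenceError:
            continue  # weakly referenced objects may vanish mid-scan

# Hypothetical example: one small model plus a stray activation tensor.
model = torch.nn.Linear(4, 2)
activations = torch.ones(8, 4)

n_params = sum(1 for _, is_p in live_tensors() if is_p)
n_other = sum(1 for _, is_p in live_tensors() if not is_p)
print(n_params, n_other)
```

Note that even after every non-parameter tensor is deleted, PyTorch's caching allocator keeps the freed blocks reserved for the process; `torch.cuda.empty_cache()` returns those cached blocks to the driver, but the CUDA context created at initialization stays resident until the process exits.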


PyTorch uses CUDA for inference, so most of the memory (~600 MB) is occupied by loading the CUDA-related libraries (especially cuDNN).
This memory won’t be freed by deleting tensors or parameters; it’s used for CUDA initialization.

Is TensorRT an option for you?
If you already have a model in ONNX format, it can be run with TensorRT easily.
TensorRT has several mechanisms that can help control memory (at a tradeoff with performance).
For example, you can limit the workspace or the backends (cuDNN, cuBLAS, …) based on the limited resources.

Below is an example for TensorRT 8.5 (JetPack 5).
It might have some API differences from TensorRT 8.2 (JetPack 4) but should be very similar:
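The original code was not preserved in this thread. As a stand-in, here is a hedged sketch of building an engine from ONNX with a capped workspace using the TensorRT 8.5 Python API (`model.onnx`, the output path, and the 256 MB limit are placeholders):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
# Cap the scratch memory TensorRT may use while building/running.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 256 << 20)

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```

On TensorRT 8.2 (JetPack 4), the workspace cap is set via `config.max_workspace_size = 256 << 20` instead of `set_memory_pool_limit`.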




Thank you for the tips. I got the ONNX version of our node up and running and tested with both the CPU provider and the TensorRT provider, without altering the memory options.

The CPU provider reduces the memory requirements by 1.5 GB, while TensorRT actually increases them by a significant margin, which confuses me a bit. Why would TensorRT take more memory than the PyTorch model, given that PyTorch is loading the entire CUDA library?


Are you running it with ONNX Runtime or PyTorch?
If so, could you try running it with trtexec instead?
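For reference, a typical trtexec invocation for an ONNX model looks roughly like the following (the model path is a placeholder; `--workspace` takes a size in MB on TensorRT 8.2 / JetPack 4):

```
# Build and time an engine directly from an ONNX model on Jetson:
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --workspace=256 --fp16
```

Running through trtexec isolates TensorRT's own memory footprint from whatever ONNX Runtime or PyTorch adds on top of it.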


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.