Strategy: how to overcome GPU Out-of-Memory?

I understand that the Jetson Nano has at most 4096 MB of memory, shared between the CPU and GPU, and that swap space cannot be used by the GPU.
Some of the 4096MB memory is used for ‘non-GPU’ functions.
If I am not mistaken, memory assigned to the GPU cannot be released.

So I wonder what are best practices to have a maximum amount of memory available for the GPU.
E.g.

  • how to ‘assign’ other (CPU) processes to swap,
  • configurations for common tools like OpenCV and TensorFlow,
  • headless modes,
  • is there some ‘trick’ to release GPU memory between operations …

Looking at the number and frequency of related questions, a ‘Sticky Topic’ might be very helpful.

I can’t answer all of it, but adding swap means that the processes which can use swap will do so. Even though the GPU itself cannot use swap, there is still an indirect benefit: CPU processes pushed out to swap leave more physical RAM free for the GPU.

Often CUDA (or any program) will use more than one thread/kernel, and each one uses more memory. A CUDA application will use less memory if it launches fewer kernels (and if the application itself uses fewer threads).

Headless modes still use a buffer, but the buffer does not have a monitor attached.

Thanks!
My current ‘best’ is:
using a lot of swap, with swappiness=100
JupyterLab via headless mode
rebooting before starting something ‘big’
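For what it’s worth, the swap/swappiness/headless setup above can be scripted. This is only a sketch: the swap file path and the 4 GB size are my assumptions, and on Jetson images zram swap may already be configured.

```shell
# Sketch of the setup above (file path and size are assumptions).
# Create and enable a 4 GB swap file:
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Swap aggressively (swappiness=100) and persist the setting:
sudo sysctl vm.swappiness=100
echo 'vm.swappiness=100' | sudo tee -a /etc/sysctl.conf

# Boot headless (no desktop) to free memory for the GPU:
sudo systemctl set-default multi-user.target
```

All of these need root, and the swap file has to be re-added to /etc/fstab to survive a reboot.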

If I understand correctly, in TensorFlow I can pre-allocate or reserve a certain amount of RAM for the GPU, following the example from https://www.tensorflow.org/guide/gpu.
So I would assume that after this, TF ‘knows’ that a maximum of e.g. 1 GB of RAM is reserved for the GPU. But why does TF still crash with OOM on the GPU, instead of falling back to the CPU, when the reserved memory is not sufficient for the GPU operation?

(Jetson Nano with JetPack 4.4 and TF 2.1.1)
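For reference, the pre-allocation I mean is the virtual-device configuration from the TF guide linked above; a minimal sketch (the 1024 MB cap is just the value from my runs):

```python
import tensorflow as tf

# Cap the amount of GPU memory TF may allocate (TF 2.1-era experimental API).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
else:
    print("No GPU visible; TF will run on the CPU.")
```

This must run before any op touches the GPU, otherwise TF raises a RuntimeError because the device is already initialized.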

I couldn’t answer that, and someone who knows more will need to. I can think of one possibility, though: typically the memory needs to be contiguous, and if you have enough space but that space is not contiguous, then it still cannot be used. Many direct hardware access devices are unable to use fragmented memory.

@linuxdev
Thanks for your reply!
Not sure if this is the issue. After a reboot I have about 3.2 GB of free RAM, so I would assume at least 1 GB should be contiguous. Also, TF should ‘complain’ during the allocation, not crash later.

I am really confused …
If I tell TF not to use the GPU, execution is (of course) very slow, but it does not crash. If I use the GPU (with or without pre-allocation), TF crashes with OOM. Why does the GPU try to use more memory than was pre-allocated? I would expect it to use the pre-allocated amount of RAM, with the CPU handling other operations.
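For the CPU-only runs mentioned above: the GPU can be hidden from TF without touching the script, via CUDA’s standard environment variable. A sketch (`my_model.py` is a placeholder for your own script):

```shell
# Hide all CUDA devices from child processes; TF then falls back to the CPU.
export CUDA_VISIBLE_DEVICES=""
echo "CUDA_VISIBLE_DEVICES is now '${CUDA_VISIBLE_DEVICES}'"
# python3 my_model.py   # <- run your TF script in this same shell
```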

GPUs often need buffers aligned and sized differently than a CPU does (this is one reason they are faster), which means the memory requirements can go up.
Another possibility is that the TF GPU interface keeps CPU copies of the data, which it then uploads to the GPU, basically doubling the RAM requirement when you use the GPU like this. (This is why TensorRT and other Nano-specific APIs are a good idea.)

Thanks for your comments!
I tried the examples from https://devblogs.nvidia.com/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/ some time ago, and I am facing the same OOM issues.
E.g. with loadSemanticSegmentation.py: first I see ‘Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1024 MB memory’, and a few seconds later the program stops with ‘tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[7,7,512,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]’
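That OOM is at least plausible from the sizes alone: a single float32 tensor of the shape in the error message already eats a large share of a 1024 MB budget. A quick back-of-the-envelope check (pure Python, no TF needed):

```python
# Size of one float32 tensor of shape [7,7,512,4096], as in the OOM message.
shape = (7, 7, 512, 4096)
elements = 1
for dim in shape:
    elements *= dim

bytes_needed = elements * 4          # float32 = 4 bytes per element
mib = bytes_needed / (1024 ** 2)
print(f"{elements} elements -> {mib:.0f} MiB")  # -> 102760448 elements -> 392 MiB
```

That is almost 40% of the 1024 MB device TF reported, for one tensor; add workspace buffers and the other tensors of the model and the limit is easily exceeded.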

Slowly getting frustrated …

If you’re already using TensorRT and the model doesn’t fit, then the model doesn’t fit. The Jetson Nano only has 4 GB of RAM to share between CPU and GPU. The general-purpose “AI” side of NVIDIA often runs on desktop GPUs, or even on fancy multi-GPU setups. The blog post you pointed at is about deep learning in general, not specific to the Jetson Nano.