Strategy: how to overcome GPU Out-of-Memory?

I understand that the Jetson Nano has at most 4096 MB of memory, shared between the CPU and GPU, and that swap space cannot be used by the GPU.
Some of the 4096MB memory is used for ‘non-GPU’ functions.
If I am not mistaken, memory assigned to the GPU cannot be released.

So I wonder what are best practices to have a maximum amount of memory available for the GPU.
E.g.

  • how to ‘assign’ other (CPU) processes to swap,
  • configurations for common tools like OpenCV and TensorFlow,
  • headless modes,
  • is there some ‘trick’ to release GPU memory between operations …

Looking at the number and frequency of related questions, a ‘Sticky Topic’ might be very helpful.

I can’t answer all of it, but adding swap means that the processes which can use swap will do so. Even though the GPU itself cannot use swap, there is still an indirect benefit: CPU processes pushed out to swap leave more physical RAM free for the GPU.

Often CUDA (or any program) will use more than one thread/kernel, and each one uses more memory. A CUDA application will use less memory if it launches fewer kernels (and if the application itself uses fewer threads).

Headless modes still use a buffer, but the buffer does not have a monitor attached.

Thanks!
My current ‘best’ is:
using a lot of swap, with swappiness=100
JupyterLab via headless mode
rebooting before starting something ‘big’
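For what it’s worth, the swap/swappiness/headless setup above can be scripted. This is only a sketch: the swap file path and the 4 GB size are my assumptions, and on Jetson images zram swap may already be configured.

```shell
# Sketch of the setup above (file path and size are assumptions).
# Create and enable a 4 GB swap file:
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Swap aggressively (swappiness=100) and persist the setting:
sudo sysctl vm.swappiness=100
echo 'vm.swappiness=100' | sudo tee -a /etc/sysctl.conf

# Boot headless (no desktop) to free memory for the GPU:
sudo systemctl set-default multi-user.target
```

All of these need root, and the swap file has to be re-added to /etc/fstab to survive a reboot.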

If I understand correctly, in TensorFlow I can pre-allocate or reserve a certain amount of RAM for the GPU, following the example from https://www.tensorflow.org/guide/gpu.
So I would assume that after this, TF ‘knows’ that a maximum of e.g. 1 GB of RAM is reserved for the GPU. But why does TF still crash with OOM on the GPU, instead of falling back to the CPU, when the reserved memory is not sufficient for the GPU operation?

(Jetson Nano with JetPack 4.4 and TF 2.1.1)
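For reference, the pre-allocation I mean is the virtual-device configuration from the TF guide linked above; a minimal sketch (the 1024 MB cap is just the value from my runs):

```python
import tensorflow as tf

# Cap the amount of GPU memory TF may allocate (TF 2.1-era experimental API).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
else:
    print("No GPU visible; TF will run on the CPU.")
```

This must run before any op touches the GPU, otherwise TF raises a RuntimeError because the device is already initialized.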

I couldn’t answer that, and someone who knows more will need to. I can think of one possibility, though: typically the memory needs to be contiguous, and if you have enough space but that space is not contiguous, then it still cannot be used. Many direct hardware access devices are unable to use fragmented memory.

@linuxdev
Thanks for your reply!
Not sure if this is the issue. After a reboot I have about 3.2 GB of free RAM, so I would assume at least 1 GB should be contiguous. Also, TF should ‘complain’ during the allocation, not crash later.

I am really confused …
If I tell TF not to use the GPU, execution is (of course) very slow, but it does not crash. If I use the GPU (with or without pre-allocation), TF crashes with OOM. Why does the GPU try to use more memory than was pre-allocated? I would expect it to use the pre-allocated amount of RAM, with the CPU handling other operations.
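For the CPU-only runs mentioned above: the GPU can be hidden from TF without touching the script, via CUDA’s standard environment variable. A sketch (`my_model.py` is a placeholder for your own script):

```shell
# Hide all CUDA devices from child processes; TF then falls back to the CPU.
export CUDA_VISIBLE_DEVICES=""
echo "CUDA_VISIBLE_DEVICES is now '${CUDA_VISIBLE_DEVICES}'"
# python3 my_model.py   # <- run your TF script in this same shell
```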

GPUs often need buffers aligned and sized differently than a CPU does (this is one reason they are faster), which means the memory requirements can go up.
Another possibility is that the TF GPU interface keeps CPU copies of the data, which it then uploads to the GPU, basically doubling the RAM requirement when you use the GPU like this. (This is why TensorRT and other Nano-specific APIs are a good idea.)

Thanks for your comments!
I tried the examples from https://devblogs.nvidia.com/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/ some time ago, and I am facing the same OOM issues.
E.g. with loadSemanticSegmentation.py: first I see ‘Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1024 MB memory’, and a few seconds later the program stops with ‘tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[7,7,512,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]’
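That OOM is at least plausible from the sizes alone: a single float32 tensor of the shape in the error message already eats a large share of a 1024 MB budget. A quick back-of-the-envelope check (pure Python, no TF needed):

```python
# Size of one float32 tensor of shape [7,7,512,4096], as in the OOM message.
shape = (7, 7, 512, 4096)
elements = 1
for dim in shape:
    elements *= dim

bytes_needed = elements * 4          # float32 = 4 bytes per element
mib = bytes_needed / (1024 ** 2)
print(f"{elements} elements -> {mib:.0f} MiB")  # -> 102760448 elements -> 392 MiB
```

That is almost 40% of the 1024 MB device TF reported, for one tensor; add workspace buffers and the other tensors of the model and the limit is easily exceeded.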

Slowly getting frustrated …

If you’re already using TensorRT and the model doesn’t fit, then the model doesn’t fit. The Jetson Nano only has 4 GB of RAM to share between CPU and GPU. The general-purpose “AI” side of NVIDIA often runs on desktop GPUs, or even on fancy multi-GPU setups. The blog post you pointed at is about deep learning in general, not specific to the Jetson Nano.