Unified memory allocation and swap

There are many documents on unified memory performance, but they all focus on how to optimize copy time or avoid copies entirely:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#performance-tuning

Instead, I am looking into a corner case: the Jetson board running out of RAM and starting to swap memory. Based on my experiments with L4T 32.4.4, the logic seems simple, but I would like some confirmation/correction, or maybe a pointer to relevant docs:

  • If regular CPU RAM is available, the system and most processes simply use it as much as possible.
  • This can lead to a situation where almost all RAM is consumed but swap is still empty.
  • DeepStream nvinfer loads and starts requesting RAM that must be accessible to the GPU.
  • The kernel tries to allocate real RAM for that, since swap cannot be accessed from the GPU. To make room, idle RAM used by other processes starts getting swapped out. Since memory is technically available, no out-of-memory error is generated. Instead, the allocating syscall patiently waits for this to finish, which can take up to seconds in extreme cases.
  • If these requests are repeated because different blocks are needed for different layers, the whole process can take up to a few minutes in extreme cases.
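One way to check this hypothesis is to sample the kernel's swap-out counter from /proc/vmstat just before and just after a slow allocation; a large pswpout delta means the wait was indeed spent swapping other processes out. A minimal sketch, using illustrative captured snapshots in place of live reads:

```shell
# Compare the kernel's pages-swapped-out counter across two snapshots.
# In a live run you would capture each with: grep pswpout /proc/vmstat
before="pswpout 104200"   # illustrative snapshot taken just before the allocation
after="pswpout 512344"    # illustrative snapshot taken right after it returned

pages_out=$(( $(echo "$after" | awk '{print $2}') - $(echo "$before" | awk '{print $2}') ))
# Pages are 4 KiB, so convert to MiB swapped out while the syscall was waiting.
mib_out=$(( pages_out * 4 / 1024 ))
echo "swapped out during allocation: ${mib_out} MiB"
```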

I saw a similar question here, but it was dropped by the author without explanation: A few really important questions regarding Jetson memory system

I do not know if what follows applies to this case, but I suspect it does…

Most of the time the GPU must use physical RAM. The use cases I know of are unable to use swap space/virtual memory. Sometimes there is also the requirement that the RAM be contiguous, and remapping via the memory manager will not get around this.

When you see documentation about zero copy, I suspect it only applies to physical RAM, at least on the GPU side. Maybe someone else can verify whether this is such a case.


Hi,

Please find the Jetson’s memory document below:

Physical memory is shared between the CPU and the GPU, and the GPU can only use the physical memory.

Thanks.

@AastaLLL I checked the link you provided, but it again delves into descriptions of the different memory types and which one is fastest. The word "swap" is not mentioned there even once.
My question is about the expected Jetson behavior when allocating GPU memory while there is not enough physical memory but still plenty of swap. I am used to getting out-of-memory errors on discrete GPUs, but Jetson seems to be doing something more complicated.

Swap only helps if it releases unfragmented physical memory for the GPU. Any fragmentation of physical memory from CPU use also fragments that memory for the GPU. Swap is virtual memory mapped by the memory controller, none of which can be used directly by the iGPU.

If the use of swap allows some CPU process to release memory, and if that released memory is contiguous with the memory used by the GPU, then it is useful. This is why a chunk of physical memory is sometimes reserved via the kernel command line at the start of kernel load: before random processes start running is the only time when it is simple to get contiguous memory (releasing RAM via swap has no effect unless the released block sits right next to the currently used GPU memory address). For a dGPU, all GPU VRAM is already contiguous, and CPU processes won't use GPU VRAM.

@linuxdev this sounds to me more like a std::bad_alloc exception than memory allocation timing out. If memory is so fragmented, and it is not virtually mapped and so cannot be defragmented, then there is simply no way to allocate it, right? No amount of waiting should resolve that.

I am looking at a case of 1-2 GB out of 8 GB being swapped out, so in most cases there should be enough room to find space for the allocation. Instead, I can reliably reproduce the problem by using the stress command on RAM and observing nvarguscamerasrc timing out while acquiring a buffer from the pool. Besides, I expect regular memory exposed only to the CPU to definitely be virtually mapped, so the kernel can do some defragmentation if needed and then allocate physically contiguous memory for use by the GPU.

I think NVIDIA unified memory is quite unique, since the memory manager needs to satisfy competing requirements from both the CPU and the GPU. This is why I expect that only NVIDIA could really provide a definitive answer to these questions.

There is no "reasonable" way to defragment the RAM regions involved (in fact, a Linux kernel security feature, if enabled, randomizes kernel-space memory and is a detriment to this). It is reasonable to reserve a contiguous block of RAM at the start of the kernel's life via a command-line argument, but "reversing" fragmentation later is not particularly practical. This is more or less an "out of memory" family of errors (or perhaps some form of bad-allocation exception, depending on the wording or the point of failure). Once you reach this condition you can't expect to swap something out and have the problem go away. Waiting might release memory somewhere, but only very rarely will it be memory contiguous to your GPU memory.
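For what it's worth, this kind of fragmentation is visible in /proc/buddyinfo: each column counts free blocks of order N, i.e. of size 4 KiB × 2^N, and if the high orders are all zero there is no large contiguous block even when plenty of RAM is nominally free. A sketch parsing one made-up sample line to find the largest free contiguous block:

```shell
# A sample /proc/buddyinfo line; live data comes from: cat /proc/buddyinfo
line="Node 0, zone   Normal   212   181    93    41    12     3     1     0     0     0     0"

# Columns after "Normal" count free blocks of order 0..10 (4 KiB .. 4 MiB).
# Find the largest order that still has at least one free block.
largest_kib=0
order=0
for count in $(echo "$line" | sed 's/.*Normal//'); do
  if [ "$count" -gt 0 ]; then
    largest_kib=$(( 4 << order ))
  fi
  order=$(( order + 1 ))
done
echo "largest free contiguous block: ${largest_kib} KiB"
```

In this sample the largest nonzero order is 6, i.e. a 256 KiB block, which matches the kind of `lfb 1x256kB` readings tegrastats shows under memory pressure.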

You could, if you get very lucky, release memory from a process unrelated to what you are doing, and if that memory is right next to your current GPU memory, combine it into a larger contiguous region. You'd have to be very lucky.

If you consume memory in a predictable and scripted manner, it is possible (assuming other processes are not also making this unpredictable) to recreate the problem. Scripting various startup processes that consume memory has a significant chance of fragmenting memory the same way each time.

Remember that memory is shared between this iGPU and the CPU. A dGPU has complete control of its dedicated VRAM. The iGPU can only use what is available at the moment it tries to allocate system RAM. The real question is not whether you've swapped out other RAM; it is what RAM you have locked in for the iGPU before the other swapped process starts using RAM (virtual or otherwise).

I do not remember the kernel command line which reserves RAM for the iGPU, but it does exist. It is very simple to add that argument to a kernel command line. If you have not burned security fuses, and thus the extlinux.conf is used, then you would simply add this to the “APPEND” key/value pair of the extlinux.conf.
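For reference, the usual mechanism for this on Linux is the CMA (contiguous memory allocator) pool, whose size can be set with the standard `cma=` kernel parameter. A sketch of what the extlinux.conf entry might look like; the surrounding APPEND values and the 512M size are illustrative, not a recommendation:

```
# /boot/extlinux/extlinux.conf -- add cma= to the existing APPEND line.
LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      APPEND ${cbootargs} quiet root=/dev/mmcblk0p1 rw rootwait cma=512M
```

After rebooting, the reserved pool size should be visible as `CmaTotal` in /proc/meminfo.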

Perhaps @AastaLLL knows what the kernel command line is to reserve a RAM region for the iGPU as a contiguous block at the moment the kernel loads.

Hi,

We don't have documentation for such information.
But we can check with the internal team to see if there is anything we can share on the public forum.

Thanks.

@AastaLLL thanks for looking into this! As I mentioned, this question has already popped up before, and even this thread has had quite a discussion. There is quite a lot of interest in the topic.

Hi,

Our internal team needs more information to check this further.
Could you share the memory usage breakdown by application with us?

Thanks.

Hi @AastaLLL ,

We are using a Jetson Xavier NX, so around 8 GB of RAM and 4 GB of swap. We have a whole ecosystem of services, but most of them use regular CPU memory, so I will focus only on the application that actually uses unified memory accessed on the GPU. After a fresh boot of the device and starting all services, tegrastats gives something around this:

RAM 6599/7771MB (lfb 172x4MB) SWAP 127/3886MB (cached 1MB)
RAM 6599/7771MB (lfb 172x4MB) SWAP 127/3886MB (cached 1MB)
RAM 6599/7771MB (lfb 172x4MB) SWAP 127/3886MB (cached 1MB)
RAM 6599/7771MB (lfb 172x4MB) SWAP 127/3886MB (cached 1MB)
RAM 6599/7771MB (lfb 172x4MB) SWAP 127/3886MB (cached 1MB)

So there is around 1 GB of RAM left, and swap is pretty much empty. Around 4 GB of RAM is actually mapped for the GPU by that single application, though:

ubuntu@device:~$ sudo cat /sys/kernel/debug/nvmap/iovmm/clients
CLIENT                        PROCESS      PID        SIZE
user                          app          4534   3984372K
total                                             3984372K
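As a sanity check on that figure, the nvmap SIZE column is in KiB, so the total shown above converts to GiB like this:

```shell
# nvmap reports sizes in KiB; convert the total to GiB (1 GiB = 1048576 KiB).
total_kib=3984372
total_gib=$(echo "$total_kib" | awk '{printf "%.2f", $1 / 1048576}')
echo "nvmap iovmm total: ${total_gib} GiB"
```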

But looking at top, the app seems to use another 1.3 GB of regular CPU memory:

PID  USER      PR   NI VIRT    RES    SHR    S  %CPU %MEM   SWAP TIME+    COMMAND
4534 ubuntu    20   0  13.041g 1.313g 272316 S  49.7 17.3   0    11:15.60 app

This is a stable state: buffers are processed smoothly and everything works as expected, even if a bit of extra memory is used.

The problem really starts when swap is more populated. To reproduce this reliably, I stress the RAM by 2 GB:

stress --vm 1 --vm-bytes 2G

This swaps out a lot of not-actively-used memory while stress keeps allocating/freeing memory:

RAM 7570/7771MB (lfb 1x256kB) SWAP 1840/3886MB (cached 69MB)
RAM 6803/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 7352/7771MB (lfb 6x2MB) SWAP 1840/3886MB (cached 69MB)
RAM 6964/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 6211/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 6117/7771MB (lfb 5x2MB) SWAP 1840/3886MB (cached 69MB)
RAM 7121/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 6374/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 5658/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 7406/7771MB (lfb 25x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 6601/7771MB (lfb 44x4MB) SWAP 1840/3886MB (cached 69MB)
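The lfb field in these lines is the key signal: it is the largest free block, and under stress it collapses from 172×4 MB to as little as 1×256 kB, meaning the allocator has to reclaim and compact before it can hand out anything large. A small sketch extracting that field from one of the lines above (assumes the tegrastats format shown):

```shell
# Extract the largest-free-block (lfb) field from a tegrastats line.
line="RAM 7570/7771MB (lfb 1x256kB) SWAP 1840/3886MB (cached 69MB)"
lfb=$(echo "$line" | sed 's/.*lfb \([^)]*\)).*/\1/')
echo "largest free block: $lfb"
```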

The app's memory, though, is not swapped out much since it is in active use; top output again:

PID  USER      PR   NI VIRT    RES    SHR    S  %CPU %MEM   SWAP TIME+    COMMAND
4534 ubuntu    20   0  13.041g 523592 271176 S  58.9  6.6 855384 20:41.12 app      

But the app starts to observe that memory allocation sometimes takes much longer than usual:

W20260213 14:28:00.630386 11646 NvidiaPipeline.cpp:142] Last acquire buffer latency: 1.19758 seconds    

Without this extra memory stress, the same line gives me anything from <1 ms to 100 ms in rare worst-case scenarios. But anything more than 1 second comes dangerously close to various timeouts we have across the app…

For context, this latency specifically measures the time it took to execute gst_buffer_pool_acquire_buffer using GstNvDsBufferPool. The pool is configured to hold 60 to 80 buffers with width 4104 and height 3046, so each buffer is around 4104 × 3046 × 1.5 (NV12) bytes ≈ 18 MiB.
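Putting numbers on that pool (a back-of-the-envelope sketch; NV12 is 1 byte/pixel of luma plus 0.5 byte/pixel of chroma, and any row/plane alignment padding the allocator adds is ignored here):

```shell
# Pool sizing: bytes per NV12 buffer and total at the maximum pool size.
width=4104; height=3046; buffers=80
bytes_per_buffer=$(( width * height * 3 / 2 ))      # 1.5 bytes per pixel
pool_mib=$(( bytes_per_buffer * buffers / 1048576 ))
echo "per buffer: ${bytes_per_buffer} bytes, pool of ${buffers}: ~${pool_mib} MiB"
```

So the pool alone wants roughly 1.4 GiB of GPU-accessible memory, on top of everything else, which explains why the allocator struggles once free RAM drops near that scale.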

And yes, I already checked that this is not due to a buffer not being released back to the pool on time. I tried many times using CPU stress to make our app artificially stall, and I do not observe the same problem. The total size of the queues in our app is less than the buffer pool size, so this should not happen anyway.

Just to reiterate and expand based on these details, I would like to know:

Is allocating NVMM memory (by requesting buffers from GstNvDsBufferPool) expected to take on the order of seconds in this situation of near RAM exhaustion?
Can concurrent allocating/freeing of memory by just two applications (stress + app) cause these allocations to take around a few seconds?
Could these delays be related to either defragmenting physical memory or just swapping out some other unused CPU memory?

Hi,

Thanks for the details of the experiment.

We are discussing this issue with our internal team.
Will update more information with you later.

Thanks.

Hi,

stress + app together can cause the delay, and the delays are mainly from reclaim/swapping (and secondarily compaction), not from the NVMM path itself. NVMM uses the same page allocator as the rest of the system, so it is subject to the same reclaim and I/O cost when memory is tight. Yes, allocation under near-exhaustion can reasonably take seconds.
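One way to confirm this attribution on the device is to watch the direct-reclaim and compaction stall counters in /proc/vmstat around a slow acquire; if allocstall and compact_stall grow during the call, the wait was spent in reclaim/compaction. A sketch using illustrative captured values (exact counter names vary slightly across kernel versions):

```shell
# Deltas of /proc/vmstat stall counters across a slow buffer acquire.
# Live values would come from: grep -E 'allocstall|compact_stall' /proc/vmstat
before_allocstall=1200;  after_allocstall=1354   # illustrative snapshots
before_compact=310;      after_compact=342

reclaim_stalls=$(( after_allocstall - before_allocstall ))
compact_stalls=$(( after_compact - before_compact ))
echo "direct-reclaim stalls: ${reclaim_stalls}, compaction stalls: ${compact_stalls}"
```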

Thanks.


I just want to add that swap adds a lot of time to whatever app uses it. That applies to anything running on the CPU, since the GPU does not use swap.