Hi @AastaLLL ,
We are using a Jetson Xavier NX, so around 8 GB of RAM and 4 GB of swap. We have a whole ecosystem of services, but most of them use regular CPU memory, so I will focus only on the one application that actually uses unified memory accessed on the GPU. After a fresh boot of the device and start of all services, tegrastats reports something like this:
RAM 6599/7771MB (lfb 172x4MB) SWAP 127/3886MB (cached 1MB)
RAM 6599/7771MB (lfb 172x4MB) SWAP 127/3886MB (cached 1MB)
RAM 6599/7771MB (lfb 172x4MB) SWAP 127/3886MB (cached 1MB)
RAM 6599/7771MB (lfb 172x4MB) SWAP 127/3886MB (cached 1MB)
RAM 6599/7771MB (lfb 172x4MB) SWAP 127/3886MB (cached 1MB)
So there is around 1 GB of RAM left and swap is pretty much empty. About 4 GB of RAM is actually mapped for the GPU by that single application, though:
ubuntu@device:~$ sudo cat /sys/kernel/debug/nvmap/iovmm/clients
CLIENT PROCESS PID SIZE
user app 4534 3984372K
total 3984372K
But according to top, the app seems to use another 1.3 GB of regular CPU memory:
PID USER PR NI VIRT RES SHR S %CPU %MEM SWAP TIME+ COMMAND
4534 ubuntu 20 0 13.041g 1.313g 272316 S 49.7 17.3 0 11:15.60 app
This is the stable state: buffers are processed smoothly and everything works as expected, even with a bit of extra memory in use.
The problem really starts when swap is more heavily populated. To reproduce this reliably, I stress RAM with an extra 2 GB:
stress --vm 1 --vm-bytes 2G
This pushes a lot of not-actively-used memory out to swap while stress keeps allocating and freeing memory (a rough C++ equivalent of this stress worker is sketched below, after the tegrastats output):
RAM 7570/7771MB (lfb 1x256kB) SWAP 1840/3886MB (cached 69MB)
RAM 6803/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 7352/7771MB (lfb 6x2MB) SWAP 1840/3886MB (cached 69MB)
RAM 6964/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 6211/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 6117/7771MB (lfb 5x2MB) SWAP 1840/3886MB (cached 69MB)
RAM 7121/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 6374/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 5658/7771MB (lfb 2x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 7406/7771MB (lfb 25x4MB) SWAP 1840/3886MB (cached 69MB)
RAM 6601/7771MB (lfb 44x4MB) SWAP 1840/3886MB (cached 69MB)
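For completeness, this is roughly what that stress worker is doing, a minimal C++ sketch based on my understanding of stress --vm (not the actual stress source):

// Rough illustration of `stress --vm 1 --vm-bytes 2G`: allocate 2 GiB,
// touch every page so the kernel must physically back it, free it,
// and repeat forever, keeping the allocator and swap churning.
#include <cstdlib>

int main() {
  const size_t kBytes = 2ULL * 1024 * 1024 * 1024;  // --vm-bytes 2G
  for (;;) {
    char *p = static_cast<char *>(std::malloc(kBytes));
    if (p == nullptr) continue;  // allocation can fail under pressure; retry
    for (size_t i = 0; i < kBytes; i += 4096)  // one write per 4 KiB page
      p[i] = 1;
    std::free(p);
  }
}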
The app's memory, though, is not swapped out much, since it is in active use; top output again:
PID USER PR NI VIRT RES SHR S %CPU %MEM SWAP TIME+ COMMAND
4534 ubuntu 20 0 13.041g 523592 271176 S 58.9 6.6 855384 20:41.12 app
But the app starts to observe that memory allocation sometimes takes far longer than usual:
W20260213 14:28:00.630386 11646 NvidiaPipeline.cpp:142] Last acquire buffer latency: 1.19758 seconds
Without this extra memory stress, the same line reports anything from under 1 ms up to about 100 ms in rare worst-case scenarios. But anything over 1 second comes dangerously close to various timeouts we have across the app…
For context, this latency specifically measures the time it takes to execute gst_buffer_pool_acquire_buffer on a GstNvDsBufferPool. That pool is configured to hold 60 to 80 buffers of width 4104 and height 3046, so each buffer is around 4104 × 3046 × 1.5 (NV12) bytes ≈ 18 MiB, and the whole pool is roughly 60–80 × 18 MiB ≈ 1.1–1.4 GiB.
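For reference, the measurement itself is nothing exotic; a minimal sketch of how it could look (illustrative only: acquire_with_timing is a hypothetical helper, and the pool is assumed to be configured and activated elsewhere; this is not our actual code):

#include <gst/gst.h>
#include <chrono>
#include <iostream>

// Hypothetical helper: time a single buffer acquisition from an
// already configured and activated pool (a GstNvDsBufferPool here).
static GstBuffer *acquire_with_timing(GstBufferPool *pool) {
  GstBuffer *buf = nullptr;
  const auto t0 = std::chrono::steady_clock::now();
  GstFlowReturn ret = gst_buffer_pool_acquire_buffer(pool, &buf, nullptr);
  const auto t1 = std::chrono::steady_clock::now();
  const double secs = std::chrono::duration<double>(t1 - t0).count();
  if (secs > 0.1)  // log only unusually slow acquires
    std::cerr << "Last acquire buffer latency: " << secs << " seconds\n";
  return (ret == GST_FLOW_OK) ? buf : nullptr;
}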
And yes, I already checked that this is not caused by buffers not being released back to the pool in time. I tried many times to artificially stall our app with CPU stress and did not observe the same problem. The total size of all queues in our app is also smaller than the buffer pool size, so this should not happen anyway.
Just to reiterate and expand on these details, I would like to know:
Is allocating NVMM memory (by requesting buffers from GstNvDsBufferPool) expected to take on the order of seconds in this near-RAM-exhaustion situation?
Could concurrent allocating/freeing of memory by just two applications (stress + our app) cause these allocations to take around a few seconds?
Could these delays be related to either defragmenting physical memory or swapping out other unused CPU memory?