[Sent this to firstname.lastname@example.org, but it looks like no one is reading that list; wondering whether it is even advised to send bug reports there?]
Problem: When a large number of non-compressed and compressed textures are used, there is a slow increase in low-memory usage, which eventually invokes the OOM killer and gets the application(s) killed.
Setup: Tested on Quadro FX 1800, 2000 and 600 cards, on drivers from 256.* onwards to the latest 310.19 driver; OpenSuSE 12.1, kernels 3.1.7 & 3.5.3, 4 GB RAM.
[This is a sample program that tries to mimic the texture usage of our actual game.]
- git clone https://github.com/surki/misc.git
- cd misc/gl/memtesttexture/
- ./memtest 300 400 # 300 == no. of uncompressed textures, 400 ==
# no. of compressed textures
- Wait for some time for the textures to be loaded and for the message "Going to render, press enter to continue" to appear. At this point, note down /proc/meminfo.
- Wait some more; "memtest" will be killed by the OOM killer.
What the program does
It creates a reasonable number of compressed and non-compressed
textures. This must push total memory usage above the video memory
size (so for the Quadro FX 1800 it must be >= 768 MB, and for the
2000 it must be >= 1 GB).
On every frame update, it randomly selects a group of compressed and
non-compressed textures and renders them.
The program eventually gets killed by the OOM killer (since it runs
out of low memory); you can control how soon by choosing the
appropriate number of random textures to render.
This seems to happen only when video memory is full and evictions
are taking place. We use NVX_gpu_memory_info to monitor the
evictions from video RAM.
When these evictions happen, free low memory starts decreasing
rapidly. Possibly the userspace part of the driver starts
requesting memory from the nvidia kernel module (which it provides
by calling __get_free_pages() with the GFP_KERNEL flag, so the
pages are allocated from ZONE_NORMAL).
When these evictions happen, the GL driver appears to cache certain
data (we assume this because killing the program releases all the
memory, and free low memory returns to its original level).
This cached data builds up over time and eventually exhausts low
memory.
If our entire memory usage is <= available video memory, we don't
see this problem.
If we modify the (open part of the) nvidia.ko kernel driver to
return RM_ERR_NO_FREE_MEM when free low memory reaches a certain
point (say 100 MB or so), there is no crash. This normally works,
but at times we have seen the GL program hang with high CPU usage,
stuck somewhere inside the GL driver shared library.
The modification was in nv.c: nv_alloc_pages()