High number of textures in playback, low memory exhaustion and OOM problem

[Sent this to linux-bugs@nvidia.com, but it looks like no one is reading that list; is it even advisable to send bug reports there?]

Problem: When a large number of uncompressed and compressed textures is used, there is a slow increase in low memory (ZONE_NORMAL) usage which eventually invokes the OOM killer, and the application(s) get killed.

Setup: Tested on Quadro FX 1800, 2000 and 600 cards, on drivers from 256.* onwards up to the latest 310.19, OpenSuSE 12.1, kernels 3.1.7 & 3.5.3, 4 GB RAM.

Repro:
[ This is a sample program that tries to mimic the behavior of our actual game’s texture usage ]

  1. git clone https://github.com/surki/misc
  2. cd misc/gl/memtesttexture/
  3. make
  4. ./memtest 300 400   # 300 = number of uncompressed textures,
                         # 400 = number of compressed textures
  5. Wait a while for the textures to load and for the message "Going to render, press enter to continue" to appear. At this point note down the contents of /proc/meminfo (a small watcher is sketched after this list)
  6. Wait some more; “memtest” will eventually be killed by the OOM killer
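
For step 5, a watcher like the sketch below can log the decline in low free memory while memtest runs. This helper is our illustration, not part of the repo; note that LowFree only appears in /proc/meminfo on kernels built with highmem (e.g. 32-bit), so on 64-bit kernels MemFree is the field to watch.

    /* Minimal /proc/meminfo watcher: prints LowFree and MemFree once a
     * second. Illustrative helper, not part of surki/misc. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char line[256];

        for (;;) {
            FILE *f = fopen("/proc/meminfo", "r");
            if (!f) {
                perror("fopen /proc/meminfo");
                return 1;
            }
            while (fgets(line, sizeof(line), f)) {
                if (strncmp(line, "LowFree:", 8) == 0 ||
                    strncmp(line, "MemFree:", 8) == 0)
                    fputs(line, stdout);
            }
            fclose(f);
            fputs("----\n", stdout);
            sleep(1);
        }
    }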
Notes:

What the program does:

    • Create a reasonable number of compressed and uncompressed
      textures. This must push total texture memory usage above the
      video memory size (so for the Quadro 1800 it must be >= 768 MB
      and for the 2000 it must be >= 1 GB). A minimal creation sketch
      follows this list.

    • On every frame update, randomly select a group of compressed
      and uncompressed textures and render them
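
For reference, the sketch below shows the kind of texture setup described above. It is illustrative only: the 512x512 size, RGBA8 and DXT5 formats are assumptions, and the actual code in the repo may differ. A current GL context and EXT_texture_compression_s3tc support are assumed.

    /* Illustrative texture creation; sizes and formats are assumptions,
     * not necessarily what memtest actually uses. */
    #define GL_GLEXT_PROTOTYPES
    #include <GL/gl.h>
    #include <GL/glext.h>

    #define TEX_DIM 512

    static GLuint make_uncompressed_texture(const void *pixels)
    {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        /* 512x512 RGBA8 = 1 MB of texel data per texture. */
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, TEX_DIM, TEX_DIM, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, pixels);
        return tex;
    }

    static GLuint make_compressed_texture(const void *blocks)
    {
        GLuint tex;
        /* DXT5 stores 16 bytes per 4x4 block: 256 KB for 512x512. */
        GLsizei size = (TEX_DIM / 4) * (TEX_DIM / 4) * 16;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glCompressedTexImage2D(GL_TEXTURE_2D, 0,
                               GL_COMPRESSED_RGBA_S3TC_DXT5_EXT,
                               TEX_DIM, TEX_DIM, 0, size, blocks);
        return tex;
    }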

Observations:

    • The program gets killed via OOM (since it runs out of low
      memory); how quickly this happens can be tuned by choosing the
      number of random textures rendered per frame.

    • This seems to happen only when video memory is full and
      evictions are happening. We use the NVX_gpu_memory_info
      extension to monitor evictions from video RAM (a query sketch
      follows this list).

    • When these evictions happen, low free memory starts decreasing
      rapidly. Possibly the userspace part of the driver starts
      requesting memory from the nvidia kernel module (which it
      provides by calling __get_free_pages with the GFP_KERNEL flag,
      so the pages are allocated from ZONE_NORMAL).

    • When these evictions happen, the GL driver appears to cache
      certain data (we assume this because killing the program
      releases all the memory, and low free memory returns to its
      original size).

    • This cached data builds up over time and eventually exhausts
      low memory.

    • If our entire memory usage fits within the available video
      memory, we don't see this issue.

    • If we modify the (open-source part of the) nvidia.ko kernel
      driver to return RM_ERR_NO_FREE_MEM when low free memory
      reaches a certain threshold (say 100 MB or so), there is no
      crash. This normally works, but at times we have seen the GL
      program hang with high CPU usage, stuck somewhere inside the GL
      driver shared library.

      The modification was in nv.c: nv_alloc_pages() (a sketch of the
      guard follows below).
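
The NVX_gpu_memory_info query we use for monitoring looks roughly like the sketch below; the enum values come from the published extension spec, while the reporting format is ours.

    /* Eviction monitoring via NVX_gpu_memory_info: eviction count and
     * total evicted memory (reported by the driver in KB). */
    #include <GL/gl.h>
    #include <stdio.h>

    #ifndef GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX
    #define GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX 0x904A
    #define GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX 0x904B
    #endif

    static void report_evictions(void)
    {
        GLint count = 0, evicted_kb = 0;
        glGetIntegerv(GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX, &count);
        glGetIntegerv(GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX, &evicted_kb);
        printf("evictions: %d, evicted memory: %d KB\n", count, evicted_kb);
    }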
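
And a sketch of the kernel-side guard from the last bullet. nv.c, nv_alloc_pages() and RM_ERR_NO_FREE_MEM are from the driver's open-source glue layer; the si_meminfo()-based low-free check and the 100 MB threshold here are our illustration, not necessarily the exact patch.

    /* Sketch of the guard added to the open-source glue layer of
     * nvidia.ko (nv.c: nv_alloc_pages()). Threshold and the
     * si_meminfo()-based check are illustrative. */
    #include <linux/kernel.h>
    #include <linux/mm.h>

    /* Refuse further allocations once free low memory drops below ~100 MB. */
    #define LOW_FREE_MIN_PAGES ((100UL << 20) >> PAGE_SHIFT)

    static int nv_low_memory_exhausted(void)
    {
        struct sysinfo si;

        si_meminfo(&si);
        /* Free pages in ZONE_NORMAL and below = total free minus free
         * highmem (freehigh is 0 on kernels without highmem). */
        return (si.freeram - si.freehigh) < LOW_FREE_MIN_PAGES;
    }

    /*
     * In nv_alloc_pages(), before the __get_free_pages(GFP_KERNEL, ...)
     * call:
     *
     *     if (nv_low_memory_exhausted())
     *         return RM_ERR_NO_FREE_MEM;
     */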