High number of textures in playback, low memory exhaustion and OOM problem

[Sent this to linux-bugs@nvidia.com, but it looks like no one is reading that list; is it even advisable to send bug reports there?]

Problem: When a large number of uncompressed and compressed textures is used, there is a slow increase in low memory (ZONE_NORMAL) usage which eventually invokes the OOM killer, and the application(s) get killed.

Setup: Tested on Quadro FX 1800, 2000 and 600 cards, on drivers from 256.* onwards up to the latest 310.19, OpenSuSE 12.1, kernels 3.1.7 & 3.5.3, 4 GB RAM.

Repro:
[ This is a sample program that tries to mimic the behavior of our actual game’s texture usage ]

  1. git clone https://github.com/surki/misc
  2. cd misc/gl/memtesttexture/
  3. make
  4. ./memtest 300 400   # 300 = number of uncompressed textures,
                         # 400 = number of compressed textures
  5. Wait a while for the textures to load and for the message "Going to render, press enter to continue" to appear. At this point note down the contents of /proc/meminfo (a small watcher is sketched after this list)
  6. Wait some more; “memtest” will eventually be killed by the OOM killer
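
For step 5, a watcher like the sketch below can log the decline in low free memory while memtest runs. This helper is our illustration, not part of the repo; note that LowFree only appears in /proc/meminfo on kernels built with highmem (e.g. 32-bit), so on 64-bit kernels MemFree is the field to watch.

    /* Minimal /proc/meminfo watcher: prints LowFree and MemFree once a
     * second. Illustrative helper, not part of surki/misc. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char line[256];

        for (;;) {
            FILE *f = fopen("/proc/meminfo", "r");
            if (!f) {
                perror("fopen /proc/meminfo");
                return 1;
            }
            while (fgets(line, sizeof(line), f)) {
                if (strncmp(line, "LowFree:", 8) == 0 ||
                    strncmp(line, "MemFree:", 8) == 0)
                    fputs(line, stdout);
            }
            fclose(f);
            fputs("----\n", stdout);
            sleep(1);
        }
    }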
Notes:

What the program does:

    • Create a reasonable number of compressed and uncompressed
      textures. This must push total texture memory usage above the
      video memory size (so for the Quadro 1800 it must be >= 768 MB
      and for the 2000 it must be >= 1 GB). A minimal creation sketch
      follows this list.

    • On every frame update, randomly select a group of compressed
      and uncompressed textures and render them
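
For reference, the sketch below shows the kind of texture setup described above. It is illustrative only: the 512x512 size, RGBA8 and DXT5 formats are assumptions, and the actual code in the repo may differ. A current GL context and EXT_texture_compression_s3tc support are assumed.

    /* Illustrative texture creation; sizes and formats are assumptions,
     * not necessarily what memtest actually uses. */
    #define GL_GLEXT_PROTOTYPES
    #include <GL/gl.h>
    #include <GL/glext.h>

    #define TEX_DIM 512

    static GLuint make_uncompressed_texture(const void *pixels)
    {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        /* 512x512 RGBA8 = 1 MB of texel data per texture. */
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, TEX_DIM, TEX_DIM, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, pixels);
        return tex;
    }

    static GLuint make_compressed_texture(const void *blocks)
    {
        GLuint tex;
        /* DXT5 stores 16 bytes per 4x4 block: 256 KB for 512x512. */
        GLsizei size = (TEX_DIM / 4) * (TEX_DIM / 4) * 16;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glCompressedTexImage2D(GL_TEXTURE_2D, 0,
                               GL_COMPRESSED_RGBA_S3TC_DXT5_EXT,
                               TEX_DIM, TEX_DIM, 0, size, blocks);
        return tex;
    }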

Observations:

    • The program gets killed via OOM (since it runs out of low
      memory); how quickly this happens can be tuned by choosing the
      number of random textures rendered per frame.

    • This seems to happen only when video memory is full and
      evictions are happening. We use the NVX_gpu_memory_info
      extension to monitor evictions from video RAM (a query sketch
      follows this list).

    • When these evictions happen, low free memory starts decreasing
      rapidly. Possibly the userspace part of the driver starts
      requesting memory from the nvidia kernel module (which it
      provides by calling __get_free_pages with the GFP_KERNEL flag,
      so the pages are allocated from ZONE_NORMAL).

    • When these evictions happen, the GL driver appears to cache
      certain data (we assume this because killing the program
      releases all the memory, and low free memory returns to its
      original size).

    • This cached data builds up over time and eventually exhausts
      low memory.

    • If our entire memory usage fits within the available video
      memory, we don't see this issue.

    • If we modify the (open-source part of the) nvidia.ko kernel
      driver to return RM_ERR_NO_FREE_MEM when low free memory
      reaches a certain threshold (say 100 MB or so), there is no
      crash. This normally works, but at times we have seen the GL
      program hang with high CPU usage, stuck somewhere inside the GL
      driver shared library.

      The modification was in nv.c: nv_alloc_pages() (a sketch of the
      guard follows below).
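
The NVX_gpu_memory_info query we use for monitoring looks roughly like the sketch below; the enum values come from the published extension spec, while the reporting format is ours.

    /* Eviction monitoring via NVX_gpu_memory_info: eviction count and
     * total evicted memory (reported by the driver in KB). */
    #include <GL/gl.h>
    #include <stdio.h>

    #ifndef GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX
    #define GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX 0x904A
    #define GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX 0x904B
    #endif

    static void report_evictions(void)
    {
        GLint count = 0, evicted_kb = 0;
        glGetIntegerv(GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX, &count);
        glGetIntegerv(GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX, &evicted_kb);
        printf("evictions: %d, evicted memory: %d KB\n", count, evicted_kb);
    }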
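
And a sketch of the kernel-side guard from the last bullet. nv.c, nv_alloc_pages() and RM_ERR_NO_FREE_MEM are from the driver's open-source glue layer; the si_meminfo()-based low-free check and the 100 MB threshold here are our illustration, not necessarily the exact patch.

    /* Sketch of the guard added to the open-source glue layer of
     * nvidia.ko (nv.c: nv_alloc_pages()). Threshold and the
     * si_meminfo()-based check are illustrative. */
    #include <linux/kernel.h>
    #include <linux/mm.h>

    /* Refuse further allocations once free low memory drops below ~100 MB. */
    #define LOW_FREE_MIN_PAGES ((100UL << 20) >> PAGE_SHIFT)

    static int nv_low_memory_exhausted(void)
    {
        struct sysinfo si;

        si_meminfo(&si);
        /* Free pages in ZONE_NORMAL and below = total free minus free
         * highmem (freehigh is 0 on kernels without highmem). */
        return (si.freeram - si.freehigh) < LOW_FREE_MIN_PAGES;
    }

    /*
     * In nv_alloc_pages(), before the __get_free_pages(GFP_KERNEL, ...)
     * call:
     *
     *     if (nv_low_memory_exhausted())
     *         return RM_ERR_NO_FREE_MEM;
     */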