331.20 performance drop when VRAM full / PCIe Bandwidth Utilization spikes / NVRM errors

Hi,

I am running Arch Linux x86_64, kernel 3.12, with the nvidia 331.20 driver on a Gainward GeForce GTX 650 Ti Boost 1GB. I verified this issue with several desktop environments, several display setups (single / dual monitor), and several OpenGL settings in the NVIDIA X Server Settings GUI.

The issue occurs when running the Unigine Valley or Unigine Heaven benchmark (I did not test any other 3D games / apps).

It starts quite smoothly (~20 FPS), with performance equal to the Unigine benchmark in Windows/OpenGL. At scene 8 or 9 (Valley) or scenes 5-7 (Heaven), the performance suddenly drops massively and does not recover. From then on, even the first scene only plays at around 8-10 FPS, until the benchmark is closed and restarted.

Three strange things happen exactly when performance drops:

  1. In the NVIDIA Settings Manager, the PCIe Bandwidth Utilization suddenly rises from about 1% to 60-70% (and stays at that level until the benchmark is closed).
  2. When it happens, the VRAM is 100% full (1024MB/1024MB) and remains like that until the benchmark is closed.
  3. Lots of messages in dmesg, which look like this:

[ 332.094640] NVRM: GPU at 0000:01:00: GPU-ef8b5144-a87a-ca75-07cf-3c58e3bb763c
[ 332.094648] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
[ 332.205607] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000111, intr 10000000
[ 333.029376] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
[ 333.060179] NVRM: Xid (0000:01:00): 13, 0001 00000000 0000a097 00001614 00000000 0000000d
[ 333.106944] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000111, intr 10000000
[ 333.140734] NVRM: Xid (0000:01:00): 13, 0001 00000000 0000a097 00001614 00000000 0000000d
[ 333.154885] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
[ 333.187125] NVRM: Xid (0000:01:00): 13, 0001 00000000 0000a097 00001614 00000000 0000000d
[ 333.197998] NVRM: Xid (0000:01:00): 13, 0001 00000000 0000a097 00001614 00000000 0000000d
[ 333.226076] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
[ 333.267744] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
[ 333.287152] NVRM: Xid (0000:01:00): 13, 0001 00000000 0000a097 00001614 00000000 0000000d
[ 333.301732] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
[ 333.319120] NVRM: Xid (0000:01:00): 13, 0001 00000000 0000a097 00001614 00000000 0000000d
[ 333.347373] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
[ 333.362109] NVRM: Xid (0000:01:00): 13, 0001 00000000 0000a097 00001614 00000000 0000000d
[ 333.379526] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
[ 333.393192] NVRM: Xid (0000:01:00): 13, 0001 00000000 0000a097 00001614 00000000 0000000d
[ 333.406094] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
[ 333.416980] NVRM: Xid (0000:01:00): 13, 0001 00000000 0000a097 00001614 00000000 0000000d
[ 333.432796] NVRM: Xid (0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
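For anyone comparing logs, a quick tally of the Xid numbers makes it easier to see which errors dominate. The following is only a minimal sketch: the function name `count_xids` and the regular expression are my own assumptions based on the line format shown above, not anything provided by the driver.

```python
import re
from collections import Counter

# Assumed line format, taken from the dmesg excerpt above:
# [  333.060179] NVRM: Xid (0000:01:00): 13, 0001 00000000 ...
XID_RE = re.compile(r"NVRM: Xid \(([0-9a-fA-F:.]+)\): (\d+),")


def count_xids(log_text):
    """Return a Counter mapping Xid number -> number of matching lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# Two lines copied from the log above, used as sample input.
sample = (
    "[ 332.094648] NVRM: Xid (0000:01:00): 31, Ch 00000001, "
    "engmask 00000101, intr 10000000\n"
    "[ 333.060179] NVRM: Xid (0000:01:00): 13, 0001 00000000 "
    "0000a097 00001614 00000000 0000000d\n"
)

for xid, occurrences in sorted(count_xids(sample).items()):
    print(f"Xid {xid}: {occurrences}")
```

To use it on a real log, save the dmesg output to a file and pass the file's contents to `count_xids` instead of `sample`.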

I am hoping for a developer to take a look at this issue. If you need more information, just ask me. To me it looks like the VRAM management has a bug.

Thanks and regards
vibee

I did not see any attach-file-function here, so I uploaded the bug report file to my server: http://vibee.de/nvidia-bug-report.log.gz

After searching for this issue, I found out that my issue is identical to the one described in this thread by 15+ people: https://devtalk.nvidia.com/default/topic/529521/

So seriously, this issue has been known for months and has been reported by several users, and NVIDIA does not care? Come on!

I’ve had this for a while now in L4D2. That’s why I still play on Windows.

I have filed a bug report with NVIDIA and linked to this thread. Hopefully they will take a look and fix it. :)

You could help if you also filed a bug report: http://www.nvidia.com/object/driverqualityassurance.html

I posted a report to Valve a while ago if NVIDIA wants to view the details:

https://github.com/ValveSoftware/Source-1-Games/issues/519#issuecomment-27270527

The problem is that NVIDIA won’t read that thread; they don’t even read this forum. We will only be noticed if we file detailed bug reports directly with NVIDIA. It might be enough to copy-paste your post into the bug report form I linked above. :)

However it’s good to know that the problem occurs with Ubuntu too. I was just about to install Ubuntu to check this.

By the way, it seems to be the same problem as described here by tons of people:


And all the other threads in Valve forum linked in these threads. I’m wondering why Nvidia did not respond yet. Maybe they are not able to fix the bug?

Considering how prominent the issue is on the Valve issue tracker, and the fact that NVIDIA has engineers embedded at Valve, I have no doubt that they are aware of the problem and working on a fix.

The fact that a fix has not been issued yet could indicate that it’s just a lot of work, or has to be coordinated with the Windows team or upstream projects, etc.

Hopefully the solution won’t be “buy a GPU with more VRAM”. :)

You’re probably right. I was confused because each open issue on the Valve bug tracker is “unassigned”.

I will just keep on playing on Windows and doing everything else on Linux, hoping for a fix… Another solution, besides “buy a GPU with more VRAM”, actually is “buy an AMD GPU” ;) (Although I swore not to buy another AMD card due to the even worse Linux drivers…).

From the whole story, it looks like the Unigine benchmark initially starts at a high frame rate, and after some time or in certain scenes the frame rate drops to 7-10 FPS. The reduced frame rate then persists until you close or restart the Unigine benchmark. Correct me if I am missing something.

Please provide the following information:

  • Please attach the nvidia bug report to this forum thread (other web links are blocked internally).
  • Does the issue still reproduce if you uncheck “Sync to VBlank” in the nvidia-settings OpenGL settings?
  • Did any previous driver work correctly for you?
  • Does the issue reproduce if you disable Composite in xorg.conf?
  • Does the issue reproduce if you disable desktop effects (Compiz, KWin, etc.)?
  • Does the issue reproduce on bare X/Xorg?

The 331.38 change log had me hopeful but L4D2 is still a broken mess for me.

And it introduces a ton of performance problems and input lag.