Host system: CentOS 6.x (fully patched) running custom kernels; tested with 4.9-rc5 and 4.5.7. Some user-space tools, such as nsenter, have also been upgraded to fresh upstream versions to bridge the feature gap between CentOS 6 and the running kernel.
Container software: LXC 1.0.x with user namespaces enabled. Tested both with a container matching the host system running Unreal Tournament 2004, and with an Ubuntu Trusty container running Steam games (tested with Rocket League, Portal 2, and Team Fortress 2).
The previous driver version used was 352.63. After upgrading to 375.20 I ran into this performance issue in containerized Steam games. I also tested driver versions 370.28 and 361.42 with no difference. Kernel version does not matter, except that the older drivers do not build on 4.9-rc5, so that combination was skipped. I don't know what changed, but the issue is consistent.
I've switched in and out of a container using nsenter, and toggling user-namespace mode (-U) was the one thing that caused or eliminated the issue. Steam runs strictly in a container, so I cannot test that component outside of a user namespace.
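The toggle test above can be sketched roughly as follows. The PID and the glxgears payload are hypothetical placeholders, and the commands are echoed rather than executed, since running them for real needs root and a live container:

```shell
# Hypothetical PID of a process already running inside the container.
PID=12345

# Enter the container's mount, UTS, IPC, net, and PID namespaces,
# but NOT its user namespace -- rendering is normal in this mode.
echo nsenter -t "$PID" -m -u -i -n -p -- glxgears

# Same, but also joining the user namespace (-U) -- this is the one
# flag that triggers the stalls.
echo nsenter -t "$PID" -m -u -i -n -p -U -- glxgears
```

Everything else about the two invocations is identical, which is what points the finger at the user namespace specifically.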
The "performance issue" in question appears as frames rendering with multi-second delays between them. Game audio likewise plays a split-second of sound and then goes silent.
My own debugging information gathering
One item of interest: the X server gets hung up executing code at offset 0x00000000000e00a2 in nvidia_drv.so for a large amount of CPU time, measured with perf top -p $(pidof Xorg). This is not constant; sometimes games run absurdly slowly with no significant CPU usage.
I have also witnessed games issuing ioctl(/dev/nvidiactl, 0xc0104629, 0xfff319e4) and hanging at 100% kernel CPU time for almost exactly 4 seconds (give or take timer error), spending the time in the following call stack:
delay_tsc
__udelay
os_delay_us
_nv010245rm
__vdso_clock_gettime
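For anyone wanting to capture a kernel-side stack like the one above while the hang is in progress, a sketch of one way to do it; the PID is a hypothetical placeholder, and the commands are echoed rather than executed since they need root and a live hung process:

```shell
# Hypothetical PID of the hung game process.
GAMEPID=12345

# The kernel-side stack of a task blocked in an ioctl can be read directly:
echo cat /proc/"$GAMEPID"/stack

# Or sample with call graphs for a few seconds and inspect the report:
echo perf record -g -p "$GAMEPID" -- sleep 5
echo perf report --stdio
```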
nvidia-bug-report at http://www.dehacked.net/nvidia-bug-report.log.gz
Edit: Version numbers were incorrectly reported, fixed.