Major performance regression when games run in a user namespace (aka unprivileged container)

Host system: CentOS 6.x (patched up) running custom kernels, tested with 4.9-rc5 and 4.5.7. Some user tools have also been upgraded, such as nsenter, with fresh versions from the upstream sources to accommodate the missing features between CentOS 6 and the kernel version running.

Container software: LXC 1.0.x with user namespaces enabled. Tested with the same host system with Unreal Tournament 2004, and with Ubuntu Trusty running Steam games (tested with Rocket League, Portal 2 and Team Fortress 2).

The previous driver version I used was 352.63. I upgraded to 375.20 and ran into this performance issue in Steam (containerized) games. I also tested driver versions 370.28 and 361.42 with no difference. Kernel version does not matter, except that the older driver versions do not build on 4.9-rc5, so that combination was skipped. I don’t know what happened, but it’s a consistent issue.

I’ve switched in and out of a container using nsenter and toggling User namespace mode (-U) was the one thing that caused or eliminated the issue. Steam runs strictly in a container so I cannot test that component outside of a user namespace.

The “performance issue” in question appears to be rendering of frames with multiple second delays between each. Game audio similarly plays a split-second of sound and goes silent as a result.

My own debugging information gathering

One item of interest is the X server getting hung up running code in function 0x00000000000e00a2 of nvidia_drv.so for a large amount of CPU time, measured with perf top -p $(pidof Xorg). This does not happen every time; sometimes games run absurdly slowly with no significant CPU usage.

I have also witnessed games running ioctl(/dev/nvidiactl, 0xc0104629, 0xfff319e4) and hanging in 100% kernel CPU time for 4 seconds (almost exactly give or take timer error) spending time with the following call stack:

 delay_tsc
 __udelay 
 os_delay_us
 _nv010245rm
 __vdso_clock_gettime   
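For anyone reading the ioctl number above, here is a minimal sketch that splits it into its fields. It assumes the generic Linux ioctl bit layout (direction in bits 31:30, argument size in bits 29:16, type in bits 15:8, command number in bits 7:0), which is what x86 uses. The function name decode_ioctl is made up for this example and says nothing about what the driver does with the call.

```shell
#!/bin/sh
# Decode a Linux ioctl request number.
# Assumption: generic layout dir[31:30] size[29:16] type[15:8] nr[7:0] (as on x86).
decode_ioctl() {
    req=$(( $1 ))
    dir=$(( (req >> 30) & 0x3 ))      # 1 = write, 2 = read, 3 = read+write
    size=$(( (req >> 16) & 0x3fff ))  # size of the argument structure in bytes
    typ=$(( (req >> 8) & 0xff ))      # driver "magic" byte
    nr=$(( req & 0xff ))              # command number within that driver
    printf 'dir=%d size=%d type=0x%02x nr=0x%02x\n' "$dir" "$size" "$typ" "$nr"
}

decode_ioctl 0xc0104629
```

Under that assumption the request is a read+write ioctl with a 16-byte argument, type 0x46 and command number 0x29; the third ioctl argument (0xfff319e4) would then be the user-space pointer to that 16-byte structure.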

nvidia-bug-report at http://www.dehacked.net/nvidia-bug-report.log.gz

Edit: Version numbers were incorrectly reported, fixed.

I’ve started having the same problem on a 361.xx driver version that used to work in the past, so this is likely not actually a regression. Now I am trying to find out what has changed that broke it.

It seems I got my version numbers completely mixed up. Version 352.63 worked, 361.42 (and later) is broken. I will attempt more troubleshooting tomorrow.

First post has been updated with the proper version numbers.

361.42 fails, 358.16 works

To investigate this, we will need a simple reproduction. Can you provide one? No patched tools, no custom kernels, step by step instructions.

Okay, I only have this CentOS 6 system to test with. By “custom kernel” I mean downloaded from kernel.org but no third party modifications.

The only software required is LXC. This was done on LXC 1.0.x but should also work on 1.1.x

This was tested using ut2004, launched from the command line by user “user1”. Due to the way user namespaces remap UIDs, it will be necessary to chown the user’s $HOME/.ut2004 directory when switching between runs. In this example user1 has uid 500.

LXC may complain about permission issues in setting up its environment, but as long as you get a root shell where the hostname is “userns-mode” it is sufficient.

Working test case:

host# IDSHIFT=0
host# lxc-execute -n userns -s lxc.utsname=userns-mode -s lxc.id_map="u 0 $IDSHIFT 65000" -s lxc.id_map="g 0 $IDSHIFT 65000" /bin/bash
userns-mode# su - user1
userns-mode$ ut2004

Failing test case:

Same as working test case, but set IDSHIFT=1000000 and run “chown -R 1000500 ~user1/.ut2004” as root first. Don’t forget to change ownership back later.

In the failing case the game will produce a single frame of video and appear to freeze, or at least run at an abysmal framerate (such as 0.1 fps)
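The uid used in the chown for the failing case is just the mapping offset plus user1's uid. A small sketch of that arithmetic, using the values from this post (CONTAINER_UID and HOST_UID are names made up for the example; the chown is only printed, not run):

```shell
#!/bin/sh
# Values taken from the failing test case described above.
IDSHIFT=1000000      # first host uid of the mapped range
CONTAINER_UID=500    # uid of user1 as seen inside the namespace

# With a map of "0 $IDSHIFT 65000", an inside uid N appears on the host as IDSHIFT + N.
HOST_UID=$(( IDSHIFT + CONTAINER_UID ))

# Print the command instead of running it, so this is safe to try anywhere.
echo "chown -R $HOST_UID ~user1/.ut2004"
```

With IDSHIFT=0 the same arithmetic gives 500, which is why the working case needs the directory chowned back.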

I upgraded to 375.20 and ran into this performance

Did you try 375.10?
375.20 has a serious regression. It might be a waste of time to investigate issues with 375.20.

https://devtalk.nvidia.com/default/topic/977518/linux/problems-with-multiple-opengl-applications-running-simultaneously-with-375-20-on-a-gtx970/2

I tried a rough selection of driver versions from the archives, all with the same base kernel. 358.16 is the last version I tried that worked, 361.42 was the first version I tried that broke. I also tried one specimen from (almost) every major version up to 375, and all failed.

From the test case I described above, my best guess is something goes wrong when the UID of the process is mismatched from what the kernel sees. Being in a user namespace with identical UIDs does nothing out of the ordinary but when they’re remapped suddenly the game stops rendering properly.

Is this issue reproducible with the 375.26 driver? Is it reproducible without LXC, i.e. by just installing Steam and launching a game?

Tested with 375.26, it does still have the problem.

This should be reproducible without LXC, but it does require user namespaces. While I have not tried it yet, the unshare command should be all that’s needed to reproduce without LXC, but it’s more involved since you need a second root terminal to participate in setup. LXC simplifies the process of setting them up and is available in most common distros these days. The LXC commands I provided do not require any config file creation and should just run from any root shell immediately after LXC is installed.

My main gaming system has Steam installed within said user namespace (aka unprivileged) container, and games run within it. 358.16 works; 361.42 and all versions I’ve tested afterwards do not. For the sake of troubleshooting I also have a non-container copy of Unreal Tournament 2004, tested with and without the LXC commands above.

My above troubleshooting suggests that the issue is caused by the uid shifting that is involved in an unprivileged container. I’ve tried a few obvious code hacks to the kernel module and LD_PRELOAD hacks to the Xorg server in order to change the apparent uid of running processes, all to no avail.

Hi DeHackEd-v2,
We would like to reproduce this issue to debug further. I have never used LXC. Could you please provide step-by-step reproduction steps and explain the configuration needed to reproduce this issue? Please provide as much detail as possible to help us set up a reproduction environment.

I’ll build a test Ubuntu machine over the weekend for a clean install. Do you have a preferred version of Ubuntu for this? Or another distro?

My attempts have been thwarted by the only spare GPU I have being a GT200 series, which was last supported by version 340. While I can’t properly test this right now, below are my notes on what needs to be done for the version of LXC under Ubuntu and for Ubuntu’s specific needs.

The items below are my notes. They are not complete yet. I’m missing a critical step that’s giving me errors on Ubuntu, but not CentOS. Trying to figure out what.

Run the following as root:

  • apt-get install lxc
    usermod -v 1200000-1265535 -w 1200000-1265535
    chmod +x /var/lib/lxc
    mkdir /var/lib/lxc/userns
    useradd -m -u 1500 -g 1500 lxctest
    chown 1201500:1201500 -R /home/lxctest

    cat > /var/lib/lxc/userns/config <<EOF
    lxc.utsname = userns-environment
    lxc.rootfs = /
    lxc.id_map = u 0 1200000 65535
    lxc.id_map = g 0 1200000 65535
    EOF

  • xhost +
    lxc-start -n userns --share-ipc=$$ /bin/bash

Issues:

  • X authentication could be better
  • --share-ipc is fighting back against me

I’m missing a critical step that’s giving me errors on Ubuntu,

Hi DeHackEd-v2,
No rush. Please provide detailed info and repro steps so that we do not have to spend time trying many different things.

Wasn’t able to get a good (not hacky) install with the limited resources I have, so I just want to say I’m giving up on this. Sorry.

The most noteworthy change introduced with the 361 drivers was glvnd. Don’t know if current drivers still include non-glvnd libraries. At least 361.28 had those.
From the changelog:
"2016-02-09 version 361.28

* Added support for the following GPU:
    * GeForce 945A

* Added a legacy, non-GLVND libGL.so GLX client library to the NVIDIA
  Linux driver installer package, and the ability to select between a
  GLVND or non-GLVND GLX client library at installation time. This
  allows users to install the legacy non-GLVND GLX client library in
  order to work around compatibility issues which may arise due to GLX
  applications which depend upon behaviors of the NVIDIA GLX client
  driver which are not defined by the Linux OpenGL ABI version 1.0.

  By default, nvidia-installer will install the legacy, non-GLVND GLX
  client libraries. The --glvnd-glx-client command line option can be
  used to override the default, and install the GLVND GLX client
  libraries instead. Please contact the vendors of any applications
  that are not compatible with GLVND to ensure that their applications
  be updated for compatibility with GLVND.

  The presence of multiple GLX client libraries in the package has
  implications for repackagers of the NVIDIA driver; see the libGL.so
  entry in the "Installed Components" chapter of the README for details.

2016-01-13 version 361.18

2016-01-05 version 361.16

* The OpenGL Vendor-Neutral Driver (GLVND) infrastructure is now included
  and supported by the NVIDIA GLX and OpenGL drivers.  This should not
  cause any visible changes in behavior for end users, but some internal
  driver component libraries have been renamed and/or moved as a result.
  These changes may affect scripts that rely on the presence of NVIDIA
  OpenGL driver components other than those specified in the Linux OpenGL
  ABI version 1.0, maintainers of alternative NVIDIA driver installation
  packages, and applications which rely on the presence of any non-
  OpenGL/GLX symbols in the libGL.so.1 library and its dependencies in
  any way.

  Please see:
  
    https://github.com/NVIDIA/libglvnd

  For more information on the GLVND project.

  The Linux OpenGL ABI version 1.0 specification is available at:

    https://www.opengl.org/registry/ABI"

This issue would - I suspect - be caused by either something in the kernel driver where the cred structure has something unexpected in it, or in X11 where the userid of the connecting/rendering process doesn’t match some expected or desired value.

User namespaces do two major things:

  1. Root privilege checks (the capable(…) call) all fail unless the variant ns_capable(…) is used with a suitable user namespace passed

  2. Process-visible UIDs and GIDs vary depending on what user namespace the querying process is a member of. This allows a container to think a process runs as uid 0 (root) while the host sees it as a non-root process.
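To make the second point concrete, here is a rough sketch of how an inside uid translates to the uid the host sees. It reads lines in the same three-column format as /proc/PID/uid_map (inside start, outside start, count). The function name map_uid is invented for this example, and the sample map is the one from my Ubuntu notes above.

```shell
#!/bin/sh
# map_uid INSIDE_UID
# Reads uid_map-style lines ("inside-start outside-start count") on stdin and
# prints the matching uid in the parent namespace. Exits non-zero if unmapped.
map_uid() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { exit !found }
    '
}

# The lxctest user (uid 1500 inside) under the map "0 1200000 65535":
echo "0 1200000 65535" | map_uid 1500
```

This prints 1201500, the same value used in the chown for /home/lxctest. From a shell inside a namespace, something like map_uid 0 < /proc/self/uid_map should show which uid the container's "root" has in the parent namespace.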

I’m sure nVidia has all kinds of little under-the-hood changes between major versions which might cause this, but I can’t go any further so I’m just going to give up here. Thanks for trying.