We have a Slurm cluster with a mix of M40 and V100 GPU compute cards that runs a variety of work. Our machines (Dell C4140) have 4 cards each, and we use cgroups so that users only see the /dev/nvidia# devices their job requested.
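For context, the GPU isolation is the standard Slurm device-cgroup setup, roughly like the following (illustrative only, not our exact files):

    # cgroup.conf
    ConstrainDevices=yes

    # gres.conf (per node)
    Name=gpu File=/dev/nvidia[0-3]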
We upgraded our drivers from version 418.67 to 460.32.03 and one of our EGL programs stopped working. It looks like the eglQueryDevicesEXT or eglGetDisplay call tries to probe all 4 GPUs and fails after getting permission denied on any of them, even though the program will only use 1 GPU. If we ask the scheduler for all 4 GPUs, so that all 4 are in the program's cgroup, it works fine. If we ask for 1-3 GPUs, so that there is at least one /dev/nvidia# device the program cannot access, it fails. This was not a problem with the older driver, and no other changes to the cluster configuration were made at the same time.
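The reproduction depends on nothing more exotic than the GPU count requested from Slurm; roughly (the binary name below is a placeholder, not our actual program):

    # works: the job's device cgroup contains all of /dev/nvidia[0-3]
    srun --gres=gpu:4 ./egl_program

    # fails under 460.32.03: one or more /dev/nvidia# devices are blocked by the cgroup
    srun --gres=gpu:1 ./egl_program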
From an strace run, it looks like the program opens /dev/nvidiactl and then tries to open all 4 of the /dev/nvidia[0-3] devices. Getting permission denied on any of those devices, because the cgroup blocks access to it, appears to be what makes the program fail.
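I do not have the application source, but my understanding is that it enumerates GPUs through the EGL_EXT_device_enumeration extension. A minimal sketch of what I believe that code path looks like (my own illustration for discussion, not the actual application; the failure behavior shown in the comments is an assumption based on what we observe):

    /* Build: gcc egl_enum.c -o egl_enum -lEGL */
    #include <stdio.h>
    #include <EGL/egl.h>
    #include <EGL/eglext.h>

    int main(void)
    {
        /* eglQueryDevicesEXT is an extension entry point, loaded at runtime. */
        PFNEGLQUERYDEVICESEXTPROC queryDevices =
            (PFNEGLQUERYDEVICESEXTPROC)eglGetProcAddress("eglQueryDevicesEXT");
        if (!queryDevices) {
            fprintf(stderr, "eglQueryDevicesEXT not available\n");
            return 1;
        }

        EGLDeviceEXT devices[16];
        EGLint num_devices = 0;

        /* The driver probes every /dev/nvidia# it finds during enumeration.
         * With 460.32.03, a permission-denied on any one device appears to
         * make the whole call fail rather than just skipping that device. */
        if (!queryDevices(16, devices, &num_devices)) {
            fprintf(stderr, "eglQueryDevicesEXT failed (EGL error 0x%04x)\n",
                    (unsigned)eglGetError());
            return 1;
        }

        printf("found %d EGL device(s)\n", num_devices);
        return 0;
    }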
Did something change with EGL behavior between these driver versions that we need to be aware of?
Is this behavior expected and is it controlled by some GPU setting?
I am a Linux admin/engineer who supports the cluster and don't do any EGL software development myself, so please let me know if there is any additional information I should include to help find a resolution to this issue.