EGL stopped working in slurm+cgroups environment with driver upgrade

Hello
We have a slurm cluster with a bunch of M40 and V100 GPU compute cards, and it runs a variety of work. Our machines (Dell C4140) have 4 cards each, and we use cgroups so users only see the /dev/nvidia# devices that their job asks for.

We upgraded our drivers from version 418.67 to 460.32.03 and one of our EGL programs stopped working. It looks like the eglQueryDevicesEXT or eglGetDisplay call tries to check all 4 GPUs and fails after getting permission denied on any of them, even though it is only going to use 1 GPU. If we ask for all 4 GPUs from the scheduler, so all 4 are in the program's cgroup, it works fine. If we ask for 1-3 GPUs, so that there is a /dev/nvidia# device that the program cannot access, it fails. This behavior was not a problem with the older driver. No other changes to the cluster config were made at this time.

Looking at an strace run it looks like the program queries /dev/nvidiactl and then tries to query all 4 of the /dev/nvidia[0-3] devices. Getting permission denied on any of the 4 /dev/nvidia[0-3] devices because it is blocked due to cgroup access seems to be causing the program to fail.
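
A minimal sketch of that check, assuming Python 3 is available on the node: it tries to open each /dev/nvidia<N> node, roughly as the strace output suggests the driver does, and reports which ones the current cgroup lets through. The function name `probe_devices` and the glob pattern are illustrative assumptions, not part of any NVIDIA tool.

```python
import errno
import glob
import os


def probe_devices(paths):
    """Try to open each path read/write; return {path: "ok" or errno name}."""
    results = {}
    for path in paths:
        try:
            fd = os.open(path, os.O_RDWR)
        except OSError as exc:
            # A device cgroup denial typically surfaces as EPERM on open().
            results[path] = errno.errorcode.get(exc.errno, str(exc.errno))
        else:
            os.close(fd)
            results[path] = "ok"
    return results


if __name__ == "__main__":
    # Assumed naming: /dev/nvidia0, /dev/nvidia1, ... (skips nvidiactl etc.)
    nodes = sorted(glob.glob("/dev/nvidia[0-9]*"))
    for node, status in probe_devices(nodes).items():
        print(f"{node}: {status}")
```

Running this inside and outside the job's cgroup should show which device nodes are blocked, independent of any EGL code.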

Did something change with EGL behavior between these driver versions that we need to be aware of?
Is this behavior expected and is it controlled by some GPU setting?

I am a Linux admin/engineer who supports the cluster, and I don’t do any EGL software development myself, so please let me know if there is any information I need to include to help find a resolution to this issue.

Thanks.

My users are going to confirm this, but they have a machine running 450.51.05 that they say is working with cgroups. They will be installing 450.51.06, which we think was broken on some of our machines before we upgraded to 460.32.03.

I was able to remove slurm from the picture and reproduce this issue with a normal user shell and OS cgroups changes.

If we create a cgroup and add just one of the NVIDIA cards to devices.deny for the testing shell, nvidia-smi will show the remaining cards, but the eglQueryDevicesEXT call fails to find any displays even though I only blocked access to one of three cards in the test system.
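
For reference, a sketch of how such a deny rule can be built, assuming cgroup v1 with the devices controller. The helper names and the cgroup directory passed to `deny_device` are hypothetical; the rule format `c MAJOR:MINOR rwm` is the one devices.deny accepts, and writing it needs root.

```python
import os
import stat


def device_rule(st_mode, st_rdev, access="rwm"):
    """Build a cgroup v1 devices rule such as 'c 195:2 rwm' from stat fields."""
    if stat.S_ISCHR(st_mode):
        kind = "c"
    elif stat.S_ISBLK(st_mode):
        kind = "b"
    else:
        raise ValueError("not a device node")
    return f"{kind} {os.major(st_rdev)}:{os.minor(st_rdev)} {access}"


def deny_device(cgroup_dir, device_path):
    """Write a deny rule for device_path to <cgroup_dir>/devices.deny (needs root)."""
    st = os.stat(device_path)
    rule = device_rule(st.st_mode, st.st_rdev)
    with open(os.path.join(cgroup_dir, "devices.deny"), "w") as fh:
        fh.write(rule)
    return rule


if __name__ == "__main__":
    # /dev/null is used here only because it exists on every Linux system;
    # on a GPU node the path would be one of the /dev/nvidia# devices.
    st = os.stat("/dev/null")
    print(device_rule(st.st_mode, st.st_rdev))
```

Once the rule is in place and the test shell's PID has been added to the cgroup, anything started from that shell should get a permission error when opening the denied device.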

I just tested this with driver version 450.102.04 and our eglcheck program segfaults immediately after accessing the cgroup-blocked /dev/nvidia# device.

A quick shorthand of the trace (since our production systems are airgapped) is…

stat(/dev/nvidia2) = 0
open(/dev/nvidia2) = EPERM
ioctl(/dev/nvidiactl, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xaddr ) = 0
SIGSEGV

If this is not the correct place to get technical support please let me know where to go. This is where https://nvidia.custhelp.com sent me with this issue.

Per my second comment, my users were able to test version 450.51.06 on a multi-GPU machine and did not see the problem, so we still have a rather wide version range between the working and broken drivers.

I have not seen any response to this thread yet. Is there some way to confirm that NVIDIA has picked it up, or is there some other support avenue we should be using?

Thanks.

You can also try to report a bug here:
https://developer.nvidia.com/nvidia_bug/add

Same problem here. Have you found a solution yet?

Nothing yet.
I posted a bug on the site mentioned above but no response there either.

This is really a major issue, as EGL is basically unusable offscreen with slurm.
Could you please look into it?

They had me test 460.67 and the bug was only partially fixed.

They just posted in my https://developer.nvidia.com/nvidia_bug post that the next release of 460 and 465 should fix the EGL issue. It does not state what version that is or a date for the expected release.

Thanks a lot for following up.

Since I can’t open your bug report, I would be glad if you could let us know when you find a driver that fixes the issues. Good luck!

We also have this issue on our clusters, with Slurm and device cgroups.

The new versions 460.73.01 and 465.24.02 have been released. Did you check with either of those?

Not yet, I need a system administrator to generate a new compute node image with the new drivers, in order to test.

Driver version 460.73.01 does not fix it.

Ditto… I tested 460.73.01 as well and it is broken in a slightly different manner.
It works when you are accessing the numerically first devices. But if cgroups is blocking access to device 0 while allowing access to device 1, the EGL libraries appear to be calling device 1 "display 0" and then trying to use device 0 and failing.
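
To make the failure mode concrete, here is a toy model (not driver code) of the mismatch described above. It assumes EGL devices are enumerated only over the accessible /dev/nvidia<N> nodes, and contrasts a lookup that maps the EGL index back to the matching minor number with one that reuses the index as the minor number. The second function is only a guess at what the behavior looks like from the outside, and all names are hypothetical.

```python
def enumerate_visible(all_minors, denied):
    """Return the device minors a process can open, in ascending order."""
    return [m for m in sorted(all_minors) if m not in denied]


def minor_for_index_expected(visible, egl_index):
    """Expected mapping: EGL index N refers to the N-th visible GPU."""
    return visible[egl_index]


def minor_for_index_suspected(visible, egl_index):
    """Suspected faulty mapping: the EGL index is used as the minor itself."""
    return egl_index


if __name__ == "__main__":
    denied = {0}  # cgroup blocks /dev/nvidia0, allows /dev/nvidia1
    visible = enumerate_visible([0, 1], denied)
    for name, lookup in (("expected", minor_for_index_expected),
                         ("suspected", minor_for_index_suspected)):
        minor = lookup(visible, 0)
        status = "blocked" if minor in denied else "accessible"
        print(f"{name}: EGL device 0 -> /dev/nvidia{minor} ({status})")
```

Under this model, a job that is given only the first GPUs never notices the problem, which matches the observation that it works for the numerically first devices.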

Goeffrey, that is exactly our diagnosis too.

https://forums.developer.nvidia.com/t/linux-solaris-and-freebsd-driver-465-27-new-feature-branch-release/176631

I just tested 465.27 and it seems to have solved the problem. I was able to run my test program both with hand-created and modified cgroups and within slurm. I am asking our users to test their work on our debug GPU node to verify that the real EGL programs run.

Our production cluster is running the 460 line of drivers and I would prefer to stick with this line as long as it is the current production line. We have a lot more projects on this cluster that would have to test these drivers if we were going to push them out to the full cluster. When will 460 get a working EGL release?

I believe they fixed the bug in 460.80, as it has the same release message as 465.27:

“Fixed a regression that prevented eglQueryDevicesEXT from correctly enumerating GPUs on systems with multiple GPUs where access to the GPU device files was restricted for some GPUs.”

I have not tried it myself, though, and would be happy if anyone can confirm.

I just tested this and my test program is working with this version. I am contacting my users to have them confirm that it works with their real work as well.
