EGL stopped working in slurm+cgroups environment with driver upgrade

geoffrey.ransom · March 4, 2021, 4:06am

Hello
We have a Slurm cluster with a bunch of M40 and V100 GPU compute cards and it runs a variety of work. Our machines (Dell C4140) have 4 cards each and we use cgroups so users only see the /dev/nvidia# devices that their job asks for.

We upgraded our drivers from version 418.67 to 460.32.03 and one of our EGL programs stopped working. It looks like the eglQueryDevicesEXT or eglGetDisplay call tries to check all 4 GPUs and fails after getting permission denied on any of them, even though the program is only going to use 1 GPU. If we ask the scheduler for all 4 GPUs, so that all 4 are in the program's cgroup, it works fine. If we ask for 1-3 GPUs, so that there is a /dev/nvidia# device the program cannot access, it fails. This behavior was not a problem with the older driver. No other changes to the cluster config were made at this time.

Looking at an strace run, the program queries /dev/nvidiactl and then tries to query all 4 of the /dev/nvidia[0-3] devices. Getting permission denied on any of the 4 devices, because cgroup access blocks it, seems to cause the program to fail.
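For anyone who wants to check which device nodes a job can actually open without reading strace output, here is a minimal Python sketch. It is not part of the original report: the function name `probe_nvidia_devices` and the injectable `opener` parameter are made up for illustration, and it only mimics the open() step seen in the trace, not what the EGL library does internally.

```python
import errno
import os
import re


def probe_nvidia_devices(dev_dir="/dev", opener=os.open):
    """Try to open every nvidia<N> node in dev_dir and report the outcome.

    Returns a dict such as {"nvidia0": "ok", "nvidia2": "EPERM"}.
    `opener` is injectable only so the function can be exercised without GPUs.
    """
    names = [n for n in os.listdir(dev_dir) if re.fullmatch(r"nvidia\d+", n)]
    results = {}
    # Sort numerically so nvidia10 comes after nvidia2.
    for name in sorted(names, key=lambda n: int(n[len("nvidia"):])):
        path = os.path.join(dev_dir, name)
        try:
            fd = opener(path, os.O_RDWR)
        except OSError as exc:
            # A node blocked by the devices cgroup usually fails here
            # with EPERM, even though stat() on it still succeeds.
            results[name] = errno.errorcode.get(exc.errno, str(exc.errno))
        else:
            os.close(fd)
            results[name] = "ok"
    return results


if __name__ == "__main__":
    for device, status in probe_nvidia_devices().items():
        print(f"/dev/{device}: {status}")
```

Run inside the job's cgroup, this should list the allowed devices as "ok" and the blocked ones with an error name, which is the situation the newer driver appears to mishandle.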

Did something change with EGL behavior between these driver versions that we need to be aware of?
Is this behavior expected and is it controlled by some GPU setting?

I am a Linux admin/engineer who supports the cluster and I don’t do any EGL software development myself, so please let me know if there is any information I need to include to help find a resolution to this issue.

Thanks.

geoffrey.ransom · March 4, 2021, 11:41pm

My users are going to confirm this, but they have a machine running 450.51.05 which they say is working with cgroups. They will be installing 450.51.06, which we think was broken on some of our machines before we upgraded to the 460.32.03 version.

I was able to remove Slurm from the picture and reproduce this issue with a normal user shell and OS cgroup changes.

If we create a cgroup and add just one of the nvidia cards to devices.deny for the testing shell, nvidia-smi will show the remaining cards, but the eglQueryDevicesEXT call fails to find any displays even though I only blocked access to one of the three cards in the test system.
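As a rough illustration of the reproduction above, the following Python sketch builds the rule string that would be written to devices.deny. It assumes the cgroup v1 devices controller and that the NVIDIA character devices use major number 195 with the minor equal to the card index, which is the usual layout on Linux; the cgroup path and helper names are hypothetical, and the script only prints the command rather than changing anything.

```python
import os

# Assumption: NVIDIA character devices normally use major 195 on Linux,
# with /dev/nvidia<N> at minor N. Verify with `ls -l /dev/nvidia*`.
NVIDIA_CHAR_MAJOR = 195


def deny_rule(rdev):
    """Format a cgroup v1 devices rule (read/write/mknod) for a char device."""
    return f"c {os.major(rdev)}:{os.minor(rdev)} rwm"


def deny_rule_for_path(path):
    """Build the rule from an existing device node such as /dev/nvidia2."""
    return deny_rule(os.stat(path).st_rdev)


if __name__ == "__main__":
    # Hypothetical cgroup directory; adjust to the local cgroup v1 layout.
    cgroup = "/sys/fs/cgroup/devices/egltest"
    rule = deny_rule(os.makedev(NVIDIA_CHAR_MAJOR, 2))
    # Print the command instead of writing, so nothing is changed by accident.
    print(f"echo '{rule}' > {cgroup}/devices.deny")
```

After the rule is applied to the test shell's cgroup, nvidia-smi in that shell should still list the other cards, while the EGL test program can be run to see whether enumeration fails.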

geoffrey.ransom · March 5, 2021, 6:42pm

I just tested this with driver version 450.102.04 and our eglcheck program segfaults immediately after accessing the cgroup-blocked /dev/nvidia# device.

A quick shorthand of the trace (since our production systems are air-gapped) is…

stat(/dev/nvidia2) = 0
open(/dev/nvidia2) = EPERM
ioctl(/dev/nvidiactl, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xaddr ) = 0
SIGSEGV

If this is not the correct place to get technical support please let me know where to go. This is where https://nvidia.custhelp.com sent me with this issue.

geoffrey.ransom · March 10, 2021, 9:13pm

Following up on my second comment: my users were able to test version 450.51.06 on a multi-GPU machine and did not see the problem, so we still have a rather wide version range between the broken and working drivers.

I have not seen any response to this thread yet. Is there some way to confirm that NVIDIA has picked it up, or is there some other support avenue we should be using?

Thanks.

generix · March 10, 2021, 9:43pm

You can also try to report a bug here:
https://developer.nvidia.com/nvidia_bug/add

msundermeyer42 · March 12, 2021, 4:19pm

Same problem here. Have you found a solution yet?

geoffrey.ransom · March 16, 2021, 6:09pm

Nothing yet.
I posted a bug on the site mentioned above but have had no response there either.

msundermeyer42 · March 30, 2021, 10:05am

This is really a major issue, as EGL is basically unusable offscreen with Slurm.
Could you please look into it?

geoffrey.ransom · April 7, 2021, 5:35pm

They had me test 460.67 and the bug was only partially fixed.

They just posted in my https://developer.nvidia.com/nvidia_bug post that the next release of 460 and 465 should fix the EGL issue. It does not state what version that is or a date for the expected release.

msundermeyer42 · April 7, 2021, 5:58pm

Thanks a lot for following up.

Since I can’t open your bug report, I would be glad if you could let us know when you find a driver that fixes the issues. Good luck!

mboisson · April 21, 2021, 3:16pm

We also have this issue on our clusters, with Slurm and device cgroups.

generix · April 21, 2021, 3:26pm

The new versions 460.73.01 and 465.24.02 have been released. Did you check with either of those?

mboisson · April 21, 2021, 3:48pm

Not yet, I need a system administrator to generate a new compute node image with the new drivers, in order to test.

mboisson · April 21, 2021, 7:01pm

Driver version 460.73.01 does not fix it.

geoffrey.ransom · April 21, 2021, 8:04pm

Ditto… I tested 460.73.01 as well and it is broken in a slightly different manner.
It works when you are accessing the numerically first devices. But if cgroups blocks access to device 0 while allowing access to device 1, the EGL libraries appear to label device 1 as display 0 and then try to use device 0, which fails.

mboisson · April 23, 2021, 6:48pm

Geoffrey, that is exactly our diagnosis too.

generix · April 30, 2021, 7:19am

https://forums.developer.nvidia.com/t/linux-solaris-and-freebsd-driver-465-27-new-feature-branch-release/176631

geoffrey.ransom · May 4, 2021, 8:28pm

I just tested 465.27 and it seems to have solved the problem. I was able to run my test program both with hand-created and modified cgroups and within Slurm. I am asking our users to test their work on our debug GPU node to verify that the real EGL programs run.

Our production cluster is running the 460 line of drivers and I would prefer to stick with this line as long as it is the current production line. We have a lot more projects on this cluster that would have to test these drivers if we were going to push them out to the full cluster. When will 460 get a working EGL release?

msundermeyer42 · May 19, 2021, 4:19pm

“When will 460 get a working EGL release?”

I believe they fixed the bug in 460.80, as it has the same release note as 465.27:

“Fixed a regression that prevented eglQueryDevicesEXT from correctly enumerating GPUs on systems with multiple GPUs where access to the GPU device files was restricted for some GPUs.”

I have not tried it myself, though, and would be happy if anyone can confirm.

geoffrey.ransom · May 19, 2021, 5:41pm

I just tested this and my test program is working with this version. I am contacting my users to have them confirm that it works with their real work as well.
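To summarize the versions reported in this thread for anyone auditing node images, here is a small Python sketch. It encodes only what was posted above (465.27 and 460.80 reported as working on their branches) and deliberately returns None for any other branch rather than guessing; the names `FIXED_IN` and `has_reported_egl_fix` are made up for this example.

```python
def parse_version(text):
    """Turn a driver version such as '460.32.03' into (460, 32, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


# First release on each branch reported in this thread as working.
# Other branches are intentionally absent: the thread gives no fix point.
FIXED_IN = {
    460: (460, 80),
    465: (465, 27),
}


def has_reported_egl_fix(version):
    """Return True or False for the 460/465 branches, None otherwise."""
    parsed = parse_version(version)
    minimum = FIXED_IN.get(parsed[0])
    if minimum is None:
        return None
    return parsed >= minimum


if __name__ == "__main__":
    for v in ("460.32.03", "460.73.01", "460.80", "465.24.02", "465.27"):
        print(v, has_reported_egl_fix(v))
```

The installed version string could be taken from nvidia-smi or /proc/driver/nvidia/version on each node and passed to `has_reported_egl_fix`.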