I noticed that the permissions on /dev/nvidia-caps are different from those of all the other NVIDIA devices, in the sense that "other" cannot open or read this directory:
# ll /dev/ | grep nvidia
crw-rw-rw- 1 root root 195, 0 Sep 23 10:23 nvidia0
crw-rw-rw- 1 root root 195, 1 Sep 23 10:23 nvidia1
crw-rw-rw- 1 root root 195, 2 Sep 23 10:23 nvidia2
drwxr-x--- 2 root root 80 Sep 23 10:23 nvidia-caps/
crw-rw-rw- 1 root root 195, 255 Sep 23 10:23 nvidiactl
crw-rw-rw- 1 root root 195, 254 Sep 23 10:23 nvidia-modeset
crw-rw-rw- 1 root root 236, 0 Sep 23 10:23 nvidia-uvm
crw-rw-rw- 1 root root 236, 1 Sep 23 10:23 nvidia-uvm-tools
This is problematic because every time a script tries to access it, an access error is generated (the stat() calls fail with EACCES because "other" lacks the execute bit on the directory, so entries under it cannot even be looked up):
$ strace -f -e trace=open,access,stat .conda/lib/python3.11/site-packages/wandb/bin/nvidia_gpu_stats 2>&1 | grep EACCES
[pid 62754] stat("/dev/nvidia-caps/nvidia-cap1", 0x7ffc439d3190) = -1 EACCES (Permission denied)
[pid 62754] stat("/dev/nvidia-caps/nvidia-cap1", 0x7ffc439d31c0) = -1 EACCES (Permission denied)
[pid 62754] stat("/dev/nvidia-caps/nvidia-cap2", 0x7ffc439d3190) = -1 EACCES (Permission denied)
[pid 62754] stat("/dev/nvidia-caps/nvidia-cap2", 0x7ffc439d31c0) = -1 EACCES (Permission denied)
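The failure is easy to reproduce directly as the unprivileged user, without strace:

$ stat /dev/nvidia-caps/nvidia-cap1        # fails with "Permission denied" (no o+x on the directory)
$ sudo stat /dev/nvidia-caps/nvidia-cap1   # succeeds as root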
Each failed call also produces a lengthy audit log entry:
type=SYSCALL msg=audit(1727097299.112:183567): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=7f7a0a5f0000 a2=80000 a3=0 items=1 ppid=53072 pid=53085 auid=1026 uid=1026 gid=1028 euid=1026 suid=1026 fsuid=1026 egid=1028 sgid=1028 fsgid=1028 tty=(none) ses=8 comm="nvidia_gpu_stat" exe=".conda/lib/python3.11/site-packages/wandb/bin/nvidia_gpu_stats" key="access"
And with some scripts (from Weights & Biases) trying to access it continuously, our logs fill up overnight, crashing the machine.
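In the meantime, I'm considering suppressing just these denials on the audit side. This is only a sketch, assuming standard auditctl syntax and that prepending a "never" rule is acceptable in our audit policy:

# Prepend (-A) a "never" exit rule so EACCES (-13) results under /dev/nvidia-caps
# are no longer logged; the syscall list may need adjusting to whatever actually
# triggers the rule with key="access".
auditctl -A never,exit -F arch=b64 -S openat,newfstatat -F dir=/dev/nvidia-caps -F exit=-13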
I'm not sure if it's the fault of the scripts, of the permissions on this folder, or both.
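The simplest workaround I can think of is to relax the directory permissions by hand, though I don't know whether that is safe, or whether the driver resets them when the device nodes are recreated:

# Give "other" lookup/read access to the caps directory (a guess, not a recommendation)
sudo chmod o+rx /dev/nvidia-caps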
What do you suggest?
Thank you!
nvidia-bug-report.log.gz (771.6 KB)