Ncu returns ERR_NVGPUCTRPERM for cap_sys_admin users

I’ve read through other posts about this error, which I think had a range of other causes. I am able to avoid it by setting the module option:
NVreg_RestrictProfilingToAdminUsers=0

However, I’m trying to use a solution that worked in the past (some driver versions and OS updates ago; I’m not sure when it broke).

What I had working before was:
NVreg_RestrictProfilingToAdminUsers=1, with the user capability cap_perfmon enabled.

cap_perfmon was mentioned as an alternative in some documentation I read, which is why I tried it originally, and it worked. When I realized it was no longer working, I saw that the instructions say to use cap_sys_admin, so I tried enabling that capability too. It still did not work.

I am hoping I can get this to work again with the smallest capability grant possible (cap_perfmon), but I’d settle for cap_sys_admin, which is still more granular than opening profiling to every user.

FWIW, I’m using:

  • RHEL 8.8
  • Linux Kernel Version = 4.18.0-477.36.1.el8_8.x86_64
  • Nvidia Driver Version: 545.23.08
  • NCU Version 2022.2.0.0 (build 31140043) - OR -
  • NCU Version 2023.2.1.0 (build 33050884) - same result

The fact that setting the module option to “0” works should confirm that I’m unloading and reloading the driver with the correct options. To further confirm my environment:
[jenos@hostname ~]$ capsh --print |grep Current:
Current: cap_sys_admin=i
[jenos@hostname ~]$ cat /proc/driver/nvidia/params |grep -i prof
RmProfilingAdminOnly: 1
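
The driver-parameter check above can also be scripted. This is a minimal sketch, assuming the params file keeps the `Name: value` layout shown in the output above; `read_driver_param` is a hypothetical helper name, not part of any NVIDIA tool.

```python
from pathlib import Path

# Path taken from the command shown above.
PARAMS_PATH = Path("/proc/driver/nvidia/params")


def read_driver_param(text, name):
    """Return the value for `name` from 'Key: value' lines, or None if absent."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == name:
            return value.strip()
    return None


if __name__ == "__main__":
    if PARAMS_PATH.exists():
        value = read_driver_param(PARAMS_PATH.read_text(), "RmProfilingAdminOnly")
        print(f"RmProfilingAdminOnly = {value}")
    else:
        print(f"{PARAMS_PATH} not found (is the NVIDIA driver loaded?)")
```

A value of `1` would match the restricted setting described in this post; `None` would mean the key was not found in the file.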

From what I can tell, this is correctly configured and should work. Any ideas?
Thanks,

Jeremy

Hi, @jenos

Thanks for reporting this! I am afraid we cannot guarantee this behavior, as these are system settings. It seems that cap_sys_admin=i on your system didn’t give you enough permission to control the GPU device.

I’m not sure what to make of that. I’m following step 3 in the (linked) instructions provided by Nvidia. Also, this previously worked even with just cap_perfmon. The system has changed as it has received updates, both from the OS and from Nvidia. Is this same setup still working in the latest Nvidia test environments?

Hi, @jenos

Actually, I am not sure whether cap_perfmon works; we never tried this setting internally.

Regarding cap_sys_admin, can you clarify the steps you used to set this?

Sure.
I want the capability in a Slurm environment, so I’m using PAM (the pam_cap module) to apply it:
[root@hostname ~]# cat /etc/pam.d/slurm
#%PAM-1.0
account required pam_unix.so
account required pam_slurm.so
auth required pam_localuser.so
auth required pam_cap.so
session required pam_limits.so
[root@hostname ~]# cat /etc/security/capability.conf
^cap_sys_admin jenos
none *

[jenos@hostname ~]$ capsh --print |grep -v groups
Current: cap_sys_admin=i
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
Ambient set =
Current IAB: cap_sys_admin
Securebits: 00/0x0/1'b0 (no-new-privs=0)
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
secure-no-ambient-raise: no (unlocked)
uid=20152(jenos) euid=20152(jenos)
gid=202(grp_202)
Guessed mode: UNCERTAIN (0)
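
One detail in the output above may be worth cross-checking. In capsh's text format the flags after `=` are `e` (effective), `i` (inheritable) and `p` (permitted), so `cap_sys_admin=i` indicates the capability is in the inheritable set only, and the ambient set is shown as empty. The following is a minimal sketch that reads the kernel's own view of these sets from /proc/self/status. The bit numbers (21 for cap_sys_admin, 38 for cap_perfmon) come from linux/capability.h; `cap_sets` and `holds` are hypothetical helper names. Whether the driver looks at the effective set is an assumption here, not something confirmed in this thread.

```python
from pathlib import Path

# Bit positions as defined in linux/capability.h.
CAP_BITS = {"cap_sys_admin": 21, "cap_perfmon": 38}
SET_NAMES = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def cap_sets(status_text):
    """Parse the Cap* hex bitmasks from /proc/<pid>/status text."""
    sets = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in SET_NAMES:
            sets[key] = int(value.strip(), 16)
    return sets


def holds(mask, cap_name):
    """True if the bit for `cap_name` is set in `mask`."""
    return bool((mask >> CAP_BITS[cap_name]) & 1)


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():
        sets = cap_sets(status.read_text())
        for set_name in SET_NAMES:
            if set_name in sets:
                present = [c for c in CAP_BITS if holds(sets[set_name], c)]
                print(f"{set_name}: {', '.join(present) or '(neither)'}")
    else:
        print("/proc/self/status not available on this system")
```

Run from inside the Slurm job (so it inherits the same PAM session), this would show whether either capability reaches CapEff or only CapInh.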

Hi, @jenos

I see you also logged a bug in our internal system. Our developers are already working on it, and any updates will be communicated there directly. Thanks!

I did, yes. Sorry for the duplicate effort. I’ll continue iterating on the bug only.

OK, thanks!
Feel free to reach out to us if you need help!