Hello, I’m trying to allow non-root users to set application clocks on our A100 cards, but it appears the -acp option for nvidia-smi is deprecated. Is there an alternative? I see --auto-boost-permission, but that isn’t supported for this card (and I specifically want users to be able to use -ac, not --auto-boost-default). Driver version is 525.85.12, and this is an A100-40GB card.
~$ sudo nvidia-smi -acp UNRESTRICTED
Warning: This option is deprecated and will be removed in future releases
Treating as warning and moving on.
All done.
~$ sudo nvidia-smi --auto-boost-permission UNRESTRICTED
Changing auto boosted clocks permissions is not supported for GPU: 00000000:41:00.0.
Treating as warning and moving on.
All done.
Thanks, I tried it out, but in a Slurm environment it allows the user to change the clocks on all GPUs on the current node. It doesn’t restrict the user to the GPUs allocated to their job, the way nvidia-smi does. So I think this opens the door to other issues, like users changing clocks on GPUs they’re not running on.
~/$ dcgmi discovery -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: Quadro GV100 |
| | PCI Bus ID: 00000000:01:00.0 |
| | Device UUID: GPU-0ff8396c-51a0-9682-41ee-4cc278423f88 |
+--------+----------------------------------------------------------------------+
| 1 | Name: Quadro GV100 |
| | PCI Bus ID: 00000000:25:00.0 |
| | Device UUID: GPU-b6097f37-0603-0a1d-064f-f0a54d98acf1 |
+--------+----------------------------------------------------------------------+
| 2 | Name: NVIDIA A100-PCIE-40GB |
| | PCI Bus ID: 00000000:41:00.0 |
| | Device UUID: GPU-b042a986-47a1-f34f-c63f-aae3c3b89c1c |
+--------+----------------------------------------------------------------------+
| 3 | Name: NVIDIA A100-PCIE-40GB |
| | PCI Bus ID: 00000000:61:00.0 |
| | Device UUID: GPU-dc31e36f-a2d4-5937-0181-4e843a04ad3b |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
~/$ nvidia-smi
Fri Mar 3 09:55:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:61:00.0 Off | Off |
| N/A 31C P0 35W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
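One direction I’ve considered (just a sketch, not something we’ve deployed): a wrapper script that only permits clock changes on GPU indices listed in CUDA_VISIBLE_DEVICES, which Slurm sets for each job. The script name and the gpu_allowed helper are hypothetical:

```shell
#!/bin/sh
# Hypothetical wrapper (sketch): allow "set-app-clocks <gpu-index> <mem,sm>"
# only for GPUs listed in CUDA_VISIBLE_DEVICES, which Slurm sets per job.

gpu_allowed() {
    # Succeed if index $1 appears in the comma-separated
    # CUDA_VISIBLE_DEVICES list (e.g. "0,2").
    case ",${CUDA_VISIBLE_DEVICES}," in
        *",$1,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# The sudoers rule would then target this wrapper instead of nvidia-smi
# itself; the entry point would look something like:
#   gpu_allowed "$1" || { echo "GPU $1 is not in your allocation" >&2; exit 1; }
#   exec nvidia-smi -i "$1" -ac "$2"
```

One caveat: with cgroup-based GPU confinement, the indices a job sees may not match the node-global indices nvidia-smi uses, so a real implementation would probably need to map devices by UUID rather than index.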
My solution was to create a new group called “cuda”; users who belong to this group can run sudo nvidia-smi without a password. I did that by adding a file under /etc/sudoers.d/ with the following line:
%cuda ALL=(ALL) NOPASSWD: /usr/bin/nvidia-smi
I’m not sure what the downsides of this approach are — or rather, how it differs from what we’d have if the -acp option were still available. But this works for our environment, for now.
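If you want to tighten this later, sudoers can match command arguments, so the rule could be limited to setting and resetting application clocks rather than granting all of nvidia-smi. An untested sketch (sudoers wildcard matching is notoriously loose, so treat this only as a starting point):

```
%cuda ALL=(ALL) NOPASSWD: /usr/bin/nvidia-smi -i [0-9] -ac *, /usr/bin/nvidia-smi -i [0-9] -rac
```

This still wouldn’t restrict users to their own Slurm allocation, but it would at least keep other nvidia-smi subcommands behind a password.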