MIG permissions confusion (A100, driver 470.256.02)

Hi all –

I am experimenting with MIG on a quad-A100 system, and have worked through most of the material in the user guide. As root, I can turn on MIG mode, create a GPU Instance (GI), create a Compute Instance (CI), and see them all with the appropriate arguments to nvidia-smi.

I am interested in the “bare metal” use-case for non-privileged users.

What I want next is for non-privileged users to be able to run CUDA work on the CIs. I had imagined that users would see the available CIs in the output of nvidia-smi -L, I’d set up some kind of book-keeping mechanism to ensure CIs are not oversubscribed, and users would set CUDA_VISIBLE_DEVICES appropriately and run their tasks.
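A minimal sketch of that workflow in Python, under some loud assumptions: the sample listing below only approximates what nvidia-smi -L prints (the MIG UUID format differs between driver versions), and the `claimed` set is a hypothetical stand-in for whatever book-keeping mechanism tracks CIs already in use.

```python
import os
import re
from typing import Dict, List, Optional, Set

# Assumed shape of `nvidia-smi -L` output with one MIG device. The UUIDs are
# placeholders and the real format varies by driver version.
SAMPLE_LISTING = """\
GPU 0: NVIDIA A100 (UUID: GPU-aaaa)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-aaaa/5/0)
GPU 1: NVIDIA A100 (UUID: GPU-bbbb)
"""

# Matches only the indented "MIG ..." lines and captures the UUID in parentheses.
MIG_LINE = re.compile(r"^\s*MIG\s.*\(UUID:\s*(?P<uuid>[^)\s]+)\)")


def list_mig_uuids(listing: str) -> List[str]:
    """Return the UUID of every MIG device line in an `nvidia-smi -L` listing."""
    uuids = []
    for line in listing.splitlines():
        match = MIG_LINE.match(line)
        if match:
            uuids.append(match.group("uuid"))
    return uuids


def pick_free_device(listing: str, claimed: Set[str]) -> Optional[str]:
    """Pick the first MIG device the (hypothetical) ledger has not claimed."""
    for uuid in list_mig_uuids(listing):
        if uuid not in claimed:
            return uuid
    return None


def environment_for(uuid: str) -> Dict[str, str]:
    """Copy the current environment and restrict CUDA to one MIG device."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = uuid
    return env
```

In practice the listing would come from something like `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout`, and the dictionary from `environment_for` would be passed as `env=` when launching the user's task.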

But after setting up the CI, it does not seem to be visible to non-privileged users. Run as a regular user, nvidia-smi -L shows only the cards and no MIG devices, and nvidia-smi with no arguments shows MIG enabled on the first card (the only one I enabled it on) and includes the table for MIG devices, but the table is empty.

The docs say that the relevant permissions are those on /proc/driver/nvidia/capabilities/mig/config and /proc/driver/nvidia/capabilities/mig/monitor. Both are readable by the regular user.

What am I missing? Is the bare-metal use-case only for the root user? Is there somewhere else where I need to set the permissions?

Thanks in advance.

Hi again all –

I think I have a partial answer. Following the documentation, I can confirm that the permissions on the “compute instance” I created are right. The device is present at /dev/nvidia-caps/nvidia-cap49, and I can confirm that it has the right major:minor for the CI, because there is a corresponding entry at /proc/driver/nvidia/capabilities/gpu0/mig/gi5/ci0/access, a text file that lists the minor (among other things). The gi5 and ci0 in that path correspond to the GI and CI IDs that were used at creation time.
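The minor-number check described above could be scripted roughly as follows. This is a sketch, not a tested tool: it assumes the access file consists of `Key: value` lines and that the minor is listed under the key `DeviceFileMinor`; both are assumptions to verify against the actual file contents on your driver.

```python
import os
from typing import Dict


def parse_access_file(text: str) -> Dict[str, int]:
    """Parse `Key: value` lines from a capability access file into integers."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a `Key: value` line
        try:
            fields[key.strip()] = int(value.strip())
        except ValueError:
            continue  # skip lines whose value is not a plain integer
    return fields


def device_minor(device_path: str) -> int:
    """Return the minor number of a device node such as nvidia-cap49."""
    return os.minor(os.stat(device_path).st_rdev)


def ci_matches_device(access_path: str, device_path: str) -> bool:
    """Check that the access file names the same minor as the device node."""
    with open(access_path) as handle:
        fields = parse_access_file(handle.read())
    # "DeviceFileMinor" is the assumed key name; adjust it if the file differs.
    return fields.get("DeviceFileMinor") == device_minor(device_path)
```

With the paths from this post, the call would be `ci_matches_device("/proc/driver/nvidia/capabilities/gpu0/mig/gi5/ci0/access", "/dev/nvidia-caps/nvidia-cap49")`.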

The problem seems to be that the parent directory of the device, /dev/nvidia-caps, is too restrictive. It has 700 permissions, so even though /dev/nvidia-caps/nvidia-cap49 itself is readable by unprivileged users, it’s not visible to them, and things don’t work.
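To spot this kind of problem without reading ls output by eye, a small Python check can walk the parent directories of a device node and report any that lack the search (execute) bit for "other" users. It is a simplified sketch: it looks only at the world permission bits and ignores ownership, group membership, and ACLs.

```python
import os
import stat
from typing import List


def ancestors(path: str) -> List[str]:
    """Return every parent directory of `path`, from the root downwards."""
    parents = []
    current = os.path.dirname(os.path.abspath(path))
    while True:
        parents.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return list(reversed(parents))


def blocking_directories(path: str) -> List[str]:
    """List parent directories that lack the world search (execute) bit."""
    blocked = []
    for directory in ancestors(path):
        try:
            mode = os.stat(directory).st_mode
        except PermissionError:
            break  # a directory higher up already hides everything below it
        if not mode & stat.S_IXOTH:
            blocked.append(directory)
    return blocked
```

On a system in the state described here, `blocking_directories("/dev/nvidia-caps/nvidia-cap49")` would be expected to include `/dev/nvidia-caps`.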

Manually opening permissions on /dev/nvidia-caps makes the CI visible to unprivileged users running nvidia-smi -L.

I did find what seem to be inconsistencies in the documentation, which threw me. The 470.256.02 driver does not appear to have the nv_cap_enable_devfs parameter described in the documentation, but it is in fact using the devfs permissions model. I’m still not sure how to control the permissions of /dev/nvidia-caps, or what the implications of opening it up are, but my immediate question is answered.