CUDA errors when permissions on /dev/nvidia* are not 666

Hi,
I am managing a GPU cluster. Each node has 8 K20 GPUs, and nodes are shared among multiple users, with Torque as the resource manager. It is important for us to make sure that one user cannot access someone else's GPU.

We originally planned to set the permissions on /dev/nvidiaX to 600, owned by the user/group that needs access to it. However, this does not work: I get CUDA errors unless all devices are set to 666.

For example, while another user is running on the compute node, we have permissions set as follows:
[mboisson@gpu-k20-02 release]$ ls -lh /dev/nvidia*
crw-rw---- 1 root root 246, 0 Jun 20 10:43 /dev/nvidia-uvm
crw-rw---- 1 sergeyev root 195, 0 Jun 19 11:38 /dev/nvidia0
crw-rw---- 1 sergeyev root 195, 1 Jun 19 11:38 /dev/nvidia1
crw-rw---- 1 sergeyev root 195, 2 Jun 19 11:38 /dev/nvidia2
crw-rw---- 1 sergeyev root 195, 3 Jun 19 11:38 /dev/nvidia3
crw-rw---- 1 mboisson root 195, 4 Jun 19 11:38 /dev/nvidia4
crw-rw---- 1 mboisson root 195, 5 Jun 19 11:38 /dev/nvidia5
crw-rw---- 1 mboisson root 195, 6 Jun 19 11:38 /dev/nvidia6
crw-rw---- 1 mboisson root 195, 7 Jun 19 11:38 /dev/nvidia7
crw-rw---- 1 root root 195, 8 Jun 19 11:38 /dev/nvidia8
crw-rw---- 1 root root 195, 9 Jun 19 11:38 /dev/nvidia9
crw-rw-rw- 1 root root 195, 255 Jun 19 11:38 /dev/nvidiactl

User mboisson has
[mboisson@gpu-k20-02 release]$ env | grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=4,5,6,7

Yet running any CUDA application ends with an error. For example, the deviceQuery sample gives:
[mboisson@gpu-k20-02 release]$ ./deviceQuery
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
→ no CUDA-capable device is detected
Result = FAIL

Why is that not working?

If you’re using Torque/Moab, I think Moab will handle this for you. Contact Adaptive.

If you’re using plain Torque, I think the usual approach:

[url]http://gehrcke.de/2013/09/setting-up-torque-pbs-for-gpu-job-scheduling/[/url]

is to set CUDA_VISIBLE_DEVICES in a job preamble script (based on the requested GPUs and those assigned by Torque), and assume that there are no rogue users who will unset/reset this environment variable.
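
A minimal sketch of such a preamble, assuming Torque publishes the assigned GPUs through $PBS_GPUFILE with one "hostname-gpuN" entry per allocated GPU (verify the exact format on your installation):

#!/bin/bash
# Build CUDA_VISIBLE_DEVICES from the GPUs Torque assigned to this job.
# Assumes $PBS_GPUFILE contains lines such as "gpu-k20-02-gpu4".
if [ -n "$PBS_GPUFILE" ] && [ -r "$PBS_GPUFILE" ]; then
    # Keep only the numeric index after "-gpu", sort, and join with commas.
    gpus=$(sed 's/.*-gpu//' "$PBS_GPUFILE" | sort -n | paste -sd, -)
    export CUDA_VISIBLE_DEVICES="$gpus"
fi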

Is neither of these applicable to your case?

I have seen the permissions method work “successfully” at a university where I consulted, but there were other somewhat undesirable side effects that I don’t recall at the moment; refreshing my memory would take substantial effort. The device permissions had to be modified at each job launch.
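
From what I remember, it boiled down to a prologue roughly like this (a sketch only; the prologue argument order and the GPU index list are assumptions to check against your Torque version):

#!/bin/bash
# Hypothetical Torque prologue: give the job owner exclusive access to the
# GPUs assigned to this job. $2 is the user name in Torque's prologue
# argument convention; the index list "4 5 6 7" is purely illustrative and
# would come from the scheduler's assignment.
JOBUSER="$2"
for idx in 4 5 6 7; do
    chown "$JOBUSER":root "/dev/nvidia$idx"
    chmod 600 "/dev/nvidia$idx"
done
# The control node (and /dev/nvidia-uvm on recent drivers) must stay
# world-accessible, as in the listing above.
chmod 666 /dev/nvidiactl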

In your case, for the specific example shown, what would happen if you set CUDA_VISIBLE_DEVICES to “0-3” instead of “4-7”?

Hi,
Torque does set CUDA_VISIBLE_DEVICES, but the user can modify it. Assuming that no rogue user will change it, or simply ignore it in some creative way, is not a satisfying assumption.

This is like assuming that jobs will not take more memory than requested, or that they will not request 1 core and use 10. This can be and is constrained using cpusets and cgroups.
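
In principle the same idea could extend to the GPU device nodes with the cgroup v1 devices controller; a rough sketch, where the cgroup name "job123", the minor numbers, and $JOB_PID are purely illustrative:

# Deny all NVIDIA character devices (major 195) to the job's cgroup, then
# whitelist only the minors assigned to the job, plus nvidiactl (195:255)
# and nvidia-uvm (246:0 in the listing above).
mkdir -p /sys/fs/cgroup/devices/job123
echo 'c 195:* rwm'   > /sys/fs/cgroup/devices/job123/devices.deny
echo 'c 195:4 rwm'   > /sys/fs/cgroup/devices/job123/devices.allow
echo 'c 195:255 rwm' > /sys/fs/cgroup/devices/job123/devices.allow
echo 'c 246:0 rwm'   > /sys/fs/cgroup/devices/job123/devices.allow
echo "$JOB_PID"      > /sys/fs/cgroup/devices/job123/tasks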

In the specific example, if I set CUDA_VISIBLE_DEVICES to 0-3, it would (and should) fail.

Maxime

Oh… actually, I just tested, and it DOES work.

So I guess the catch is that, with permissions restricted like that, the CUDA runtime seems to enumerate only the devices the user can actually open and renumbers them from 0, so Torque always needs to set CUDA_VISIBLE_DEVICES starting at 0.

Interesting. I will see if I can make this work.
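
If it helps anyone else: since only the accessible /dev/nvidiaX nodes are enumerated, and they show up renumbered from 0, the prologue just needs to export a 0-based list whose length matches the number of GPUs granted. Something like (illustrative only):

# With /dev/nvidia4..7 owned by the user and the rest at 600, the CUDA
# runtime sees 4 devices numbered 0..3.
NGPUS=4
export CUDA_VISIBLE_DEVICES=$(seq -s, 0 $((NGPUS - 1)))
echo "$CUDA_VISIBLE_DEVICES"    # -> 0,1,2,3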

Maxime

FWIW, the CUDA Toolkit Getting Started guide for Linux suggests having 0666 permissions on the device nodes (see the CUDA Toolkit Documentation).

That being said, it’s understandable that you want to limit access to certain users/groups, and perhaps you can find a workaround.

sudo apt-get install nvidia-modprobe

will fix the issue :)
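
For context, nvidia-modprobe is NVIDIA’s setuid helper that creates the /dev/nvidia* nodes (including /dev/nvidia-uvm, which is root-only 660 in the listing above) when a non-root user starts a CUDA program. A quick sanity check after installing it:

ls -l /dev/nvidia-uvm /dev/nvidiactl /dev/nvidia*    # nodes exist with usable modes
./deviceQuery                                        # should now enumerate the GPUs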