CUDA errors when permissions on /dev/nvidia* are not 666

Hi,
I am managing a GPU cluster. Each node has 8 K20 GPUs, and nodes are shared among multiple users, with Torque as the resource manager. It is important for us to make sure that one user cannot access someone else's GPU.

We originally planned to set the permissions on /dev/nvidiaX to 600, owned by the user/group that needs access to it. However, this does not work: I get CUDA errors unless all devices are set to 666.

For example, while another user is running on the compute node, we have permissions set as follows:
[mboisson@gpu-k20-02 release]$ ls -lh /dev/nvidia*
crw-rw---- 1 root root 246, 0 Jun 20 10:43 /dev/nvidia-uvm
crw-rw---- 1 sergeyev root 195, 0 Jun 19 11:38 /dev/nvidia0
crw-rw---- 1 sergeyev root 195, 1 Jun 19 11:38 /dev/nvidia1
crw-rw---- 1 sergeyev root 195, 2 Jun 19 11:38 /dev/nvidia2
crw-rw---- 1 sergeyev root 195, 3 Jun 19 11:38 /dev/nvidia3
crw-rw---- 1 mboisson root 195, 4 Jun 19 11:38 /dev/nvidia4
crw-rw---- 1 mboisson root 195, 5 Jun 19 11:38 /dev/nvidia5
crw-rw---- 1 mboisson root 195, 6 Jun 19 11:38 /dev/nvidia6
crw-rw---- 1 mboisson root 195, 7 Jun 19 11:38 /dev/nvidia7
crw-rw---- 1 root root 195, 8 Jun 19 11:38 /dev/nvidia8
crw-rw---- 1 root root 195, 9 Jun 19 11:38 /dev/nvidia9
crw-rw-rw- 1 root root 195, 255 Jun 19 11:38 /dev/nvidiactl

User mboisson has
[mboisson@gpu-k20-02 release]$ env | grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=4,5,6,7

Yet running any CUDA application ends with an error. For example, the deviceQuery sample gives:
[mboisson@gpu-k20-02 release]$ ./deviceQuery
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
→ no CUDA-capable device is detected
Result = FAIL

Why is that not working?

If you’re using Torque/Moab, I think Moab will handle this for you. Contact Adaptive.

If you’re using plain Torque, I think the usual approach:

[url]http://gehrcke.de/2013/09/setting-up-torque-pbs-for-gpu-job-scheduling/[/url]

is to set CUDA_VISIBLE_DEVICES in a job preamble script (based on the requested GPUs and those assigned by Torque), and assume that there are no rogue users who will unset/reset this environment variable.
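
A minimal sketch of such a preamble, assuming Torque publishes the assigned GPUs through $PBS_GPUFILE with one "hostname-gpuN" entry per allocated GPU (verify the exact format on your installation):

#!/bin/bash
# Build CUDA_VISIBLE_DEVICES from the GPUs Torque assigned to this job.
# Assumes $PBS_GPUFILE contains lines such as "gpu-k20-02-gpu4".
if [ -n "$PBS_GPUFILE" ] && [ -r "$PBS_GPUFILE" ]; then
    # Keep only the numeric index after "-gpu", sort, and join with commas.
    gpus=$(sed 's/.*-gpu//' "$PBS_GPUFILE" | sort -n | paste -sd, -)
    export CUDA_VISIBLE_DEVICES="$gpus"
fi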

Is neither of these applicable to your case?

I have seen the permissions method work “successfully” at a university where I consulted, but there were other somewhat undesirable side effects that I don’t recall at the moment; refreshing my memory would take substantial effort. The device permissions had to be modified at each job launch.
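
From what I remember, it boiled down to a prologue roughly like this (a sketch only; the prologue argument order and the GPU index list are assumptions to check against your Torque version):

#!/bin/bash
# Hypothetical Torque prologue: give the job owner exclusive access to the
# GPUs assigned to this job. $2 is the user name in Torque's prologue
# argument convention; the index list "4 5 6 7" is purely illustrative and
# would come from the scheduler's assignment.
JOBUSER="$2"
for idx in 4 5 6 7; do
    chown "$JOBUSER":root "/dev/nvidia$idx"
    chmod 600 "/dev/nvidia$idx"
done
# The control node (and /dev/nvidia-uvm on recent drivers) must stay
# world-accessible, as in the listing above.
chmod 666 /dev/nvidiactl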

In your case, for the specific example shown, what would happen if you set CUDA_VISIBLE_DEVICES to “0-3” instead of “4-7”?

Hi,
Torque does set CUDA_VISIBLE_DEVICES, but the user can modify it. Assuming that no rogue user will change it, or simply ignore it in some creative way, is not a satisfying assumption.

This is like assuming that jobs will not take more memory than requested, or that they will not request 1 core and use 10. This can be and is constrained using cpusets and cgroups.
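
In principle the same idea could extend to the GPU device nodes with the cgroup v1 devices controller; a rough sketch, where the cgroup name "job123", the minor numbers, and $JOB_PID are purely illustrative:

# Deny all NVIDIA character devices (major 195) to the job's cgroup, then
# whitelist only the minors assigned to the job, plus nvidiactl (195:255)
# and nvidia-uvm (246:0 in the listing above).
mkdir -p /sys/fs/cgroup/devices/job123
echo 'c 195:* rwm'   > /sys/fs/cgroup/devices/job123/devices.deny
echo 'c 195:4 rwm'   > /sys/fs/cgroup/devices/job123/devices.allow
echo 'c 195:255 rwm' > /sys/fs/cgroup/devices/job123/devices.allow
echo 'c 246:0 rwm'   > /sys/fs/cgroup/devices/job123/devices.allow
echo "$JOB_PID"      > /sys/fs/cgroup/devices/job123/tasks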

In the specific example, if I set CUDA_VISIBLE_DEVICES to 0-3, it would (and should) fail.

Maxime

Oh… actually, I just tested, and it DOES work.

So I guess the catch is that, with permissions restricted like that, the CUDA runtime seems to enumerate only the devices the user can actually open and renumbers them from 0, so Torque always needs to set CUDA_VISIBLE_DEVICES starting at 0.

Interesting. I will see if I can make this work.
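
If it helps anyone else: since only the accessible /dev/nvidiaX nodes are enumerated, and they show up renumbered from 0, the prologue just needs to export a 0-based list whose length matches the number of GPUs granted. Something like (illustrative only):

# With /dev/nvidia4..7 owned by the user and the rest at 600, the CUDA
# runtime sees 4 devices numbered 0..3.
NGPUS=4
export CUDA_VISIBLE_DEVICES=$(seq -s, 0 $((NGPUS - 1)))
echo "$CUDA_VISIBLE_DEVICES"    # -> 0,1,2,3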

Maxime

FWIW, the CUDA Toolkit Getting Started guide for Linux suggests having 0666 permissions on the device nodes (see the CUDA Toolkit Documentation).

That being said, it’s understandable that you want to limit access to certain users/groups, and perhaps you can find a workaround.

sudo apt-get install nvidia-modprobe

will fix the issue :)
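
For context, nvidia-modprobe is NVIDIA’s setuid helper that creates the /dev/nvidia* nodes (including /dev/nvidia-uvm, which is root-only 660 in the listing above) when a non-root user starts a CUDA program. A quick sanity check after installing it:

ls -l /dev/nvidia-uvm /dev/nvidiactl /dev/nvidia*    # nodes exist with usable modes
./deviceQuery                                        # should now enumerate the GPUs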