While using exclusive compute mode (Tesla S1070) with multiple users in batch mode, I want to make sure that a user can only use the device he has requested (through SGE). That means that if we assign him device 1 (via SGE; the normal nvidia-smi scheduling is not an option at the moment because it does not work with OpenCL), we would like to “forbid” him from calling cudaSetDevice(0), because otherwise his own job AND the other jobs running on device 0 will be aborted.
We thought we could achieve this by granting or revoking permissions on the NVIDIA device files (/dev/nvidia*, /dev/nvidiactl) for certain groups, so that only his own job would be aborted and not the other user’s (as he cannot access that device).
However, when I remove the read/write permissions for other users from nvidia device 0 and run a small program in which the user calls cudaSetDevice(1), it does not work: “NVIDIA: could not open the device file /dev/nvidia0 (Permission denied).” The error occurs while setting the device to 1, and the error string says “invalid argument” (code 11). Somehow CUDA seems to need permissions on all GPUs. Why? And more importantly: is there any way to solve this problem?
BTW: I found the same (unsolved) problem here: [topic=“74985”]2 CUDA devices - multiple user setup[/topic]. But as that was a long time ago, I hope there have been some improvements since then.