Disable CUDA scheduler when having multiple CUDA-capable devices

Hi Users, hi Developers,

I’m having trouble on a CentOS machine with multiple users and multiple CUDA devices (in my case, four C2050s).

What I want to do is dynamically assign CUDA devices to users by changing the file permissions of the /dev/nvidia[0-3] device files.
The problem is that CUDA's device numbering diverges from the device file names.

For instance, if a user is given access only to /dev/nvidia2, that user still has to submit the job with -device=0, since NVIDIA's kernel module internally
knows that the user owns just one device. -device=2 would fail, reporting that there is only one CUDA device (from that user's point of view).
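To illustrate what I mean, here is a minimal sketch (CUDA runtime API, built with nvcc) of the kind of check I run; it only ever sees the devices whose /dev/nvidia* files the process can open, renumbered from 0:

```
// check_devices.cu -- minimal sketch: what a restricted user sees.
// Build with: nvcc -o check_devices check_devices.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // With permission only on /dev/nvidia2, count comes back as 1 and that
    // card is addressed as device 0; selecting device 2 returns an error.
    std::printf("visible CUDA devices: %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("  device %d: %s\n", i, prop.name);
    }
    return 0;
}
```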

This behaviour leads to race conditions: when one user's job stops (and that user's permissions get removed) while another job is started independently, the device numbers interfere with each other.

So my final question is:

Is it possible to disable this dynamic device addressing completely?
I just want a static binding of -device=X to /dev/nvidiaX.

Any other suggestions are also welcome.

Thanks in advance,
Gabriel

You might want to take a look at cudawrapper from NCSA. It is an LD_PRELOAD library that can dynamically assign GPUs to jobs in a batch queue system such that every user only needs to request device 0. The SourceForge page is tremendously confusing with all the clutter, but here’s a link directly to the admin README:

You have to generate a mapping yourself with nvidia-smi and the CUDA API. From that point on, you can reorder devices manually using the CUDA_VISIBLE_DEVICES environment variable.
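For example, a small sketch along these lines (CUDA runtime API, built with nvcc) prints each runtime device index together with its PCI location, which you can match against the PCI bus ID that nvidia-smi -q reports for the card behind each /dev/nvidiaX file:

```
// map_devices.cu -- sketch: list CUDA runtime device indices with their
// PCI locations so they can be matched against nvidia-smi output and
// the /dev/nvidia[0-3] device files.
// Build with: nvcc -o map_devices map_devices.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::printf("no CUDA devices visible\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // bus:device matches the PCI location reported by nvidia-smi
        std::printf("CUDA device %d -> PCI %02x:%02x (%s)\n",
                    i, prop.pciBusID, prop.pciDeviceID, prop.name);
    }
    return 0;
}
```

Once the mapping is known, launching a job with e.g. CUDA_VISIBLE_DEVICES=2 ./my_job exposes only the card that the unrestricted enumeration calls device 2, and inside the job it appears as device 0.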

Thank you very very much.

That was almost exactly the solution I was looking for.

So far it works perfectly for me,

Gabriel