Administering a CUDA device on a multi-user machine

I’ve set up the NVIDIA CUDA SDK on a RHEL 5.1 box, the appropriate display driver is installed (NVIDIA-Linux-x86-169.12-pkg1.run), and the example directory builds successfully. The machine boots into runlevel 3.

The machine is used by a few different users, and access to the graphics card seems erratic. For example, I sometimes get the error:

[nmoore@buff release]$ pwd
/home/nmoore/NVIDIA_CUDA_SDK/bin/linux/release

[nmoore@buff release]$ ./MonteCarlo
NVIDIA: could not open the device file /dev/nvidiactl (Permission denied).
There is no device supporting CUDA.

This error sometimes goes away when I restart the machine, but there doesn’t seem to be any regularity to this behavior. Are there CUDA system/admin tools that I should learn about?
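In case it’s relevant, this is what I check when the error shows up (standard shell commands, nothing CUDA-specific):

  # Is the nvidia kernel module loaded?
  /sbin/lsmod | grep nvidia

  # Do the device entries exist, and what are their permissions?
  ls -l /dev/nvidia*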

I’ve been wondering whether booting into runlevel 3 causes problems implicitly. I actually have two graphics cards installed in the system (the other is an integrated motherboard card), but configuring both to work simultaneously (one for CUDA computation, the second for rendering an X11 window) seemed like more of a challenge than I wanted to tackle.

Any comments or pointers would be appreciated!

OK, so I read the two READMEs and found something new. One of them suggests the following init script:

o In order to run CUDA applications, the CUDA module must be
  loaded and the entries in /dev created. This may be achieved
  by initializing X Windows, or by creating a script to load the
  kernel module and create the entries.

  An example script (to be run at boot time):

  #!/bin/bash

  modprobe nvidia

  if [ "$?" -eq 0 ]; then
    # Count the number of NVIDIA controllers found.
    N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
    NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`
    N=`expr $N3D + $NVGA - 1`

    for i in `seq 0 $N`; do
      mknod -m 666 /dev/nvidia$i c 195 $i
    done

    mknod -m 666 /dev/nvidiactl c 195 255
  else
    exit 1
  fi

Should this be in /etc/bashrc?

I’m still confused about the proper runlevel.

The /dev/nvidia* entries get created either when X starts, or they have to be created manually (using the script you found, or via some other mechanism of your choosing). If you’re not booting into runlevel 5, X is not going to start, and you’ll need to ensure that the /dev/nvidia* entries are created from an init script or from /etc/rc.local. You definitely cannot do this from /etc/bashrc, as non-root users cannot create these entries.
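One way to wire that up, as a sketch (the script path here is arbitrary, pick whatever fits your layout):

  # Save the device-node script from the README somewhere root-owned,
  # make it executable, and call it from /etc/rc.local so it runs at boot.
  chmod 755 /usr/local/sbin/cuda-devices.sh
  echo "/usr/local/sbin/cuda-devices.sh" >> /etc/rc.local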

Fantastic!

It works now, thanks very much.

Do I have to do anything special on a multi-user machine to have the card de-allocated and re-allocated by subsequent users?

I’m interested in this as well. If you’ve got multiple Tesla GPUs in a single machine and users don’t explicitly select a device with cudaSetDevice, will they all end up time-sharing device 0? Is there a way from the system side to guarantee exclusive access to a GPU, or will I need to do something custom on my end to tell users which GPU they should use to get “dedicated” access?

This was ages ago, but just in case you’re still listening, check out “nvidia-smi --help” from the command line.

I don’t remember when it was added (CUDA 2.3?), but I did notice the new compute-mode-rules option, and I now set all of our GPUs to compute-exclusive mode at boot time.
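In case it’s useful, this is roughly what that boot-time step looks like on our machines, as a sketch only; the exact flags differ between nvidia-smi versions (newer builds take -i and named compute modes, older ones used -g and numeric modes), so check “nvidia-smi --help” on your driver first:

  #!/bin/bash
  # Put every NVIDIA GPU into exclusive compute mode so that only one
  # process can hold a context on a device at a time. Run as root at boot.
  NGPUS=`nvidia-smi -L | wc -l`
  for i in `seq 0 $((NGPUS - 1))`; do
    # Older nvidia-smi builds: "nvidia-smi -g $i -c 1" instead.
    nvidia-smi -i $i -c EXCLUSIVE_PROCESS
  done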