I just ran a simple test, and it seems to be working for me. I have a RHEL 6.2 node with CUDA 7.0 that has 3 GPUs in it:
$ nvidia-smi
Fri Oct 23 05:58:41 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla C2075         Off  | 0000:03:00.0     Off |                    0 |
| 30%   51C    P0     0W / 225W |      9MiB /  5375MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  NVS 310             Off  | 0000:04:00.0     N/A |                  N/A |
| 30%   42C    P0    N/A /  N/A |      3MiB /   511MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40c          Off  | 0000:82:00.0     Off |                    0 |
| 23%   38C    P0    65W / 235W |     23MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1            C     Not Supported                                         |
+-----------------------------------------------------------------------------+
$
Note that the only cc3.5+ GPU here is the K40c, that all GPUs are currently in Default compute mode, and that the K40c is enumerated by nvidia-smi as device 2. If I were to check the enumeration under the CUDA runtime (for example, by running deviceQuery), it would be enumerated as device 0. This distinction is important in the following discussion.
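If you want to confirm the mapping between the two enumeration orders on your own machine, one way (a sketch; it assumes the deviceQuery sample has been built in the default samples location) is to compare the PCI bus IDs reported by each tool:
# nvidia-smi order (PCI) vs. CUDA runtime order: match up the bus IDs
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv
/usr/local/cuda-7.0/samples/1_Utilities/deviceQuery/deviceQuery | grep -iE "^Device|bus"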
Following the MPS instructions here:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
for the use case covered in section 5.1.1 (multi-user setup), I created the following scripts:
start_as_root.bash:
#!/bin/bash
# the following must be performed with root privilege
# expose only CUDA device 0 (the K40c, in CUDA enumeration order) to MPS
export CUDA_VISIBLE_DEVICES="0"
# set the K40c (device 2 in nvidia-smi enumeration order) to EXCLUSIVE_PROCESS compute mode
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
# start the MPS control daemon in daemon mode
nvidia-cuda-mps-control -d
Note that in the above script I am restricting the visible CUDA devices to device 0 (corresponding to the CUDA enumeration order), but the device I select when modifying the compute mode is device 2 (corresponding to the nvidia-smi enumeration order).
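If you would rather not hard-code the two different indices, a possible variation (just a sketch, using the nvidia-smi query options to look the index up by name; "K40c" is specific to my node) is:
# hypothetical helper: find the nvidia-smi index of the K40c by name,
# so the compute-mode change does not rely on a hard-coded index
SMI_ID=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | awk -F', ' '/K40c/ {print $1}')
nvidia-smi -i "$SMI_ID" -c EXCLUSIVE_PROCESS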
stop_as_root.bash:
#!/bin/bash
# shut down the MPS control daemon (and any MPS server it spawned)
echo quit | nvidia-cuda-mps-control
# return the K40c (nvidia-smi device 2) to DEFAULT compute mode
nvidia-smi -i 2 -c DEFAULT
The above two scripts are used to start and stop the MPS control daemon (and to set and restore the compute mode of the K40c).
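To double-check that the control daemon actually came up after running start_as_root.bash, something like the following should work (a sketch; /var/log/nvidia-mps is the default log directory per the MPS documentation, and will differ if CUDA_MPS_LOG_DIRECTORY has been changed):
# the control daemon should appear in the process list, and its log should exist
ps -ef | grep "[n]vidia-cuda-mps"
cat /var/log/nvidia-mps/control.log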
I also have a script to run the test.
test.bash:
#!/bin/bash
# launch 2 MPI ranks of the simpleMPI sample using the system Open MPI
/usr/lib64/openmpi/bin/mpirun -n 2 simpleMPI/simpleMPI
When I run the following sequence, everything seems to work correctly:
$ su
Password:
# ./start_as_root.bash
Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:82:00.0.
All done.
# exit
exit
$ ./test.bash
Running on 2 nodes
Average of square roots is: 0.667279
PASSED
$ su
Password:
# ./stop_as_root.bash
Set compute mode to DEFAULT for GPU 0000:82:00.0.
All done.
# exit
exit
$
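If you run nvidia-smi from another shell while test.bash is executing, you should see a single nvidia-cuda-mps-server process holding the GPU rather than the two MPI ranks directly; a quick way to check (sketch) is:
# list compute processes currently using the GPUs
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv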
As a proof point, observe what happens if I run the test script with the GPU set to EXCLUSIVE_PROCESS compute mode but the MPS daemon not running:
$ su
Password:
# nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:82:00.0.
All done.
# exit
exit
$ ./test.bash
Running on 2 nodes
CUDA error calling "cudaMalloc((void **)&deviceInputData, dataSize * sizeof(float))", code is 10
Test FAILED
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 10.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
$
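This failure is expected: with the daemon stopped there is no MPS server for the ranks to attach to, so each MPI rank tries to create its own context on the EXCLUSIVE_PROCESS device and the second one is rejected. You can confirm the state the test ran in with something like the following (sketch):
ps -ef | grep "[n]vidia-cuda-mps"   # should print nothing: no daemon or server running
nvidia-smi -q -d COMPUTE -i 2       # should report Compute Mode : Exclusive_Process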
Other notes:
- The enumeration order of nvidia-smi does not depend on the CUDA runtime, and follows the PCI enumeration order. The enumeration order of the CUDA runtime follows a heuristic that generally tries to order the “most powerful” GPU first.
- On my node I was using Open MPI, as that is conveniently installed as part of the RHEL 6.2 distribution. I copied the contents of the
/usr/local/cuda-7.0/samples/0_Simple/simpleMPI
directory to a local directory, then built the code with the following command:
nvcc -o simpleMPI -I/usr/include/openmpi-x86_64 -I/usr/local/cuda/samples/common/inc -L/usr/lib64/openmpi/lib -lmpi_cxx simpleMPI.cpp simpleMPI.cu
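An equivalent way to build it, if you prefer to let the MPI compiler wrapper supply the MPI include and library flags, is sketched below (it assumes mpicxx is in the usual RHEL Open MPI location and that linking against the dynamic cudart in /usr/local/cuda/lib64 is acceptable):
# compile the host MPI code with the Open MPI wrapper
/usr/lib64/openmpi/bin/mpicxx -c simpleMPI.cpp -o simpleMPI_mpi.o -I/usr/local/cuda/samples/common/inc
# compile the CUDA code with nvcc
nvcc -c simpleMPI.cu -o simpleMPI_cuda.o -I/usr/local/cuda/samples/common/inc
# link with the MPI wrapper, adding the CUDA runtime library explicitly
/usr/lib64/openmpi/bin/mpicxx simpleMPI_mpi.o simpleMPI_cuda.o -o simpleMPI -L/usr/local/cuda/lib64 -lcudart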