I have developed some mixed CUDA/MPI code which I hope to run on a cluster of S1070s, but I have a problem with either MPI or CUDA, and I don’t know which!
On our S1070, devices 0, 2, 3 and 4 are C1060s. When I try to run my code over the four devices, device 3 reports 0% free memory when queried with cuMemGetInfo, then allocates some variables but quickly runs out of device memory. The other three devices all report 99% free memory before allocation, allocate all variables without error, and report 42% free memory after allocation.
I also have access to a cluster of S1070s at a government research lab and have been trying to get the same code working there too, with the aim of eventually running it over the whole cluster. Here’s the strange thing: exactly the same problem occurs on their S1070s, but on theirs devices 0, 1, 2 and 3 are the C1060s (on ours it’s devices 0, 2, 3 and 4), yet the memory error always occurs on device 3 — device 3 on our S1070 and device 3 on theirs.
Any suggestions as to what is occurring and how to fix it would be welcome, because this is very frustrating.
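For reference, the sort of per-device query I mean looks roughly like this (a minimal driver API sketch with error checking omitted; the real code does more, and on older toolkits cuMemGetInfo takes unsigned int pointers rather than size_t):

#include <stdio.h>
#include <cuda.h>

/* List free/total memory for every device visible to the driver API. */
int main(void)
{
    int count = 0;

    cuInit(0);
    cuDeviceGetCount(&count);

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        CUcontext ctx;
        char name[256];
        size_t free_mem = 0, total_mem = 0;

        cuDeviceGet(&dev, i);
        cuDeviceGetName(name, sizeof(name), dev);

        /* cuMemGetInfo reports on the device owning the current context */
        cuCtxCreate(&ctx, 0, dev);
        cuMemGetInfo(&free_mem, &total_mem);
        cuCtxDestroy(ctx);

        printf("device %d (%s): %.1f%% free (%zu of %zu bytes)\n",
               i, name, 100.0 * (double)free_mem / (double)total_mem,
               free_mem, total_mem);
    }
    return 0;
}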
As I suggested in the other thread you posted on this subject, get a sysadmin to set up nvidia-smi to keep all the GPUs you are using in compute-exclusive mode. This will ensure that you only get a single MPI process per GPU, and if you have messed up the process-to-GPU affinity (which sounds likely), it will make your program fail and give you some clues about where things are going wrong.
Compute-exclusive mode is a driver setting which ensures that no more than one CUDA process can be allocated to a given GPU under its control. If you try to run two processes on the same GPU, you will get a “no device available” style error.
As for nvidia-smi, just about everything is covered in this thread. It also prints out its own usage:
avid@cuda:~$ nvidia-smi -h
nvidia-smi [OPTION1] [OPTION2 ARG] ...
NVIDIA System Management Interface program for Tesla S870
-h,   --help                       Show usage and exit
-x,   --xml-format                 Produce XML log (to stdout by default, unless
                                   a file is specified with -f or --filename=FILE)
-l,   --loop-continuously          Probe continuously, clobbers old logfile if not
                                   printing to stdout
-t NUM, --toggle-led=NUM           Toggle LED state for Unit <NUM>
-i SEC, --interval=SEC             Probe once every <SEC> seconds if the -l option
                                   is selected (default and minimum: 1 second)
-f FILE, --filename=FILE           Specify log file name
--gpu=GPUID --compute-mode-rules=RULESET
                                   Set rules for compute programs,
                                   where GPUID is the number of the GPU (starting
                                   at zero) in the system and RULESET is one of:
                                     0: Normal mode
                                     1: Compute-exclusive mode (only one compute
                                        program per GPU allowed)
                                     2: Compute-prohibited mode (no compute
                                        programs may run on this GPU)
-g GPUID -c RULESET                (short form of the previous command)
--gpu=GPUID --show-compute-mode-rules
-g GPUID -s                        (short form of the previous command)
-L,   --list-gpus
-lsa, --list-standalone-gpus-also  Also list standalone GPUs in the system along
                                   with their temperatures. Can be used with the
                                   -l, --loop-continuously option
-lso, --list-standalone-gpus-only  Only list standalone GPUs in the system along
                                   with their temperatures. Can be used with the
                                   -l, --loop-continuously option
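So, for example, to put GPU 3 into compute-exclusive mode and then check its rule set (run as root, substitute whatever GPU numbers the driver actually reports on your system, and repeat for each GPU your MPI job uses):

nvidia-smi -g 3 -c 1
nvidia-smi -g 3 -s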
That should be enough to get you going.
It really sounds like you don’t have a handle on the process-number/CPU/GPU affinity, and something is going wrong there. Have a look at the link I posted in your other thread about using MPI_Comm_split() and colours to explicitly control the process-GPU affinity; there is a rough sketch of the idea below.
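The gist of it is something like the following (just a rough sketch using the runtime API; the hostname-hash colour, the device table and the assumption of at most four ranks per node are all things you would adapt to your own job layout):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, local_rank, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(name, &namelen);

    /* Derive a colour from the hostname so that all ranks on the same node
       end up in the same sub-communicator (a simple hash; a production code
       would guard against collisions between different hostnames). */
    unsigned int colour = 5381;
    for (int i = 0; i < namelen; ++i)
        colour = colour * 33u + (unsigned char)name[i];
    colour &= 0x7fffffffu;  /* the colour passed to MPI must be non-negative */

    MPI_Comm_split(MPI_COMM_WORLD, (int)colour, world_rank, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* Map the local rank onto the usable GPUs explicitly rather than trusting
       whatever order the processes happen to start in. Adjust the table to
       your system's enumeration (0,2,3,4 on your machine, 0,1,2,3 at the lab). */
    int usable[] = { 0, 2, 3, 4 };
    cudaSetDevice(usable[local_rank % 4]);

    printf("world rank %d on %s -> local rank %d -> GPU %d\n",
           world_rank, name, local_rank, usable[local_rank % 4]);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

Run it with four ranks per node first and check that the printed mapping is what you expect before you let the real kernels loose.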
No idea, I am afraid. I haven’t used the driver API all that much.
I must admit that on our cluster we have a very simple 1:1 node-to-GPU relationship, and I have never really had much trouble getting MPI-CUDA hybrid codes working using pretty simple code and the runtime API. Glad to hear you got it working, though.