MPI causing trouble in memory allocation?

I have developed some mixed CUDA/MPI code which I hope to run on a cluster of S1070s, but I have a problem with either MPI or CUDA, and I don’t know which!

On our S1070, devices 0, 2, 3 and 4 are C1060s. When I try to run my code across those four devices, device 3 reports 0% free memory when queried with cuMemGetInfo, then allocates a few variables before quickly running out of device memory. The other three devices all report 99% free memory before allocation, allocate every variable without error, and report 42% free memory afterwards.
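For reference, the kind of per-device check I mean looks roughly like this (a simplified sketch, not my actual code, with error checking stripped out):

```c
/* Simplified sketch of a per-device free-memory query with the driver API.
 * Note: on older toolkits cuMemGetInfo takes unsigned int* rather than size_t*.
 * All error checking omitted for brevity. */
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    int ndev;
    cuInit(0);
    cuDeviceGetCount(&ndev);
    for (int i = 0; i < ndev; ++i) {
        CUdevice dev;
        CUcontext ctx;
        size_t free_b, total_b;
        cuDeviceGet(&dev, i);
        cuCtxCreate(&ctx, 0, dev);      /* a context is needed before memory queries */
        cuMemGetInfo(&free_b, &total_b);
        printf("device %d: %.0f%% free\n", i, 100.0 * free_b / total_b);
        cuCtxDestroy(ctx);
    }
    return 0;
}
```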

I also have access to a cluster of S1070s at a government research lab and have been trying to get the same code working there, with the aim of eventually running it over the whole cluster. Here's the strange thing: exactly the same problem occurs on their S1070s. On theirs, devices 0, 1, 2 and 3 are the C1060s (on ours it's devices 0, 2, 3 and 4), yet the memory error always, without fail, occurs on device 3: device 3 on our S1070 and device 3 on theirs.

Any suggestions as to what is occurring and how to fix it would be welcome, because this is very frustrating.

As I suggested in the other thread you posted on this subject, get a sysadmin to use nvidia-smi to keep all the GPUs you are using in compute-exclusive mode. This ensures that no more than one MPI process lands on each GPU, and if you have messed up the process-to-GPU affinity (which sounds likely), it will make your program fail and give you some clues about where things are going wrong.

I really appreciate you answering my queries on this CUDA/MPI stuff, but could you explain further:

  1. what is compute exclusive mode?

  2. if I am the only user of our S1070 in our dept (not the cluster), why is it that three devices appear to function OK but the fourth one does not?

  3. where can our sysadmin find info on how to use nvidia-smi?

Again, many thanks for trying to help me.

Compute-exclusive mode is a mode you can put the driver into so that no more than one CUDA process can be allocated to a given GPU under its control. If you try to run two processes on the same GPU, the second one will get a "no device available" style error.
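As a rough illustration (not your code, and the exact error code varies between driver versions), the second process to touch an exclusive-mode GPU will see its first context-creating call fail, which is easy to test for:

```c
/* Sketch: detecting that a GPU in compute-exclusive mode is already
 * occupied by another process. Assumes the runtime API; just test
 * for any error rather than a specific code. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaSetDevice(0);                  /* pick the GPU */
    cudaError_t err = cudaFree(0);     /* forces context creation on that GPU */
    if (err != cudaSuccess) {
        fprintf(stderr, "device 0 unavailable: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("got exclusive use of device 0\n");
    return 0;
}
```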

As for nvidia-smi, just about everything is covered in this thread. It also prints its own usage:

avid@cuda:~$ nvidia-smi -h
nvidia-smi [OPTION1] [OPTION2 ARG] ...
NVIDIA System Management Interface program for Tesla S870

	-h, --help                                  Show usage and exit
	-x, --xml-format                            Produce XML log (to stdout by default, unless
	                                            a file is specified with -f or --filename=FILE)
	-l, --loop-continuously                     Probe continuously, clobbers old logfile if not printing to stdout
	-t NUM, --toggle-led=NUM                    Toggle LED state for Unit <NUM>
	-i SEC, --interval=SEC                      Probe once every <SEC> seconds if the -l option
	                                            is selected (default and minimum: 1 second)
	-f FILE, --filename=FILE                    Specify log file name
	--gpu=GPUID --compute-mode-rules=RULESET    Set rules for compute programs
	                                            where GPUID is the number of the GPU (starting at zero) in the system
	                                            and RULESET is one of:
	                                            0: Normal mode
	                                            1: Compute-exclusive mode (only one compute program per GPU allowed)
	                                            2: Compute-prohibited mode (no compute programs may run on this GPU)
	-g GPUID -c RULESET                         (short form of the previous command)
	--gpu=GPUID --show-compute-mode-rules
	-g GPUID -s                                 (short form of the previous command)
	-L, --list-gpus
	-lsa, --list-standalone-gpus-also           Also list standalone GPUs in the system along with their temperatures.
	                                            Can be used with the -l, --loop-continuously option
	-lso, --list-standalone-gpus-only           Only list standalone GPUs in the system along with their temperatures.
	                                            Can be used with the -l, --loop-continuously option

That should be enough to get you going.
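For example, going by the usage above, putting the first GPU into compute-exclusive mode would be something along the lines of nvidia-smi -g 0 -c 1, repeated for each GPU ID you want to lock down, and nvidia-smi -g 0 -s should show the rule currently in force for that GPU. Your sysadmin will typically need to run the mode-setting commands as root.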

It really sounds like you don't have a handle on the process-to-CPU-to-GPU affinity and something is going wrong there. Have a look at the link I posted in your other thread about using MPI_Comm_split() and colours to explicitly control the process-to-GPU mapping.
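Roughly, the idea looks like this (a sketch only, not the code from that thread; the host-name hash used as the colour is just for illustration and could in principle collide):

```c
/* Hypothetical sketch of pinning one MPI rank to one GPU per node.
 * Ranks are grouped by host name (the "colour" for MPI_Comm_split),
 * and each rank's position in its node communicator picks the GPU.
 * Error checking omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Derive a colour from the host name so ranks on the same node share it. */
    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    unsigned hash = 0;
    MPI_Get_processor_name(host, &len);
    for (int i = 0; i < len; ++i)
        hash = 31u * hash + (unsigned char)host[i];
    int colour = (int)(hash & 0x7fffffffu);   /* colour must be non-negative */

    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &node_comm);

    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    /* One GPU per local rank; with compute-exclusive mode set, a wrong
     * mapping shows up immediately as a failed context creation. */
    cudaSetDevice(local_rank);
    cudaError_t err = cudaFree(0);
    printf("world rank %d on %s -> GPU %d (%s)\n",
           world_rank, host, local_rank, cudaGetErrorString(err));

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```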

I was using contexts, but when I replaced cuCtxCreate with cudaSetDevice it worked!

I had only been using contexts because of that code snippet from mfatica (which would not compile anyway).

So why do contexts not work in this case?

No idea, I am afraid. I haven’t used the driver API all that much.

I must admit that in our cluster we have a very simple 1:1 node-to-GPU relationship, and I have never really had much trouble getting MPI-CUDA hybrid codes working with pretty simple code and the runtime API. Glad to hear you got it working, though.