Using a cluster of S1070s

I’m trying to use a cluster of S1070s but not having much success. I can use MPI across the 4 devices within one S1070, but here’s what happens when I try to use 8 devices across two S1070s in the cluster.

I launch the MPI job with

mpirun -hostfile hostsfile -np 8

and hostsfile is
node0
node1
node0
node1
node0
node1
node0
node1

and I map the MPI processes to devices with this switch statement:

//configuration for 2x4
switch (rank)
{
    case 0:
    case 1: DEVICE = 0;
            break;

    case 2:
    case 3: DEVICE = 1;
            break;

    case 4:
    case 5: DEVICE = 2;
            break;

    case 6:
    case 7: DEVICE = 3;
            break;
}

I assumed that this would map
Process 0 → node 0 device 0
Process 1 → node 1 device 0
Process 2 → node 0 device 1
Process 3 → node 1 device 1
Process 4 → node 0 device 2
Process 5 → node 1 device 2
Process 6 → node 0 device 3
Process 7 → node 1 device 3

but apparently not so. What appears to be happening is that processes 0 and 1 are both being mapped onto node 0, device 0; as a consequence that device runs out of memory, many of the variables in process 1 never get allocated, and I get invalid device pointers when I try to use them.

Similarly for processes 2 and 3, 4 and 5, and 6 and 7.

Does anyone have any suggestions as to what is going wrong and how to fix it?

This really doesn’t have anything to do with CUDA. It is a classic MPI processor affinity question.

Having said that, there are a few things you should do to fix this.

1. Are you using MPICH2? If so, use the MPICH2 --ncpus option and specify the hosts in the hosts file as

host:ncpus

That takes MPICH2 out of “round-robin” allocation and into SMP mode, where nodes get “filled” linearly up to the total number of CPUs you specify, and only round-robins after that. The details of how to do it are in the MPICH2 user’s guide PDF. This makes the node-rank relationship much more predictable. Open MPI has a command line switch that does the same thing, but I can’t remember what it is off the top of my head. If you are using a more exotic MPI implementation, you are on your own. An example hosts file in this format is sketched after this list.

2. Use MPI_Comm_split in your code to split the communicator into colours and then reassign ranks within each colour group. Massimo Fatica posted a nice code snippet a while ago that shows the general idea; a minimal sketch of the approach follows this list.

3. Get someone to put your S1070s into compute exclusive mode, so that when/if your code or MPI setup goes wrong, you can’t wind up with more than one live compute context per GPU. An example nvidia-smi invocation follows this list.
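
For suggestion 1, a hosts file in the host:ncpus form might look like this, assuming four ranks (one per GPU) on each node, with the node names taken from your original hostsfile:

node0:4
node1:4

With the nodes filled linearly, ranks 0-3 land on node0 and ranks 4-7 on node1, so something like DEVICE = rank % 4 (or the per-node rank from suggestion 2) gives every rank its own GPU. If you stay with Open MPI, the hostfile equivalent is the slots= syntax, e.g. node0 slots=4, which its usual by-slot scheduling also fills linearly.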
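
For suggestion 2, here is a minimal sketch of the colour-splitting idea. This is not Massimo’s snippet: the hostname hash used as the colour and the one-rank-per-GPU assumption are mine, so adapt as needed.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len, local_rank;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Derive a per-node "colour" from the host name */
    MPI_Get_processor_name(name, &len);
    unsigned int colour = 5381;
    for (int i = 0; i < len; i++)   /* simple string hash; assumes different hostnames don't collide */
        colour = colour * 33u + (unsigned char)name[i];

    /* All ranks on the same node share a colour, so they end up in the same sub-communicator */
    MPI_Comm_split(MPI_COMM_WORLD, (int)(colour & 0x7fffffff), rank, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* With 4 ranks and 4 GPUs per node, the local rank picks the device directly */
    cudaSetDevice(local_rank);
    printf("global rank %d on %s -> device %d\n", rank, name, local_rank);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

Compile with mpicc and link against the CUDA runtime, then launch exactly as before; each rank now picks its device from its position on its own node rather than from the global rank, so the mapping no longer depends on how mpirun happens to distribute ranks.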
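
For suggestion 3, on reasonably recent drivers the admin can set compute-exclusive mode per GPU with nvidia-smi, something along the lines of (run as root on every node; the exact flags have changed between driver generations, so check nvidia-smi -h on your system):

nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

repeated for each GPU index. In that mode, a second process that tries to create a context on a GPU already in use fails immediately instead of silently sharing, and exhausting, its memory.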