I have just successfully run an MPI program that uses CUDA. I set it up so that I have a 2-node cluster, with a combined total of 10 CPUs. I launched 10 instances of MPI (1 per CPU) using mpdboot and a host file. My program is a simple MPI program that executes 1 process per CPU, and each of those processes calls a CUDA kernel.
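In case it helps anyone, bringing up the mpd ring looked roughly like this (the host names here are placeholders, not my real nodes):

[codebox]$ cat mpd.hosts
node01
node02
$ mpdboot -n 2 -f mpd.hosts
$ mpdtrace
node01
node02[/codebox]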
This works because each of the CUDA cards in the cluster supports the “Default” compute mode, where a single card can be used simultaneously by multiple processes. However, running more kernels than there are CUDA cards is probably not the best practice.
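If you want to double-check which compute mode your cards are in, something like the following works, I believe, with a CUDA 2.2 or newer runtime (older runtimes don't expose the computeMode field):

[codebox]/* checkmode.cu -- small standalone helper, not part of the program below */
#include <stdio.h>

int main(void)
{
	int ndev = 0, i;
	cudaGetDeviceCount(&ndev);
	for (i = 0; i < ndev; i++) {
		cudaDeviceProp prop;
		cudaGetDeviceProperties(&prop, i);
		/* computeMode: 0 = Default (shared), 1 = Exclusive, 2 = Prohibited */
		printf("device %d (%s): computeMode = %d\n", i, prop.name, prop.computeMode);
	}
	return 0;
}[/codebox]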
Also, there are a total of 5 CUDA cards across these 2 nodes, and as it stands all the processes running on the same host will execute on the same card, leaving the secondary and tertiary cards idle.
This makes me wonder if there is a CUDA-aware Torque / Maui module or plug-in out there that can handle launching only as many kernels on a box as it has cards, and that can set the active CUDA card differently for each process on a given box. It could possibly even alter the data used to invoke the kernel so as to tune the program for the different cards. That may not be a good idea, though, as it gets into how to actually split up the job, which has historically been the job of the programmer and not the tools.
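I haven't found such a plug-in, but as a stopgap each process can at least pick its own card right after MPI_Init. This is just a rough sketch I put together (the select_device helper is my own invention, not part of any existing tool): it computes a per-host rank by comparing processor names and then calls cudaSetDevice.

[codebox]/* devselect.c -- hypothetical sketch; compile and link the same way as mpi.c below */
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>

void select_device(void)
{
	char name[MPI_MAX_PROCESSOR_NAME], (*all)[MPI_MAX_PROCESSOR_NAME];
	int rank, size, len, i, local_rank = 0, ndev = 0;

	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);
	memset(name, 0, sizeof(name));
	MPI_Get_processor_name(name, &len);

	/* gather every process's host name so we can count neighbours on this box */
	all = malloc(size * sizeof(*all));
	MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
	              all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

	/* my "local rank" = number of lower-ranked processes sharing my host name */
	for (i = 0; i < rank; i++)
		if (strcmp(all[i], name) == 0)
			local_rank++;
	free(all);

	/* spread the processes on this host across its cards */
	cudaGetDeviceCount(&ndev);
	if (ndev > 0)
		cudaSetDevice(local_rank % ndev);
}[/codebox]

Calling something like select_device() right after MPI_Init (before launching any kernels) would at least keep every process on a node from piling onto card 0.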
Also, MPI supports different processes talking to each other and syncing with each other, which is currently impossible to do from within CUDA. This is because not all blocks in a given CUDA program execute simultaneously if, say, there are more blocks to run than there are multiprocessors available to run them. So the blocks themselves can't talk to each other on the same host, and the threads within a block certainly cannot talk to a thread on a different host through some sort of MPI call. “Fixing” this is probably impossible, as it is tied pretty directly to the architecture of the graphics chip. Plus, GPUs don't know anything about network cards, so the GPU would have to notify the CPU to run an MPI call and then wait out the network lag, and the whole thing would be a disaster.
So, I guess what I’m trying to say is that with the current technologies, it only makes sense to use CUDA and MPI together if your problem has extremely fine-grained parallelism and ONLY has extremely local data dependencies (or no dependencies at all, which is the best case). This really narrows down the space of problems that this technology can attack. On the other hand, it can still do some very cool things.
For people searching on the net, here is what I did to make my simple MPI and CUDA application. It is split into two files, kernel.cu (the CUDA part) and mpi.c (the MPI part):
kernel.cu:
[codebox]#include <stdio.h>

__global__ void kernel(int *array1, int *array2, int *array3)
{
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	array3[index] = array1[index] + array2[index];
}

extern "C" void run_kernel()  /* wrapper so the plain-C code in mpi.c can launch the kernel */
{
	int i, array1[6], array2[6], array3[6];
	int *devarray1, *devarray2, *devarray3;
	for(i = 0; i < 6; i++) {
		array1[i] = i;
		array2[i] = 3 - i;
	}
	cudaMalloc((void**) &devarray1, sizeof(int)*6);
	cudaMalloc((void**) &devarray2, sizeof(int)*6);
	cudaMalloc((void**) &devarray3, sizeof(int)*6);
	cudaMemcpy(devarray1, array1, sizeof(int)*6, cudaMemcpyHostToDevice);
	cudaMemcpy(devarray2, array2, sizeof(int)*6, cudaMemcpyHostToDevice);
	kernel<<<2, 3>>>(devarray1, devarray2, devarray3); /* 2 blocks x 3 threads = 6 elements */
	cudaMemcpy(array3, devarray3, sizeof(int)*6, cudaMemcpyDeviceToHost);
	for(i = 0; i < 6; i++)
		printf("%d ", array3[i]);
	printf("\n");
}[/codebox]

mpi.c:
[codebox]#include <mpi.h>

void run_kernel(); /* defined in kernel.cu */

int main(int argc, char *argv[])
{
	int rank, size;

	MPI_Init (&argc, &argv);               /* starts MPI */
	MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
	MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
	run_kernel();                          /* each process runs the CUDA part */
	MPI_Finalize();
	return 0;
}[/codebox]
now, for compilation:
[codebox]$ nvcc -c kernel.cu
$ mpicc -o mpicuda mpi.c kernel.o -lcudart -L /usr/local/cuda/lib -I /usr/local/cuda/include[/codebox]
and running it:
[codebox]$ mpirun -l -np 10 ./mpicuda
1: 3 3 3 3 3 3
9: 3 3 3 3 3 3
8: 3 3 3 3 3 3
2: 3 3 3 3 3 3
7: 3 3 3 3 3 3
6: 3 3 3 3 3 3
0: 3 3 3 3 3 3
4: 3 3 3 3 3 3
5: 3 3 3 3 3 3
3: 3 3 3 3 3 3[/codebox]