Mixed CUDA and MPI programming

I’m trying to distribute my data in blocks across a cluster of S1070s, but each device needs to process its block of data against all the other blocks. I had hoped I could use MPI_Send and MPI_Recv to transfer the blocks between devices, but I’m getting segmentation faults. Is this because MPI_Send and MPI_Recv only operate on host data, not device data?

Yes. You can’t copy directly from GPU to GPU, and MPI knows nothing about CUDA: it expects host pointers, so passing a device pointer to MPI_Send or MPI_Recv will crash.

So, assuming one device holds the data for one block of particles, sending that block from one device to all the other devices would require:

  1. copying the data block from the device to the process’s host memory via cudaMemcpy(hostdata, devicedata, size, cudaMemcpyDeviceToHost)

  2. MPI_Send-ing hostdata to every other process so each has its own copy, e.g. for (proc = 0; proc < nprocs; proc++) if (proc != myrank) MPI_Send(hostdata, count, type, proc, 0, MPI_COMM_WORLD), with corresponding calls to MPI_Recv by the receiving processes

  3. each process then copying hostdata to its own devicedata via cudaMemcpy(devicedata, hostdata, size, cudaMemcpyHostToDevice)
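The three steps above can be sketched roughly as follows. This is only a sketch: the names `devicedata`, `hostdata`, and `size` come from the thread, while `root`, `myrank`, `nprocs`, and the use of MPI_BYTE are my assumptions. To avoid every rank blocking in MPI_Send at once, it distributes one rank's block at a time, with the owner sending and everyone else receiving.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Send `root`'s block of `size` bytes from its device to the devices of
 * all other ranks. hostdata is a host staging buffer on every rank. */
void distribute_block(char *devicedata, char *hostdata, size_t size,
                      int root, int myrank, int nprocs)
{
    int proc;

    if (myrank == root) {
        /* step 1: device -> host on the owning rank */
        cudaMemcpy(hostdata, devicedata, size, cudaMemcpyDeviceToHost);

        /* step 2: host -> every other rank's host */
        for (proc = 0; proc < nprocs; proc++)
            if (proc != root)
                MPI_Send(hostdata, (int)size, MPI_BYTE,
                         proc, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(hostdata, (int)size, MPI_BYTE,
                 root, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* step 3: host -> device on each receiving rank */
        cudaMemcpy(devicedata, hostdata, size, cudaMemcpyHostToDevice);
    }
}
```

Note that the send loop in step 2 is exactly what MPI_Bcast does in one call (and usually more efficiently, e.g. as a tree), so MPI_Bcast(hostdata, (int)size, MPI_BYTE, root, MPI_COMM_WORLD) on every rank could replace the explicit Send/Recv pair.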

It seems a bit time consuming. Is there a faster way?

That’s the way to go.

There is no way to transfer data from a GPU, or to a GPU, without going through the host first.

It’s possible to hide memory transfer times by using streams and asynchronous MPI sends in some situations. Also, do you have one process per GPU, or one process per S1070? It may be more efficient to have one process control all four devices on one S1070.
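For the one-process-per-GPU layout, each rank needs to claim a distinct device on its node. A minimal sketch, assuming ranks are packed four per host node to match the four devices an S1070 exposes (the `myrank % 4` mapping is my assumption, not something from the thread):

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* Assumption: 4 ranks per node, one per S1070 device. Must be
     * called before any other CUDA call that creates a context. */
    cudaSetDevice(myrank % 4);

    /* ... allocations, kernels, and transfers for this rank's GPU ... */

    MPI_Finalize();
    return 0;
}
```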

I was hoping that maybe step 2 could be omitted somehow: once the sending device has transferred its data to the host, all other processes would see that data and read it directly into their device memory, rather than making their own host copy first and then transferring that to their device.

Well… if you’re transferring a large amount of data you could do it in many small chunks. That way data can be copied off the GPU at the same time as earlier chunks are being sent with MPI.
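The chunking idea could look something like this on the sending side. It is a sketch under assumptions: `nchunks`, `chunk`, and `dest` are hypothetical parameters, and for the device-to-host copy to actually overlap the MPI send, `hostbuf` must be pinned memory (allocated with cudaMallocHost) holding two chunks, so the copy and the send can use different halves.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Send nchunks chunks of `chunk` bytes each from devicedata to rank
 * `dest`, overlapping the D2H copy of chunk i+1 with the send of
 * chunk i. hostbuf: pinned, at least 2*chunk bytes. */
void send_pipelined(const char *devicedata, char *hostbuf,
                    size_t chunk, int nchunks, int dest)
{
    cudaStream_t stream;
    MPI_Request req;
    int i;

    cudaStreamCreate(&stream);

    /* prefetch chunk 0 into half 0 of the host buffer */
    cudaMemcpyAsync(hostbuf, devicedata, chunk,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    for (i = 0; i < nchunks; i++) {
        char *cur = hostbuf + (size_t)(i % 2) * chunk;
        char *nxt = hostbuf + (size_t)((i + 1) % 2) * chunk;

        /* start copying chunk i+1 off the GPU... */
        if (i + 1 < nchunks)
            cudaMemcpyAsync(nxt, devicedata + (size_t)(i + 1) * chunk,
                            chunk, cudaMemcpyDeviceToHost, stream);

        /* ...while chunk i goes out over MPI */
        MPI_Isend(cur, (int)chunk, MPI_BYTE, dest, i,
                  MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        cudaStreamSynchronize(stream);  /* copy of chunk i+1 is done */
    }

    cudaStreamDestroy(&stream);
}
```

The receiving side would mirror this: MPI_Recv each chunk into one half of a pinned buffer while cudaMemcpyAsync uploads the previous chunk from the other half.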

That is possibly a very good suggestion! :thumbsup:

Thanks a lot.