Mixed CUDA and MPI programming

I’m trying to distribute my data in blocks across a cluster of S1070s, but each device needs to process its block of data against all the other blocks. I had hoped I could use MPI_Send and MPI_Recv to transfer the blocks between devices, but I’m getting segmentation faults. Is this because MPI_Send and MPI_Recv only operate on host data, not device data?

Yes. You can’t copy directly from GPU to GPU, and MPI knows nothing about CUDA: it expects host pointers, so passing a device pointer to MPI_Send or MPI_Recv will crash.

So, assuming one device holds the data for one block of particles, sending that block from one device to all the other devices would require:

  1. copying the data block from the device to the process’s host memory via cudaMemcpy(hostdata, devicedata, size, cudaMemcpyDeviceToHost)

  2. MPI_Send-ing hostdata to every other process so each has its own copy, e.g. for (proc = 0; proc < nprocs; proc++) if (proc != myrank) MPI_Send(hostdata, count, type, proc, 0, MPI_COMM_WORLD), with corresponding calls to MPI_Recv by the receiving processes

  3. each process then copying hostdata to its own devicedata via cudaMemcpy(devicedata, hostdata, size, cudaMemcpyHostToDevice)
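The three steps above can be sketched roughly as follows. This is only a sketch: the names `devicedata`, `hostdata`, and `size` come from the thread, while `root`, `myrank`, `nprocs`, and the use of MPI_BYTE are my assumptions. To avoid every rank blocking in MPI_Send at once, it distributes one rank's block at a time, with the owner sending and everyone else receiving.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Send `root`'s block of `size` bytes from its device to the devices of
 * all other ranks. hostdata is a host staging buffer on every rank. */
void distribute_block(char *devicedata, char *hostdata, size_t size,
                      int root, int myrank, int nprocs)
{
    int proc;

    if (myrank == root) {
        /* step 1: device -> host on the owning rank */
        cudaMemcpy(hostdata, devicedata, size, cudaMemcpyDeviceToHost);

        /* step 2: host -> every other rank's host */
        for (proc = 0; proc < nprocs; proc++)
            if (proc != root)
                MPI_Send(hostdata, (int)size, MPI_BYTE,
                         proc, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(hostdata, (int)size, MPI_BYTE,
                 root, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* step 3: host -> device on each receiving rank */
        cudaMemcpy(devicedata, hostdata, size, cudaMemcpyHostToDevice);
    }
}
```

Note that the send loop in step 2 is exactly what MPI_Bcast does in one call (and usually more efficiently, e.g. as a tree), so MPI_Bcast(hostdata, (int)size, MPI_BYTE, root, MPI_COMM_WORLD) on every rank could replace the explicit Send/Recv pair.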

It seems a bit time consuming. Is there a faster way?

That’s the way to go.

There is no way to transfer data from a GPU, or to a GPU, without going through the host first.

It’s possible to hide memory transfer times by using streams and asynchronous MPI sends in some situations. Also, do you have one process per GPU, or one process per S1070? It may be more efficient to have one process control all four devices on one S1070.
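For the one-process-per-GPU layout, each rank needs to claim a distinct device on its node. A minimal sketch, assuming ranks are packed four per host node to match the four devices an S1070 exposes (the `myrank % 4` mapping is my assumption, not something from the thread):

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* Assumption: 4 ranks per node, one per S1070 device. Must be
     * called before any other CUDA call that creates a context. */
    cudaSetDevice(myrank % 4);

    /* ... allocations, kernels, and transfers for this rank's GPU ... */

    MPI_Finalize();
    return 0;
}
```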

I was hoping that maybe step 2 could be omitted somehow: once the sending device has transferred its data to the host, all other processes would see that data and read it directly into their device memory, rather than making their own host copy first and then transferring that to their device.

Well… if you’re transferring a large amount of data you could do it in many small chunks. That way data can be copied off the GPU at the same time as earlier chunks are being sent with MPI.
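The chunking idea could look something like this on the sending side. It is a sketch under assumptions: `nchunks`, `chunk`, and `dest` are hypothetical parameters, and for the device-to-host copy to actually overlap the MPI send, `hostbuf` must be pinned memory (allocated with cudaMallocHost) holding two chunks, so the copy and the send can use different halves.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Send nchunks chunks of `chunk` bytes each from devicedata to rank
 * `dest`, overlapping the D2H copy of chunk i+1 with the send of
 * chunk i. hostbuf: pinned, at least 2*chunk bytes. */
void send_pipelined(const char *devicedata, char *hostbuf,
                    size_t chunk, int nchunks, int dest)
{
    cudaStream_t stream;
    MPI_Request req;
    int i;

    cudaStreamCreate(&stream);

    /* prefetch chunk 0 into half 0 of the host buffer */
    cudaMemcpyAsync(hostbuf, devicedata, chunk,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    for (i = 0; i < nchunks; i++) {
        char *cur = hostbuf + (size_t)(i % 2) * chunk;
        char *nxt = hostbuf + (size_t)((i + 1) % 2) * chunk;

        /* start copying chunk i+1 off the GPU... */
        if (i + 1 < nchunks)
            cudaMemcpyAsync(nxt, devicedata + (size_t)(i + 1) * chunk,
                            chunk, cudaMemcpyDeviceToHost, stream);

        /* ...while chunk i goes out over MPI */
        MPI_Isend(cur, (int)chunk, MPI_BYTE, dest, i,
                  MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        cudaStreamSynchronize(stream);  /* copy of chunk i+1 is done */
    }

    cudaStreamDestroy(&stream);
}
```

The receiving side would mirror this: MPI_Recv each chunk into one half of a pinned buffer while cudaMemcpyAsync uploads the previous chunk from the other half.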

That is possibly a very good suggestion! :thumbsup:

Thanks a lot.