Question about using cudaMemcpy in mixed CUDA/MPI Programming

Is there an MPI function that can send/receive or broadcast float4?

I am working with a number of float4 variables (position, velocity, etc.) and need to use MPI to send and receive blocks of data between MPI processes. MPI functions require a datatype argument, e.g. MPI_FLOAT. My CUDA kernels are all set up to use float4s, but to transmit between processes I need to change these float4s into MPI_FLOAT.

So on the host side I have the floats h_x, h_y, h_z, but on the device side I have float4 d_x.

A typical communication would be

1. Process i uses cudaMemcpy to copy a float4 d_x to the host variables float h_x, float h_y and float h_z.
2. Process i calls MPI_Bcast on h_x, h_y and h_z (but this can only be in MPI_FLOAT).
3. All processes cudaMemcpy their copies of h_x, h_y and h_z to float4 d_x.

How can I transfer d_x into h_x, h_y and h_z, and vice versa after the MPI_Bcast?

No, MPI has no built-in float4 type. You will need to use something like MPI_Type_contiguous() to describe the float4 to MPI as a derived datatype.

Yes, that looks like the way to go.

Thanks.