I’ve been given this CFD code to optimise. Many of the variables are bundled into float4s (e.g. position in x, y, z and pressure in w), but in many cases only one component needs to be transferred between devices. The data is held in a C++ vector of float4* called fields, and to transfer a field between devices the following call is made:
You could use cudaMemcpy2DAsync to do a strided copy of just the component you need; however, the performance may not be at maximum rates.
Alternatively, you could create a temporary buffer, gather the selected data into that buffer on the source device, then transfer the buffer. You’d have to do an equivalent scatter operation on the other end.
This can still be a win because device memory bandwidth is usually much higher than device->device transfer bandwidth.
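For what it's worth, that pack/transfer/unpack idea might be sketched roughly as below. The kernel and variable names (pack_w, unpack_w, srcDev, dstDev, and the staging buffers) are made up for illustration, the streams are assumed to have been created on their respective devices, and error checking is omitted:

```
__global__ void pack_w(const float4* __restrict__ field, float* __restrict__ buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = field[i].w;        // gather the pressure (w) component only
}

__global__ void unpack_w(float4* __restrict__ field, const float* __restrict__ buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i].w = buf[i];        // scatter back into the float4 array
}

void transfer_w(float4* srcField, int srcDev, cudaStream_t srcStream,
                float4* dstField, int dstDev, cudaStream_t dstStream,
                float* srcBuf, float* dstBuf, int n)
{
    int threads = 256, blocks = (n + threads - 1) / threads;

    cudaSetDevice(srcDev);
    pack_w<<<blocks, threads, 0, srcStream>>>(srcField, srcBuf, n);

    // Contiguous copy of n floats instead of n float4s between the devices.
    cudaMemcpyPeerAsync(dstBuf, dstDev, srcBuf, srcDev, n * sizeof(float), srcStream);
    cudaStreamSynchronize(srcStream);      // wait until the buffer has arrived

    cudaSetDevice(dstDev);
    unpack_w<<<blocks, threads, 0, dstStream>>>(dstField, dstBuf, n);
}
```

An event could replace the stream synchronisation to keep the host from blocking, but the basic pattern is the same.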
Using 2D copies is the most convenient way. However, such strided copies may run at only about one tenth of the speed of contiguous bulk copies between CPU and GPU. Where performance is important, re-sorting the data first, either in a CPU-side buffer small enough to fit into the L1 cache or on the GPU side, therefore deserves serious consideration. As txbob points out, this takes advantage of memory bandwidth in the range of hundreds of GB per second.
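If the data starts out in host memory, the CPU-side variant of that repacking could look roughly like the sketch below. All names are illustrative, the chunk size is just a cache-friendly guess, and error checking is omitted:

```
#include <cuda_runtime.h>

void upload_w_via_host(const float4* hostField, float* d_wBuf, size_t n, cudaStream_t stream)
{
    const size_t chunk = 4096;                        // 16 KB of floats, small enough to stay cache resident
    float* staging = nullptr;
    cudaMallocHost((void**)&staging, chunk * sizeof(float));  // pinned, so the uploads can be asynchronous

    for (size_t base = 0; base < n; base += chunk) {
        size_t count = (n - base < chunk) ? (n - base) : chunk;
        for (size_t i = 0; i < count; ++i)
            staging[i] = hostField[base + i].w;       // contiguous gather of the w component
        cudaMemcpyAsync(d_wBuf + base, staging, count * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);                // the staging buffer is reused, so wait before refilling
    }
    cudaFreeHost(staging);
}
```

Double-buffering the staging area would avoid the synchronisation inside the loop; the point is only that the gather itself stays cheap because the buffer stays cache resident.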
Yes, I can see the advantage of writing the w component to a buffer and transferring the buffer, but how would I use cudaMemcpy2DAsync to transfer just the w component of a 1d array of float4?
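For illustration, such a strided call could look roughly like this. It treats the float4 array as n rows with a pitch of sizeof(float4) and copies a sizeof(float)-wide column starting at the w member; the pointer names are placeholders, and the destination here is assumed to be a tightly packed float buffer:

```
#include <cuda_runtime.h>
#include <cstddef>

void copy_w_component(const float4* d_field, float* d_wBuf, size_t n, cudaStream_t stream)
{
    const char* src = reinterpret_cast<const char*>(d_field) + offsetof(float4, w);

    cudaMemcpy2DAsync(d_wBuf, sizeof(float),    // dst and dpitch: pack the floats contiguously
                      src,    sizeof(float4),   // src and spitch: stride over whole float4s
                      sizeof(float),            // width of each copied row, in bytes
                      n,                        // height, i.e. the number of elements
                      cudaMemcpyDeviceToDevice, stream);
}
```

To write straight into the w slots of a float4 array on the other end instead of a packed buffer, the destination pointer would be offset in the same way and dpitch would become sizeof(float4); for a copy between two different GPUs with unified addressing, cudaMemcpyDefault would be the kind to pass.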