Using cudaMemcpy2DAsync to copy w component of one float4 array to another float4 array on different devices


I’m trying to copy the w component of each float4 in one array into the w component of the corresponding float4 in another array, with this code:

cudaMemcpy2DAsync(d_arrayA+3*sizeof(float), sizeof(float4), ...

But I get the error: code=11(cudaErrorInvalidValue)

I’ve checked all values and all are valid and not null.

Can I just check that I understand how cudaMemcpy2DAsync is called, i.e. that referencing the w components via 3*sizeof(float) and using sizeof(float4) as the source and destination pitches is correct?

Thanks in advance

What are the types of the various pointers? Are they all char, unsigned char, or some other single-byte pointer type? That would be OK, of course, because you happen to be doing byte-based pointer arithmetic. But if d_arrayA/d_arrayB are actually float or float4 pointers, then this is wrong.

Both arrays are float4. I should have mentioned that, sorry, but that’s why I asked about the pitches.

So you can’t use float4 pointers that way. When you program in CUDA, don’t throw away your knowledge of C/C++. You are not doing proper pointer arithmetic here.

With float4 * pointers, you cannot reference an individual element like the w element. When you do pointer arithmetic on a float4 pointer, you are offsetting it by units of float4. That is how pointer arithmetic works. It’s not a byte offset, unless it happens to be some kind of byte pointer.

So I would try something like this:

float *d_arrayAf = (float *)d_arrayA;
float *d_arrayBf = (float *)d_arrayB;

//offsetSrc *= 4;  something like this may also be needed

cudaMemcpy2DAsync(d_arrayAf+3, sizeof(float4), ...

Note that offsetSrc will probably have to be modified also. If it is being computed as an offset to a float4 pointer, you will need to multiply it by 4, when used above as an offset to a float pointer.
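Filled out, the call could look something like the sketch below. The names N (element count) and stream are placeholders, and I’m assuming cudaMemcpyDefault works for your cross-device case (it requires unified virtual addressing, and peer access should be enabled for a direct path):

```cuda
// Sketch only: copy just the w component of each of N float4 elements.
cudaMemcpy2DAsync(d_arrayAf + 3,     // dst: w of first destination element
                  sizeof(float4),    // dst pitch: one float4 per "row"
                  d_arrayBf + 3,     // src: w of first source element
                  sizeof(float4),    // src pitch
                  sizeof(float),     // width: one float copied per row
                  N,                 // height: number of elements
                  cudaMemcpyDefault, stream);
```

Each "row" of the 2D copy is a single float, and the pitch of sizeof(float4) steps the copy from one element's w to the next.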

You were correct.

But how does cudaMemcpy2DAsync work here, extracting a float from each float4 and exchanging the data between devices?

I’m finding that for larger arrays, sending the whole float4 array is actually much faster than using cudaMemcpy2DAsync; for smaller arrays it’s the other way around, but the time difference is not that big.

Yes, that is a common observation. cudaMemcpy2D is not particularly efficient. That is, the average time per byte transferred is generally higher with cudaMemcpy2D than it is with cudaMemcpy.

If you search for questions about cudaMemcpy2D, you will find others who have reached the same conclusion.

cudaMemcpy2D performs strided copies. The apparent efficiency will improve as the ratio of the data transferred per row divided by the row stride increases. In your case this ratio is rather low, at about 1/4. It works better, efficiency-wise, when the ratio is closer to 1.0. Its primary intended use is/was for copying data to/from arrays allocated with cudaMallocPitch.

The usual suggested alternative is to copy all the data, and write a CUDA kernel to merge the data in device memory, where the memory bandwidth is much higher.
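The kernel-merge approach could look something like this sketch (the names d_arrayA, d_scratch, and n are placeholders, not from this thread): copy the whole float4 array across devices with an ordinary cudaMemcpyAsync into a scratch buffer, then merge only the w components on-device, where bandwidth is high.

```cuda
// Sketch: merge the w components of src into dst, element by element.
__global__ void mergeW(float4 *dst, const float4 *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i].w = src[i].w;
}

// After the bulk copy into d_scratch completes:
//   mergeW<<<(n + 255) / 256, 256>>>(d_arrayA, d_scratch, n);
```

This trades extra transferred bytes (the unused x/y/z components) for a single contiguous copy plus a cheap device-side pass, which is usually the better deal for larger arrays.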

Yes, that’s what I’ve decided to do instead.

Thanks for your help and comments.