About the speed of cudaMemcpy

I have recently started using CUDA, and something strange has been confusing me for a long time. Can you help me?

The problem is this: when I time cudaMemcpy, I find that copying about 1.5 MB from host to device takes very little time (it shows 0 ms), while copying 0.6 MB from device to host takes as much as 15 ms!

I don't know why this happens. Is something wrong in my program, or is it just very slow for the GPU to transfer data to the host?

I would really appreciate your answer.

Can you show your code?

Sounds like a timing issue

int ffFitSize = ffFitWid * ffFitHei * sizeof(uchar3);
cudaMemcpy(outputImage, ffFit, ffFitSize, cudaMemcpyDeviceToHost);

Just these two lines. But I found that if ffFit is float4, the copy takes 16 ms, while if it is uchar4, it takes very little, about 2 ms. Why does this happen?

Maybe because a uchar4 is 4 bytes and a float4 is 16 bytes?

That is a factor of 4.

Yes, but the copy is actually more than 8 times slower.
Why?
Does cudaMemcpy have some trick to copy char faster than float?

That, my friend, I don't know. I have never really understood the memcpy and what it does.

I came across the same problem, and it was just a timing issue. See Section 3.2.6.2 of the 2.2 Programming Guide.
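
To illustrate what such a timing issue can look like: kernel launches return to the host immediately, and a blocking cudaMemcpy issued afterwards has to wait for the kernel to finish. A host timer wrapped around only the cudaMemcpy may therefore include most of the kernel's run time, which is one plausible reason a float4 version could look much slower than a uchar4 version. The sketch below is not the original poster's program. The kernel fillImage, the helper elapsedMs, and the 256x256 image size are all made up for illustration. It uses CUDA events to time the kernel and the copy separately.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for whatever produced ffFit in the thread.
__global__ void fillImage(float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = 0.0f;
        // Artificial work so the kernel takes a measurable amount of time.
        for (int k = 0; k < 10000; ++k) v = v * 0.999f + 1.0f;
        out[i] = make_float4(v, v, v, v);
    }
}

// Waits for `stop` to complete, then returns the milliseconds since `start`.
static float elapsedMs(cudaEvent_t start, cudaEvent_t stop)
{
    float ms = 0.0f;
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

int main()
{
    const int width = 256, height = 256;      // assumed image size
    const int n = width * height;
    const size_t bytes = n * sizeof(float4);

    float4 *hImage = (float4 *)malloc(bytes);
    float4 *dImage = NULL;
    if (hImage == NULL || cudaMalloc((void **)&dImage, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0, 0);
    fillImage<<<(n + 255) / 256, 256>>>(dImage, n);  // returns immediately
    cudaEventRecord(t1, 0);   // completes only after the kernel has finished
    cudaMemcpy(hImage, dImage, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t2, 0);   // completes after the copy

    printf("kernel: %.3f ms\n", elapsedMs(t0, t1));
    printf("copy:   %.3f ms\n", elapsedMs(t1, t2));

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(dImage);
    free(hImage);
    return 0;
}
```

If you prefer a host-side timer, calling cudaDeviceSynchronize() (cudaThreadSynchronize() in toolkits of the 2.x era) before starting the timer should have a similar effect, since the kernel is then finished before the copy is measured.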