I have recently started using CUDA, and something strange has been confusing me for a long time. Can you help me?
The problem is this: when timing cudaMemcpy, I found that copying about 1.5 MB of data from host to device takes almost no time (the timer shows 0 ms), while copying only 0.6 MB from device to host takes as much as 15 ms!
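To make the measurement concrete, here is a minimal sketch of how the two copies could be timed. It is not my exact program. It assumes CUDA events as the timer and buffer sizes that only approximate the numbers above. `timeCopy` and `CHECK` are hypothetical helper names. The synchronize before the start event is there so that any earlier pending GPU work is not counted as part of the copy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s at %s:%d\n",                          \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: time one cudaMemcpy with CUDA events.
// Returns the elapsed time in milliseconds.
static float timeCopy(void* dst, const void* src, size_t bytes,
                      cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Wait for any earlier GPU work so it is not charged to this copy.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpy(dst, src, bytes, kind));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    // Assumed sizes, chosen only to approximate the numbers in the question.
    const size_t h2dBytes = 1536 * 1024;  // about 1.5 MB, host to device
    const size_t d2hBytes = 600 * 1024;   // about 0.6 MB, device to host

    void* hostBuf = malloc(h2dBytes);
    if (hostBuf == nullptr) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }
    memset(hostBuf, 0, h2dBytes);

    // Force context creation so it is not charged to the first copy.
    CHECK(cudaFree(0));

    void* devBuf = nullptr;
    CHECK(cudaMalloc(&devBuf, h2dBytes));

    float h2dMs = timeCopy(devBuf, hostBuf, h2dBytes, cudaMemcpyHostToDevice);
    float d2hMs = timeCopy(hostBuf, devBuf, d2hBytes, cudaMemcpyDeviceToHost);

    printf("host to device, %zu bytes: %.3f ms\n", h2dBytes, h2dMs);
    printf("device to host, %zu bytes: %.3f ms\n", d2hBytes, d2hMs);

    CHECK(cudaFree(devBuf));
    free(hostBuf);
    return 0;
}
```

I chose events here because they report fractions of a millisecond, so a very short copy does not simply round down to 0 ms.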
I don’t know why this happens. Is something wrong with my program, or is it simply very slow for the GPU to transfer data back to the host?
I would really appreciate your answer.