Slow device to host transfer

I’m trying to transfer data from device to host using cudaMemcpy.
The data is only 12 KB, but the transfer takes about 50 ms.

Is this normal?

Transferring from host to device is much faster, though (4 KB in 230 µs).

Any conceivable cause for this?

EDIT: Actually, no matter the size, it's always 50 ms… that's even weirder :|

I measured about 90 µs of overhead for a host-to-device transfer and 16 µs for a device-to-host transfer.

So your results are not normal, but I have no idea what the cause is. Could you post a bit of your code?
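
In the meantime, here is a minimal sketch of how the copy could be timed in isolation. It is not the code from either post. The helper name time_d2h_ms, the CHECK macro, and the repetition count are made up for illustration. One thing worth ruling out: kernel launches are asynchronous, so a blocking cudaMemcpy issued right after a kernel also waits for that kernel to finish. If the timer starts before the kernel is done, the kernel's run time gets counted as copy time, which might explain a constant 50 ms regardless of size. The sketch calls cudaDeviceSynchronize() before starting the timer to exclude that.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical helper: average time in milliseconds of one
// device-to-host cudaMemcpy of `bytes` bytes, over `reps` copies.
float time_d2h_ms(size_t bytes, int reps)
{
    void *h_buf = malloc(bytes);   // ordinary pageable host memory
    void *d_buf = nullptr;
    if (h_buf == nullptr) {
        fprintf(stderr, "host allocation failed\n");
        exit(EXIT_FAILURE);
    }
    CHECK(cudaMalloc(&d_buf, bytes));
    CHECK(cudaMemset(d_buf, 0, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up copy so one-time setup costs are not measured.
    CHECK(cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost));

    // Wait for all previously queued GPU work (kernels, memsets)
    // so that it is not counted as part of the copy.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i) {
        CHECK(cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost));
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float total_ms = 0.0f;
    CHECK(cudaEventElapsedTime(&total_ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_buf));
    free(h_buf);
    return total_ms / reps;
}

int main()
{
    const size_t bytes = 12 * 1024;  // 12 KB, the size from the question
    printf("device to host, %zu bytes: %.3f ms per copy\n",
           bytes, time_d2h_ms(bytes, 100));
    return 0;
}
```

If this reports a time far below 50 ms on the same machine, the delay is probably coming from work queued before the copy rather than from the transfer itself. Timing the kernel and the copy separately, with a cudaDeviceSynchronize() between them, should show which one it is.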