Can anyone explain the difference in time?

I tried to time the duration of a kernel and the memcpy to and from the device, with and without the kernel running.

Here are the results:

with kernel:
duration of memcpy (dynamic array of 8388608 ints) host to device: 33ms
duration of kernel: 0.074ms
duration of memcpy (dynamic array of 1 int) device to host: 30ms

without kernel:
duration of memcpy (dynamic array of 8388608 ints) host to device: 33ms
duration of memcpy (dynamic array of 1 int) device to host: 0.032ms

How come the duration of memcpy changes so drastically when the kernel is removed?

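Are you sure you call cudaThreadSynchronize() after launching your kernel? It seems that is not the case, so you’re getting misleading results. A kernel launch is asynchronous: the call returns to the host right away, so your 0.074ms is only the launch overhead. The kernel’s actual running time is then accounted to the memcpy, because memcpy performs an implicit synchronization and has to wait for the kernel to finish before it can copy.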
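Something like the following is what I mean. It is only a rough sketch, not your code: the kernel countNonZero, the helper names msSince and runTimed, and the 256-thread block size are all made up for illustration, since the original kernel isn't shown. It uses cudaDeviceSynchronize(), which is the newer name for the now-deprecated cudaThreadSynchronize(); on an old toolkit the older call goes in the same place. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the question (not shown in the
// thread): counts the non-zero elements of data into a single int.
__global__ void countNonZero(const int *data, int n, int *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] != 0)
        atomicAdd(result, 1);
}

// Host-side wall-clock helper: milliseconds elapsed since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

// Runs the kernel on n ones and returns the count copied back from the device.
// launchMs: time until the launch call returns (kernel may still be running).
// kernelMs: time until the kernel has actually finished (after the sync).
// copyMs:   time of the 1-int device-to-host copy, measured after the sync.
int runTimed(int n, double *launchMs, double *kernelMs, double *copyMs)
{
    int *h_data = new int[n];
    for (int i = 0; i < n; ++i) h_data[i] = 1;
    int h_result = 0;

    int *d_data = nullptr, *d_result = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    cudaMalloc((void **)&d_result, sizeof(int));
    cudaMemset(d_result, 0, sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    auto t0 = std::chrono::steady_clock::now();
    countNonZero<<<(n + 255) / 256, 256>>>(d_data, n, d_result);
    *launchMs = msSince(t0);       // launch overhead only
    cudaDeviceSynchronize();       // block the host until the kernel is done
    *kernelMs = msSince(t0);       // launch + execution

    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    *copyMs = msSince(t0);         // no longer includes the kernel's run time

    cudaFree(d_data);
    cudaFree(d_result);
    delete[] h_data;
    return h_result;
}

int main()
{
    double launchMs, kernelMs, copyMs;
    int count = runTimed(8388608, &launchMs, &kernelMs, &copyMs);
    printf("count: %d\n", count);
    printf("launch only: %f ms\n", launchMs);
    printf("kernel (after sync): %f ms\n", kernelMs);
    printf("memcpy device to host: %f ms\n", copyMs);
    return 0;
}
```

If you take the cudaDeviceSynchronize() line out, the kernel's run time should move back into the memcpy figure, which is the pattern in your numbers. For finer-grained GPU-side timing, cudaEventRecord() with cudaEventElapsedTime() is the usual alternative to a host timer.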

Thanks, I’ll try with cudaThreadSynchronize().