I timed a kernel and the memcpy calls to and from the device, both with and without the kernel running.
Here are the results:
with kernel:
duration of memcpy (dynamic array of 8388608 ints) host to device: 33ms
duration of kernel: 0.074ms
duration of memcpy (dynamic array of 1 int) device to host: 30ms
without kernel:
duration of memcpy (dynamic array of 8388608 ints) host to device: 33ms
duration of memcpy (dynamic array of 1 int) device to host: 0.032ms
Why does the duration of the device-to-host memcpy change so drastically (30ms with the kernel, 0.032ms without) when the kernel is removed?
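
One way to check where the time goes is to time each step from the host and optionally synchronize right after the kernel launch. This is only a minimal sketch, not the original code: it assumes host-side timing with std::chrono, and the kernel `sumKernel`, the helper `timeRun`, and the `syncAfterKernel` flag are hypothetical names made up for illustration. Error checking is left out to keep it short.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (the original kernel is not shown):
// adds every input element into a single int on the device.
__global__ void sumKernel(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);
}

// Milliseconds elapsed on the host since t0.
static double msSince(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

struct Timings { double h2d, kernel, d2h; int result; };

// Times host-to-device copy, kernel launch, and device-to-host copy.
// If syncAfterKernel is true, the host waits for the kernel to finish
// before the "kernel" timer is stopped.
Timings timeRun(int n, bool syncAfterKernel) {
    std::vector<int> h_in(n, 1);
    int h_out = 0;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_out, 0, sizeof(int));
    cudaDeviceSynchronize();  // make sure setup is done before timing

    Timings t{};
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    t.h2d = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    sumKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    if (syncAfterKernel) cudaDeviceSynchronize();
    t.kernel = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    t.d2h = msSince(t0);

    t.result = h_out;
    cudaFree(d_in);
    cudaFree(d_out);
    return t;
}

int main() {
    const int n = 8388608;
    timeRun(1024, true);  // warm-up so one-time initialization does not skew timings
    Timings a = timeRun(n, false);
    Timings b = timeRun(n, true);
    printf("no sync:   h2d %.3f ms, kernel %.3f ms, d2h %.3f ms\n", a.h2d, a.kernel, a.d2h);
    printf("with sync: h2d %.3f ms, kernel %.3f ms, d2h %.3f ms\n", b.h2d, b.kernel, b.d2h);
    return 0;
}
```

If the device-to-host time drops in the "with sync" run while the kernel time rises by a similar amount, that would suggest the kernel's execution time was being counted in the memcpy: a kernel launch returns to the host before the kernel finishes, and a blocking cudaMemcpy on the same stream waits for that earlier work to complete.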