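Probably because your timing of the copy also includes the kernel execution time. Kernel launches are asynchronous: control returns to the host as soon as the launch has been queued. A blocking cudaMemcpy() is ordered after the kernel, so unless you put an explicit host-side synchronization such as cudaDeviceSynchronize() (cudaThreadSynchronize() in older toolkits) after your kernel call, the host will block inside cudaMemcpy() until the kernel has finished running and the copy has executed, and your host timer charges all of that to the copy. Note that __syncthreads() will not help here: it is a device-side barrier between the threads of a block, not a way to make the host wait for the kernel.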
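To see the effect for yourself, here is a minimal sketch that times the same device-to-host copy twice, once straight after the launch and once after a host-side synchronization. The kernel `busyKernel`, the helper `timedCopyMs`, and the problem size are made up for illustration and are not taken from your code; I'm also assuming `cudaDeviceSynchronize()` as the synchronization call and leaving out error checking for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: enough arithmetic per element to take measurable time.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 10000; ++k)
            x = x * 1.0001f + 0.0001f;
        data[i] = x;
    }
}

// Host wall-clock time spent inside a blocking device-to-host copy, in ms.
static double timedCopyMs(float *dst, const float *src, size_t bytes)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h(n);
    float *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemset(d, 0, bytes);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup costs do not skew either measurement.
    busyKernel<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();

    // Case 1: no synchronization. The launch returns immediately, so the
    // timed cudaMemcpy() also waits for the kernel to finish.
    busyKernel<<<blocks, threads>>>(d, n);
    double unsyncedMs = timedCopyMs(h.data(), d, bytes);

    // Case 2: wait for the kernel first, then time only the copy.
    busyKernel<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();
    double syncedMs = timedCopyMs(h.data(), d, bytes);

    printf("copy timed without sync: %.3f ms\n", unsyncedMs);
    printf("copy timed after sync:   %.3f ms\n", syncedMs);

    cudaFree(d);
    return 0;
}
```

If the explanation above applies to your case, the first number should come out noticeably larger than the second, with the difference being roughly the kernel's run time. If you want to time the kernel itself, put the timer around the launch plus the cudaDeviceSynchronize() call, or use CUDA events (cudaEventRecord() / cudaEventElapsedTime()).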