I have a kernel. If I ask for only 32 threads, it takes 4ms to complete. If I ask for 512 threads, it takes 0.8ms to complete. If I ask for 768 threads, it takes 0.7ms to complete. Why the more the threads, the faster?
The time recorded includes memory allocation on the host and device side, data transfer between host and device, and kernel execution.