kernel performance and number of threads

I have a kernel. If I ask for only 32 threads, it takes 4 ms to complete. If I ask for 512 threads, it takes 0.8 ms. If I ask for 768 threads, it takes 0.7 ms. Why does it get faster the more threads I use?

The time recorded includes memory allocation on the host and device side, data transfer between host and device, and kernel execution.

Are you changing the thread block size or the grid size? The maximum thread block size is 512 threads, and even that is only possible if each thread uses no more than 16 registers (with 8192 registers per multiprocessor, 8192 / 512 = 16).

It seems that your calls with 512 and 768 threads are not being executed at all. Check whether you are running out of available registers or shared memory.
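One way to check this is to ask the runtime what the kernel actually uses and compare it with the device limits. The following is only a rough sketch: `my_kernel`, `max_block_threads` and `report_limits` are made-up names standing in for your own code, and the register estimate ignores allocation granularity, so treat it as an upper bound rather than an exact figure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel under discussion.
__global__ void my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Largest block size allowed by the register budget alone: registers per
// block divided by registers per thread, capped at the hardware maximum.
// Allocation granularity is ignored, so this is an upper bound.
int max_block_threads(int regs_per_block, int regs_per_thread, int hw_max_threads)
{
    if (regs_per_thread <= 0) return hw_max_threads;
    int by_regs = regs_per_block / regs_per_thread;
    return by_regs < hw_max_threads ? by_regs : hw_max_threads;
}

// Prints the kernel's resource usage next to the limits of device 0.
cudaError_t report_limits()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) return err;

    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, my_kernel);
    if (err != cudaSuccess) return err;

    printf("registers per thread:      %d\n", attr.numRegs);
    printf("registers per block limit: %d\n", prop.regsPerBlock);
    printf("static shared memory:      %zu of %zu bytes\n",
           attr.sharedSizeBytes, prop.sharedMemPerBlock);
    printf("max threads per block:     %d (device), %d (this kernel)\n",
           prop.maxThreadsPerBlock, attr.maxThreadsPerBlock);
    printf("register-bound estimate:   %d threads\n",
           max_block_threads(prop.regsPerBlock, attr.numRegs,
                             prop.maxThreadsPerBlock));
    return cudaSuccess;
}
```

Calling `report_limits()` once before the first launch should show whether the block size you are asking for fits. Compiling with `nvcc --ptxas-options=-v` prints the per-thread register and shared memory usage at build time as well.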

If you are using cutil, you can call CUT_CHECK_ERROR after launching the kernel. Otherwise, calling cudaThreadSynchronize, then cudaGetLastError, then cudaGetErrorString will give you any error message resulting from the kernel launch. It is a good idea to do this for every kernel in a project, and to disable the checks when NDEBUG is defined (for example by compiling with -DNDEBUG) so that release builds do not pay for the extra synchronization.
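A minimal sketch of that check without cutil might look like the following. The names `scale`, `kernel_status`, `CHECK_KERNEL` and `launch_scale` are invented for illustration, and the sketch uses cudaDeviceSynchronize, which replaced the now-deprecated cudaThreadSynchronize in later CUDA releases.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to launch.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Waits for the device, then reads the last recorded error. A launch that
// was rejected (for example, too many threads per block) shows up in
// cudaGetLastError; a fault during execution shows up in the synchronize.
cudaError_t kernel_status()
{
    cudaError_t sync_err = cudaDeviceSynchronize();
    cudaError_t last_err = cudaGetLastError();
    return last_err != cudaSuccess ? last_err : sync_err;
}

// Compiles to nothing when NDEBUG is defined, so release builds skip the
// synchronization entirely.
#ifdef NDEBUG
#define CHECK_KERNEL(name) ((void)0)
#else
#define CHECK_KERNEL(name)                                          \
    do {                                                            \
        cudaError_t e_ = kernel_status();                           \
        if (e_ != cudaSuccess)                                      \
            fprintf(stderr, "%s failed: %s\n", (name),              \
                    cudaGetErrorString(e_));                        \
    } while (0)
#endif

// Example launch: d_data is assumed to be a device pointer to n floats.
void launch_scale(float *d_data, int n, int threads_per_block)
{
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    scale<<<blocks, threads_per_block>>>(d_data, n);
    CHECK_KERNEL("scale");
}
```

With a check like this in place, a launch that asks for more threads per block than the device allows should print an error instead of silently returning almost immediately, which is one way a failed launch can look "faster" than a working one.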