Hi,
If you want to measure pure kernel launch time, make sure all of the launched work can fit on the hardware at the same time.
Otherwise some blocks have to wait for hardware to become available, and a longer execution time is expected.
Query the hardware capacity:
./NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery/deviceQuery
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
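The deviceQuery limits above are maximum launch dimensions, not the number of blocks that can run at once. As a rough sketch, the resident-block capacity can be estimated with the occupancy API. The empty `kernel` here is only a placeholder for the kernel being timed, and the block size of 1024 and device 0 are assumptions:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (assumption); substitute the kernel you are timing.
__global__ void kernel(int a, int b, int c) {}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // device 0 assumed
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }

    const int threadsPerBlock = 1024;  // assumed block size
    int blocksPerSM = 0;
    // How many blocks of this kernel can be resident on one SM at this block size.
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, kernel, threadsPerBlock, 0) != cudaSuccess) {
        fprintf(stderr, "occupancy query failed\n");
        return 1;
    }

    // Blocks beyond this count have to wait for a free slot.
    int maxResidentBlocks = blocksPerSM * prop.multiProcessorCount;
    printf("SMs: %d, blocks per SM: %d, resident blocks: %d\n",
           prop.multiProcessorCount, blocksPerSM, maxResidentBlocks);
    return 0;
}
```

If the grid size stays at or below the reported resident-block count, no block should have to queue behind another.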
Synchronizing also adds waiting time, so comment out the cudaThreadSynchronize() calls and keep the grid small:
line21: // cudaThreadSynchronize();
line29: // cudaThreadSynchronize();
line56: kernel<<<63,1024>>>(1,2,3);
count:10 max:0.032000 min:0.024000 ave:0.026400
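For reference, a minimal timing loop along these lines might look like the sketch below. It is not the original test program: the empty `kernel`, the host-side `std::chrono` timer, the warm-up launch, and reporting in milliseconds are all assumptions, and `cudaDeviceSynchronize()` is used in place of the deprecated `cudaThreadSynchronize()`.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (assumption); substitute the kernel you are timing.
__global__ void kernel(int a, int b, int c) {}

int main() {
    // Warm-up launch so one-time initialization cost is not measured.
    kernel<<<1, 1>>>(1, 2, 3);
    cudaDeviceSynchronize();

    const int runs = 10;
    double maxMs = 0.0, minMs = 1e30, sumMs = 0.0;

    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        kernel<<<63, 1024>>>(1, 2, 3);   // launch returns without waiting for the GPU
        auto t1 = std::chrono::steady_clock::now();

        // Drain the GPU outside the timed region so launches do not pile up.
        cudaDeviceSynchronize();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > maxMs) maxMs = ms;
        if (ms < minMs) minMs = ms;
        sumMs += ms;
    }

    printf("count:%d max:%f min:%f ave:%f\n", runs, maxMs, minMs, sumMs / runs);
    return 0;
}
```

Because the synchronize call sits outside the timed region, the loop measures only the host-side cost of issuing the launch, not the kernel's execution.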