Section B.11 (Time Function) of the CUDA C Programming Guide describes clock64().
As I read it, the value returned by clock64() increases steadily with the clock, so we can measure the time a kernel consumes on a particular thread by comparing two clock64() values taken at the head and the tail of the target kernel.
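As a point of reference, here is a minimal sketch of that head/tail measurement inside a single kernel, with no dynamic parallelism involved. The kernel name `time_region`, the helper `measure_elapsed_cycles`, and the dummy workload are all made up for illustration. The guide describes the counter behind clock64() as per-multiprocessor, so the difference is only meaningful when both samples are read on the same multiprocessor.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: times a placeholder loop with clock64().
__global__ void time_region(long long *elapsed, volatile float *sink)
{
    long long start = clock64();              // sample at the head of the region

    float acc = 0.0f;
    for (int i = 0; i < 1000; ++i)            // placeholder workload (assumption)
        acc += sqrtf((float)i);

    bool leader = (threadIdx.x == 0 && blockIdx.x == 0);
    if (leader)
        *sink = acc;                          // volatile store keeps the loop alive

    long long end = clock64();                // sample at the tail of the region

    if (leader)
        *elapsed = end - start;               // cycles of this SM's counter
}

// Host helper: runs the kernel once and returns the cycle count, or -1 on error.
long long measure_elapsed_cycles()
{
    long long *d_elapsed = nullptr;
    float *d_sink = nullptr;
    long long h_elapsed = -1;

    if (cudaMalloc(&d_elapsed, sizeof(long long)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_sink, sizeof(float)) != cudaSuccess) {
        cudaFree(d_elapsed);
        return -1;
    }

    time_region<<<1, 32>>>(d_elapsed, d_sink);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t rc = cudaMemcpy(&h_elapsed, d_elapsed, sizeof(long long),
                                cudaMemcpyDeviceToHost);
    cudaFree(d_elapsed);
    cudaFree(d_sink);
    return (rc == cudaSuccess) ? h_elapsed : -1;
}
```

Calling `measure_elapsed_cycles()` from host code returns a cycle count, not wall-clock time; converting it to seconds would need the SM clock rate, which this sketch does not query.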
I tried to measure the time consumed between two points as follows:

    tv_start = clock64();
        :
    /* A series of kernel invocations using dynamic parallelism */
    kern_args = (void **)cudaGetParameterBuffer(sizeof(void *),
                                                sizeof(void *) * 3);
    kern_args[0] = kgjoin;
    kern_args[1] = kds_src;
    kern_args[2] = kresults_src;
    status = cudaLaunchDevice((void *)gpujoin_exec_outerscan,
                              kern_args, grid_sz, block_sz,
                              sizeof(kern_errorbuf) * block_sz.x,
                              NULL);
    if (status != cudaSuccess)
        return -1;
        :
    /* wait for completion of the dynamic kernel */
    status = cudaDeviceSynchronize();
    if (status != cudaSuccess)
        return -1;
    /* how much time was consumed by the dynamic kernel? */
    tv_end = clock64();
    printf("tv_start=%lu tv_end=%lu\n", tv_start, tv_end);

One would expect tv_start < tv_end.
However, in some trials tv_end came out smaller than tv_start.
What could cause this behavior?
I assumed this thread runs on the same SMX before and after cudaDeviceSynchronize().
Perhaps that assumption does not hold.
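One way to test that assumption is to tag every clock64() sample with the multiprocessor it was read on, and to subtract only samples that share the same tag. The sketch below does this by reading the PTX special register `%smid` through inline assembly. `ClockStamp`, `read_smid`, `take_stamp`, `elapsed_ticks`, `stamp_demo`, and `run_stamp_demo` are hypothetical names, and the demo kernel takes the two stamps back to back; in the code above the second stamp would be taken after cudaDeviceSynchronize() instead.

```cuda
#include <cuda_runtime.h>

// Hypothetical record pairing a clock64() sample with the SM it was read on.
struct ClockStamp {
    unsigned int smid;
    long long    ticks;
};

// Reads the PTX special register %smid (ID of the SM running this thread).
__device__ unsigned int read_smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// The two reads are not atomic, so this is a diagnostic aid, not a guarantee.
__device__ ClockStamp take_stamp()
{
    ClockStamp s;
    s.smid  = read_smid();
    s.ticks = clock64();
    return s;
}

// Writes b - a to *delta and returns true only if both stamps share an SM.
__device__ bool elapsed_ticks(const ClockStamp &a, const ClockStamp &b,
                              long long *delta)
{
    if (a.smid != b.smid)
        return false;                 // counters of different SMs: not comparable
    *delta = b.ticks - a.ticks;
    return true;
}

// result: 1 = same SM (delta is valid), 0 = the SM changed between the stamps.
__global__ void stamp_demo(int *result, long long *delta)
{
    ClockStamp head = take_stamp();
    /* the child launch and cudaDeviceSynchronize() would go here */
    ClockStamp tail = take_stamp();

    if (threadIdx.x == 0 && blockIdx.x == 0) {
        long long d = 0;
        *result = elapsed_ticks(head, tail, &d) ? 1 : 0;
        *delta  = d;
    }
}

// Host helper: returns 1 or 0 as described above, or -1 on a CUDA error.
int run_stamp_demo()
{
    int *d_result = nullptr;
    long long *d_delta = nullptr;
    int h_result = -1;

    if (cudaMalloc(&d_result, sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_delta, sizeof(long long)) != cudaSuccess) {
        cudaFree(d_result);
        return -1;
    }

    stamp_demo<<<1, 32>>>(d_result, d_delta);

    cudaError_t rc = cudaMemcpy(&h_result, d_result, sizeof(int),
                                cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    cudaFree(d_delta);
    return (rc == cudaSuccess) ? h_result : -1;
}
```

If the two stamps around the device-side synchronization report different `smid` values in exactly the trials where tv_end < tv_start, that would support the idea that the parent thread resumed on another multiprocessor, whose counter need not agree with the first one. This is a way to check the hypothesis, not a confirmed explanation.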