clock64() reversed

According to the CUDA C Programming Guide, the B.11. Time Function chapter introduces as follows:

It says, the result of clock64() grows up sequentially according to the clock, we can measure the time consumption by the kernel on a particular thread, if we compares two values returned by clock64() on head and tail of the target kernel.

I tried to measure the time consumption between two points as follows:

tv_start = clock64();
/* A series of kernel invocation using dynamic parallel */
kern_args = (void **)cudaGetParameterBuffer(sizeof(void *),
                                            sizeof(void *) * 3);
kern_args[0] = kgjoin;
kern_args[1] = kds_src;
kern_args[2] = kresults_src;

status = cudaLaunchDevice((void *)gpujoin_exec_outerscan,
                          kern_args, grid_sz, block_sz,
                          sizeof(kern_errorbuf) * block_sz.x,
if (status != cudaSuccess)
    return -1;
/* wait for completion of the dynamic kernel */
status = cudaDeviceSynchronize();
if (status != cudaSuccess)
    return -1;

/* how much time was consumed by the dynamic kernel? */
tv_end = clock64();

printf("tv_start=%lu tv_end=%lu\n", tv_start, tv_end);

People will expect tv_start < tv_end.
However, I observed some of trial returned tv_end larger than tv_start.
What will make this mysterious behavior?

I expect this thread runs on the same SMX before/after the cudaDeviceSynchronize().
Perhaps, we shouldn’t expect it.

You shouldn’t expect it:

In particular, in the case of dynamic parallelism, the SM of the host threadblock may change after a child kernel launch. This is one of the two main use cases at this time for threadblock level pre-emption in CUDA. In fact, with a little thought, you will realize it is necessary. If no threadblock can be pre-empted, then a child kernel may not have the resources to make forward progress, thus resulting in deadlock.

Since the SM can change (implied by smid being volatile), it’s reasonable to assume the clock64() result may be incoherent, when the values are sampled from two different SMs, since clock() and clock64() rely on per-SM counters.

Thanks, it clarified me.

In these situation, I try to use %globaltimer instead. (My friend told me just a moment before.)