Why is clock64 so unreliable?

I am trying to print the time my app takes when using GPU acceleration,
so I added this code.

__global__ void ndCudaBeginFrame(ndCudaSceneInfo& info)
{
	long long coreTicks = clock64();
	info.m_timeSlice = coreTicks;
	printf("t0 = %lld    ", coreTicks);
}

__global__ void ndCudaEndFrame(ndCudaSceneInfo& info, int frameCount)
{
	long long coreTicks = clock64();

	info.m_frameCount = frameCount;
	printf("t1 = %lld   diff= %lld\n", coreTicks, coreTicks - info.m_timeSlice);

	info.m_timeSlice = coreTicks - info.m_timeSlice;
}

Both kernels are launched with 1 block and 1 thread.

To my surprise, this is the printout I get:

t0 = 7172174314773 t1 = 7172203109538 diff= 7172203109538
t0 = 7172225478900 t1 = 7172229255455 diff= 3776555
t0 = 7172250624097 t1 = 7172243861841 diff= -6762256
t0 = 7172275867917 t1 = 7172269105730 diff= -6762187
t0 = 7172302289465 t1 = 7172295526428 diff= -6763037
t0 = 7172327896493 t1 = 7172331884130 diff= 3987637
t0 = 7172353841908 t1 = 7172347073734 diff= -6768174
t0 = 7172379242630 t1 = 7172372480320 diff= -6762310
t0 = 7172404525227 t1 = 7172408521560 diff= 3996333
t0 = 7172430869316 t1 = 7172424117998 diff= -6751318

To me that does not make sense in any context.
Thanks.

clock64 is not synchronized device-wide: it reads a per-SM counter. There is no guarantee that your first kernel and your second kernel run on the same SM, so the difference between the two readings can be meaningless or even negative.
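
One way to check this on your own machine is to print the SM ID next to the counter value. The following is only a minimal sketch: probeKernel, readSmId and probeSmIds are hypothetical names, and it assumes the PTX %smid special register is read through inline assembly. If consecutive launches report different SM IDs, their clock64 values should not be subtracted from each other.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the ID of the SM this thread is currently running on
// (PTX special register %smid, accessed via inline assembly).
__device__ unsigned int readSmId()
{
	unsigned int smid;
	asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
	return smid;
}

// Hypothetical probe kernel: prints the SM ID next to that SM's clock64() value.
__global__ void probeKernel(int launch)
{
	long long ticks = clock64();
	printf("launch %d: sm = %u  clock64 = %lld\n", launch, readSmId(), ticks);
}

// Hypothetical host helper: launches the probe several times with 1 block and
// 1 thread, as in the question, and waits for the device printf output.
cudaError_t probeSmIds(int launches)
{
	for (int i = 0; i < launches; ++i)
	{
		probeKernel<<<1, 1>>>(i);
	}
	return cudaDeviceSynchronize();
}
```

Calling probeSmIds(8) from host code prints one line per launch. Which SM each launch lands on is up to the hardware scheduler, so the output differs from run to run.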

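Try using the PTX %globaltimer special register instead. It is a single timer shared by the whole device rather than a per-SM counter.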
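
A rough sketch of how the two kernels could read that timer is below. It makes several assumptions: the names readGlobalTimer, beginFrame, endFrame and measureFrameNs are made up for illustration, the ndCudaSceneInfo reference is replaced by a plain device buffer, and the register is read with inline PTX. The PTX documentation describes %globaltimer as a 64-bit nanosecond timer intended for NVIDIA tools, with target-specific behavior and update interval, so treat the result as approximate.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the PTX special register %globaltimer, a 64-bit device-wide timer.
// Assumption: values are in nanoseconds; the update interval is target-specific.
__device__ unsigned long long readGlobalTimer()
{
	unsigned long long ns;
	asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(ns));
	return ns;
}

// Store the start timestamp in device memory.
__global__ void beginFrame(unsigned long long* startNs)
{
	*startNs = readGlobalTimer();
}

// Store the elapsed time since the start timestamp.
__global__ void endFrame(const unsigned long long* startNs, unsigned long long* elapsedNs)
{
	*elapsedNs = readGlobalTimer() - *startNs;
}

// Hypothetical host helper: returns the elapsed time in timer units,
// or -1 if a CUDA call fails.
long long measureFrameNs()
{
	unsigned long long* buffer = nullptr;
	if (cudaMalloc(&buffer, 2 * sizeof(unsigned long long)) != cudaSuccess)
	{
		return -1;
	}

	beginFrame<<<1, 1>>>(buffer);
	// The kernels that make up the frame would be launched here, on the same stream.
	endFrame<<<1, 1>>>(buffer, buffer + 1);

	// cudaMemcpy waits for the preceding kernels on the default stream.
	unsigned long long elapsed = 0;
	cudaError_t err = cudaMemcpy(&elapsed, buffer + 1, sizeof(elapsed), cudaMemcpyDeviceToHost);
	cudaFree(buffer);
	return (err == cudaSuccess) ? (long long)elapsed : -1;
}
```

Because both readings come from the same device-wide timer, the difference no longer depends on which SM each kernel was scheduled on.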