Overhead of calling clock()?

Does anyone know how much it costs to use the clock() function in a kernel?
My reading of the PTX parallel thread execution manual (ISA v3.1) suggests
it is very low indeed, essentially a register move, so it should be about as fast
as incrementing an int counter.

Is this true if you call clock() at the C++ CUDA level?

The reason for asking is that I’m using clock() to detect (and then abort)
infinite loops. Essentially I use (clock() < MAXTICS), where MAXTICS is a large positive
integer (e.g. 2000000000). This works, but calling clock() often introduces an
appreciable overhead (it approximately doubles kernel time).

As always, any help or comments are very welcome.
Bill
http://www.cs.ucl.ac.uk/staff/W.Langdon/

Is the loop body large? If it is only a few instructions, it wouldn’t be surprising if introducing a few more instructions doubled the time.

Otherwise, it is not unusual to see that a minor alteration of the source code results in a major effect on the generated code. To see if this is the case, I recommend using cuobjdump.

Right, the clock read is only 1 instruction (more for clock64() - at least on Fermi):

__shared__ uint32_t t;
t = clock();

Output of cuobjdump -sass test.cu.obj:

/*0008*/     /*0x40001c042c000001*/ 	S2R R0, SR_ClockLo;
/*0010*/     /*0x04001e036000c000*/ 	SHL.W R0, R0, 0x1;
/*0018*/     /*0x03f01c85c9000000*/ 	STS [0x0], R0;

If you’re interested, here’s a talk about having a fast clock function in hardware: https://www.youtube.com/watch?v=J9kobkqAicU. At 22:30, Burton laments the lack of a user-readable clock on today’s processors. I’m not sure what he meant, since x86 does have a time stamp counter, but it’s good to see NVIDIA included one :)

Note that in your case you have 2 instructions - the SHL.W is part of clock(). This should be sm20; I think that on sm30 it is indeed 1 instruction.

Dear vvolkov and Uncle Joe,
Thank you for your helpful replies. This is indeed close to what I am seeing
(using nvcc --keep and cuobjdump (5.0 V0.2.1221)). BTW I am compiling with -arch sm_13.

A slightly simplified example is:

for(i = 0;clock() < 2000000000 && i<=10;i++)

It appears that nvcc is treating 2000000000 (2 billion) as a signed long, and so, while
only one PTX instruction is used to read the clock, the compiler generates a total
of six instructions for the first part of the loop conditional (i.e. have we timed out yet):

mov.u32     %r1, %clock;
mov.s32     %r2, %r1;
cvt.s64.s32 %rd1, %r2;
mov.u64     %rd2, 1999999999;
setp.gt.s64 %p1, %rd1, %rd2;
@%p1 bra    $L_0_2050;

I tried casting 2 billion as unsigned int, but it did not make any difference.

Thanks again
Bill

It turns out that nVidia defines clock() to return an item of type clock_t.
However, the GNU C headers (time.h, per C99 section 7.23, and types.h) typedef clock_t as long int,
hence nvcc’s insistence on converting to 64 bits and doing a 64-bit signed comparison.

If clock() is coerced to (unsigned int), the for loop plus timeout becomes

for(i = 0;(unsigned int)clock() < 2000000000 && i<=10;i++)

and the compiler generates one fewer instruction (now 5) for the first part of the
loop conditional (i.e. have we timed out yet):

mov.u32 %r1, %clock;
mov.s32 %r2, %r1;
mov.u32 %r3, 1999999999;
setp.gt.u32 %p1, %r2, %r3;
@%p1 bra $L_0_2050;

(perhaps ptxas can optimise away the mov from %r1 to %r2?)

This also explains some confusion where my code appeared to treat clock_t as signed,
although the nVidia clock() documentation says it is 32-bit unsigned.

There is also the problem of clock() on different SMs drifting apart. See
https://devtalk.nvidia.com/default/topic/537749/1-variation-in-kernel-timings-cuda-5-0-samples-bin-linux-release-clock-geforce-295-gtx/?offset=9#3816541

Apart from the actual instruction count (which should be deduced from cuobjdump -sass output, not by counting intermediate PTX instructions), clock() also has a cost in that the compiler treats it as a barrier. That is, use of clock() may lead to inferior instruction scheduling and potentially even inhibit other optimizations.

(Thanks to Vasily Volkov for pointing this out).

For detection of infinite loops I always use loop counters, both for their deterministic behavior and their efficiency. The downside is that they use an additional register and require more care with nested loops.
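The nested-loop caveat can be sketched as follows; this is a minimal host-side illustration (the function name, budget, and loop shape are illustrative) where a single counter is shared by both nesting levels, so the inner loop cannot quietly consume the whole budget on each outer iteration:

```cpp
#include <cassert>

const int MAXLOOPS = 1000;  // illustrative iteration budget

// Returns the number of iterations executed, or -1 if the budget
// was exhausted (the analogue of aborting a runaway kernel loop).
int guarded_work(int outer, int inner) {
    int budget = 0;  // one counter shared by all loop levels
    int done = 0;
    for (int i = 0; i < outer; ++i) {
        for (int j = 0; j < inner; ++j) {
            if (budget++ >= MAXLOOPS) return -1;  // timed out: abort
            ++done;
        }
    }
    return done;
}
```

Resetting the counter inside the outer loop instead would let a hung inner loop run MAXLOOPS iterations per outer pass, which is why the single shared counter matters.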

Thanks tera.
For completeness, with (unsigned int)clock(), the sm_13 cuobjdump -sass output is:

/*0000*/     /*0x0000000160004780*/ 	S2R R0, SR1;
/*0008*/     /*0x30010001c4100780*/ 	SHL R0, R0, 0x1;
/*0010*/     /*0x308001fd644107c8*/ 	ISET.C0 o [0x7f], R0, c [0x1] [0x0], GT;
/*0018*/     /*0x3000000300000280*/ 	RET C0.NE;

It’s a bit weird now. You do only 10 iterations, so the total overhead of quitting on clock() should be negligible.

I’d check if there are any other differences in the generated code. If yes, you might want to play with where you place clock() - it may affect the code a lot.

Sorry, this is just a tiny cut-down example from a much bigger kernel.
Bill

Dear Tera,
Just to confirm, I have now replaced clock() with an additional counter.
This also has the advantage that the timeout does not need to be much longer on
much slower GPUs, and it’s simpler when dealing with newer GPUs, which appear not
to reset their clocks at the start of a kernel launch.

new code

#define OK() ((OKcounter++ < MAXLOOPS)? 1 : 0)
int OKcounter = 0;
...
  for(i = 0;OK() && Y+i < height;i++) {
    ...
  }

I am currently setting MAXLOOPS to one hundred times the largest value
expected; equally well it could just be the largest value.

Thanks again

Bill
http://www.cs.ucl.ac.uk/staff/W.Langdon/