Overhead of calling clock()?

Does anyone know how much it costs to use the clock() function in a kernel?
My reading of the PTX parallel thread execution manual (ISA v3.1) suggests
it is very low indeed, essentially a register move, so it should be about as fast
as incrementing an int counter.

Is this true if you call clock() at the C++ CUDA level?

The reason for asking is that I’m using clock() to detect (and then abort)
infinite loops. Essentially I use (clock() < MAXTICS), where MAXTICS is a large positive
integer (e.g. 2000000000). This works, but calling clock() often introduces an
appreciable overhead (it approximately doubles kernel time).

As always, any help or comments are very welcome.
Bill
http://www.cs.ucl.ac.uk/staff/W.Langdon/

Is the loop body large? If it is only a few instructions, it wouldn’t be surprising if introducing a few more instructions doubled the time.

Otherwise, it is not unusual to see that a minor alteration of the source code results in a major effect on the generated code. To see if this is the case, I recommend using cuobjdump.

Right, the clock read is only 1 instruction (more for clock64() - at least on Fermi):

__shared__ uint32_t t;
t = clock();

Output of cuobjdump -sass test.cu.obj:

/*0008*/     /*0x40001c042c000001*/ 	S2R R0, SR_ClockLo;
/*0010*/     /*0x04001e036000c000*/ 	SHL.W R0, R0, 0x1;
/*0018*/     /*0x03f01c85c9000000*/ 	STS [0x0], R0;

If you’re interested, here’s a talk about having a fast clock function in hardware: https://www.youtube.com/watch?v=J9kobkqAicU. At 22:30, Burton laments the lack of a user-readable clock on today’s processors. I’m not sure what he meant, since x86 does have a time stamp counter, but it’s good to see NVIDIA included one :)

Note that in your case you have 2 instructions - the SHL.W is part of clock(). This should be sm20; I think that on sm30 it is indeed 1 instruction.

Dear vvolkov and Uncle Joe,
Thank you for your helpful replies. This is indeed close to what I am seeing
(using nvcc --keep and cuobjdump (5.0 V0.2.1221)). BTW I am compiling with -arch sm_13.

A slightly simplified example is:

for(i = 0;clock() < 2000000000 && i<=10;i++)

It appears that nvcc is treating 2000000000 (2 billion) as a signed long, and so, while
only one PTX instruction is used to read the clock, the compiler generates a total
of six instructions for the first part of the loop conditional (i.e. have we timed out yet):

mov.u32     %r1, %clock;
mov.s32     %r2, %r1;
cvt.s64.s32 %rd1, %r2;
mov.u64     %rd2, 1999999999;
setp.gt.s64 %p1, %rd1, %rd2;
@%p1 bra    $L_0_2050;

I tried casting 2 billion as unsigned int, but it did not make any difference.

Thanks again
Bill

It turns out that nVidia defines clock() to return an item of type clock_t.
However, the GNU C headers (time.h, per C99 section 7.23, and types.h) typedef clock_t as long int,
hence nvcc’s insistence on converting to 64 bits and doing a 64-bit signed comparison.

If clock() is coerced to (unsigned int), the for loop plus timeout becomes

for(i = 0;(unsigned int)clock() < 2000000000 && i<=10;i++)

and the compiler generates one fewer instruction (now 5) for the first part of the
loop conditional (i.e. have we timed out yet):

mov.u32 %r1, %clock;
mov.s32 %r2, %r1;
mov.u32 %r3, 1999999999;
setp.gt.u32 %p1, %r2, %r3;
@%p1 bra $L_0_2050;

(perhaps ptxas can optimise away the mov from %r1 to %r2?)

This also explains some confusion where my code appeared to treat clock_t as signed,
although the nVidia clock() documentation says it is 32-bit unsigned.

There is also the problem of clock() on different SMs drifting apart. See
https://devtalk.nvidia.com/default/topic/537749/1-variation-in-kernel-timings-cuda-5-0-samples-bin-linux-release-clock-geforce-295-gtx/?offset=9#3816541

Apart from the actual instruction count (which should be deduced from cuobjdump -sass output, not by counting intermediate PTX instructions), clock() also has a cost in that the compiler treats it as a barrier. That is, use of clock() may lead to inferior instruction scheduling and potentially even inhibit other optimizations.

(Thanks to Vasily Volkov for pointing this out).

For detection of infinite loops I always use loop counters, both for their deterministic behavior and their efficiency. The downside is that they use an additional register and require more care with nested loops.
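The nested-loop caveat can be sketched as follows; this is a minimal host-side illustration (the function name, budget, and loop shape are illustrative) where a single counter is shared by both nesting levels, so the inner loop cannot quietly consume the whole budget on each outer iteration:

```cpp
#include <cassert>

const int MAXLOOPS = 1000;  // illustrative iteration budget

// Returns the number of iterations executed, or -1 if the budget
// was exhausted (the analogue of aborting a runaway kernel loop).
int guarded_work(int outer, int inner) {
    int budget = 0;  // one counter shared by all loop levels
    int done = 0;
    for (int i = 0; i < outer; ++i) {
        for (int j = 0; j < inner; ++j) {
            if (budget++ >= MAXLOOPS) return -1;  // timed out: abort
            ++done;
        }
    }
    return done;
}
```

Resetting the counter inside the outer loop instead would let a hung inner loop run MAXLOOPS iterations per outer pass, which is why the single shared counter matters.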

Thanks tera.
For completeness, with (unsigned int)clock(), the sm_13 cuobjdump -sass output is:

/*0000*/     /*0x0000000160004780*/ 	S2R R0, SR1;
/*0008*/     /*0x30010001c4100780*/ 	SHL R0, R0, 0x1;
/*0010*/     /*0x308001fd644107c8*/ 	ISET.C0 o [0x7f], R0, c [0x1] [0x0], GT;
/*0018*/     /*0x3000000300000280*/ 	RET C0.NE;

It’s a bit weird now. You do only 10 iterations, so the total overhead of quitting on clock() should be negligible.

I’d check if there are any other differences in the generated code. If yes, you might want to play with where you place clock() - it may affect the code a lot.

Sorry, this is just a tiny cut-down example from a much bigger kernel.
Bill

Dear Tera,
Just to confirm, I have now replaced clock() with an additional counter.
This also has the advantage that the timeout does not need to be much longer on
much slower GPUs, and it’s simpler when dealing with newer GPUs, which appear not
to reset their clocks at the start of a kernel launch.

new code

#define OK() ((OKcounter++ < MAXLOOPS)? 1 : 0)
int OKcounter = 0;
...
  for(i = 0;OK() && Y+i < height;i++) {
    ...
  }

I am currently setting MAXLOOPS to one hundred times the largest value
expected; equally well it could just be the largest value.

Thanks again

Bill
http://www.cs.ucl.ac.uk/staff/W.Langdon/