Does anyone know how much it costs to use the clock() function in a kernel?
My reading of the PTX parallel thread execution manual (ISA v3.1) suggests
it is very cheap indeed, essentially a register move, so it should be about as fast
as incrementing an int counter.
Is this still true if you call clock() at the CUDA C++ level?
The reason for asking is that I'm using clock() to detect (and then abort)
infinite loops. Essentially I test (clock() < MAXTICS), where MAXTICS is a large positive
integer (e.g. 2000000000). This works, but calling clock() this often introduces an
appreciable overhead (it approximately doubles the kernel time).
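For reference, the pattern is roughly the following (a minimal sketch; the kernel, its data and the stand-in loop body are placeholders rather than my real code):

__global__ void guarded_kernel(int *out, int n)
{
    const clock_t MAXTICS = 2000000000;               // large positive timeout
    int i;
    // bail out of the loop once clock() exceeds MAXTICS,
    // even if the normal exit test never fires
    for (i = 0; clock() < MAXTICS && i < n; i++) {
        // ... real loop body goes here ...
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = i;   // record how far each thread got
}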
Is the loop body large? If it is only a few instructions, it wouldn't be surprising if adding a few more instructions doubled the time.
Otherwise, it is not unusual to see that a minor alteration of the source code results in a major effect on the generated code. To see if this is the case, I recommend using cuobjdump.
If you're interested, here's a comment about having a fast clock function in hardware: https://www.youtube.com/watch?v=J9kobkqAicU. At 22:30, Burton was lamenting the lack of a user-readable clock on today's processors. I'm not sure what he meant, since x86 does have a time stamp counter, but it's good to see NVIDIA included one :)
Dear vvolkov and Uncle Joe,
Thank you for your helpful replies. This is indeed close to what I am seeing
(using nvcc --keep and cuobjdump 5.0 V0.2.1221). BTW I am compiling with -arch sm_13.
A slightly simplified example is:
for (i = 0; clock() < 2000000000 && i <= 10; i++)
It appears that nvcc is treating 2000000000 (2 billion) as an unsigned long, so whilst
only one PTX instruction is used to read the clock, the compiler generates a total
of six instructions for the first part of the loop conditional (i.e. have we timed out yet).
It turns out that NVIDIA defines clock() to return a value of type clock_t.
However, the GNU C headers time.h (7.23) and types.h typedef clock_t as long int,
hence nvcc's insistence on widening to 64 bits and doing a 64-bit signed comparison.
If clock() is coerced to (unsigned int), that widening and the 64-bit compare go away.
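At the source level the change is just a cast, along these lines (illustrative shape of the fix, not the exact line from my code):

    // the cast keeps the timeout test in 32-bit unsigned arithmetic
    // instead of a widened 64-bit signed comparison
    for (i = 0; (unsigned int)clock() < 2000000000u && i <= 10; i++)
    {
        // ... loop body ...
    }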
Apart from the actual instruction count (which should be deduced from the cuobjdump -sass output, not by counting intermediate PTX instructions), clock() also has a cost in that the compiler treats it as a barrier. That is, use of clock() may lead to inferior instruction scheduling and potentially even inhibit other optimizations.
(Thanks to Vasily Volkov for pointing this out).
For detecting infinite loops I always use loop counters, both for their deterministic behavior and their efficiency. The downside is that they use an additional register and require more care with nested loops.
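A rough sketch of what such a counter-based guard can look like (MAX_ITERS, the state update, and the exit test are purely illustrative, not code from this thread):

__global__ void counter_guarded_kernel(int *out, const unsigned int *seed)
{
    const int MAX_ITERS = 100000000;                 // deterministic timeout, tune per kernel
    int iters = 0;
    unsigned int state = seed[threadIdx.x];
    // the real exit condition is state == 0; the counter aborts a possible infinite loop
    while (state != 0u && iters < MAX_ITERS) {
        state = (state * 1103515245u + 12345u) & 0x7fffffffu;  // stand-in for the real work
        ++iters;                                                // the extra register mentioned above
    }
    out[threadIdx.x] = iters;        // iters == MAX_ITERS signals that the guard fired
}

For nested loops, each level can get its own bound, or a single counter can be carried through all levels and checked in the innermost loop.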
It's a bit odd, though - you do only 10 iterations, so the total overhead of quitting on clock() should be negligible.
I'd check whether there are any other differences in the generated code. If there are, you might want to play with where you place clock() - it may affect the code a lot.
Dear Tera,
Just to confirm, I have now replaced clock() with an additional counter.
This also has the advantage that the timeout does not need to be made much longer on
much slower GPUs, and it's simpler when dealing with newer GPUs, which appear not
to reset their clocks at kernel launch.