I would like to know if there is an easy way to determine how many clock cycles are needed to process my CUDA kernel in one thread, or at least for a part of my kernel. I don't think the Programming Guide mentions the clock cycles of all available instructions. Is there perhaps a special CUDA profiler mode which gives a more detailed report on kernel performance?
The clock() function does return the "real" number of clock cycles, because this is what one thread actually experiences when executed (usually called "wall-clock time"). This includes the latency of memory operations and is thus the number that matters. It will also show you whether your kernel is occupancy-sensitive when you vary the number of threads per block.
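To make that concrete, here is a minimal sketch of per-thread timing with the device-side clock() call. The kernel and buffer names here are made up for illustration; only the clock() calls and the unsigned difference are the point:

```cuda
// Minimal per-thread timing sketch (kernel and buffer names are
// invented for illustration).
__global__ void timedKernel(unsigned int *elapsed)
{
    unsigned int start = (unsigned int)clock();
    // ... the section of the kernel you want to time ...
    unsigned int end = (unsigned int)clock();
    // Unsigned subtraction handles a single counter wrap correctly.
    elapsed[blockIdx.x * blockDim.x + threadIdx.x] = end - start;
}
```

Each thread writes its own cycle count, so after copying the buffer back you can see how the timing varies across threads and blocks.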
I think you are looking for the theoretical number of instruction cycles. To get them you need to count the instructions by hand. See the programming manual for the number of cycles spent on the various instruction types. Note however that this will not reflect the true behavior of your kernel, as it neglects the concurrency situation it will be in at runtime. That is, a kernel with a low instruction cycle count can run much slower than one with a high count that organizes its data differently.
Thought I should add a caveat that should be in the manual under the clock() call: there is 1 hardware clock register per 2 MPs (as one would expect, since it is much cheaper to duplicate a counter than to bus it all over the chip and between chips). These registers do get out of sync. It seems that they often get reset at kernel launch, but that is not certain, and I have not been able to work out how to force a counter sync at launch. I have seen hundreds of consecutive launches without a counter reset.

These counters are attached to the instruction unit and run at 1/2 clock rate on G80 (there is a left shift in device code). Some operations, esp. global memory contention, seem to pause the instruction unit clock? So the results are not all that reliable.

Also, either there is a hardware crosstalk problem, or the chips on my G80 were not tested properly by Nvidia, because the clock registers can get corrupted during kernel execution. Just changing the object code changes the frequency of corruption: one bunch of kernels gets a clock register zap about once per second when the code is compiled with maxrregcount=8, and 10 times as many when compiled with maxrregcount=12.
Eric, can you prove what you are saying here by posting a kernel that exhibits this behavior?
Sure, that goes without saying. As the %clock is a u32 register that wraps silently, you cannot time more than ~3 sec (8800GTX at 1.35GHz). Also note that you cannot just take the difference between two calls without checking for the wrap around.
Peter, I will get around to it, but a bit busy - also not posting source code till Nvidia answer some more questions…
If you take %clock as signed, then subtraction of 2 values always gives you the right answer if the difference is < 2^31 (1.45 sec) - one of the features of 2's complement that is handy in these situations.
That old trick doesn’t work in CUDA, as nvopencc recognizes that the special register %clock is unsigned and therefore inserts a cvt instruction, losing the high bit (it is big endian). You will have to code in PTX if you want to exploit that trick.
The correct thing to do is something along these lines:

uint start = clock();
... compute (less than 3 sec) ...
uint end = clock();
if (end > start)
    time = end - start;
else
    time = end + (0xffffffff - start);
Now I presume mov.s32 is identical to mov.u32, so it looks OK to me (I assume ptxas deletes this instruction). Since signed and unsigned subtraction are bit-identical (ignoring over/underflow) in 2's complement, the result is correct unsigned, provided you know which arg is bigger than the other and the difference is < 2^31. Casting on the host won’t cause any conversions, so all one needs to watch is not to convert to signed on the device. I keep the device side unsigned and copy over to a signed array on the host side for calculation.
Eric
ed: You had me worried that device code might behave differently from host code here - if I store %clock into a signed I get
So the compiler seems to think it is signed anyway.
ed: Thought I should simplify: you can just subtract the results of 2 clock() calls, if you know which is bigger, and get the right answer unsigned up to 2^32 == 2.9 secs on 8800GTX. Peter's code is correct if you change 2^32-1 to 0, and then you obviously don't need the else part. On 64-bit one needs to mask with 2^32-1 or cast to a uint to get the right answer.