I would like to know if there is an easy way to determine how many clock cycles are needed to process my CUDA kernel in one thread, or at least for a part of my kernel. I don't think the Programming Guide mentions the clock cycles of all available instructions. Is there perhaps a special CUDA profiler mode which gives a more detailed report on kernel performance?
The clock() function does return the "real" number of clock cycles, because this is what one thread actually experiences when executed (usually called "wall-clock time"). This includes the latency of memory operations and is thus the number that matters. It will also show you whether your kernel is occupancy-sensitive when you vary the number of threads per block.
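To make that concrete, here is a minimal sketch of per-thread timing with the device-side clock() call. The kernel and buffer names here are made up for illustration; only the clock() calls and the unsigned difference are the point:

```cuda
// Minimal per-thread timing sketch (kernel and buffer names are
// invented for illustration).
__global__ void timedKernel(unsigned int *elapsed)
{
    unsigned int start = (unsigned int)clock();
    // ... the section of the kernel you want to time ...
    unsigned int end = (unsigned int)clock();
    // Unsigned subtraction handles a single counter wrap correctly.
    elapsed[blockIdx.x * blockDim.x + threadIdx.x] = end - start;
}
```

Each thread writes its own cycle count, so after copying the buffer back you can see how the timing varies across threads and blocks.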
I think you are looking for the theoretical number of instruction cycles. To get them you need to count the instructions by hand. See the programming manual for the number of cycles spent on the various instruction types. Note however that this will not reflect the true behavior of your kernel, as it neglects the concurrency situation it will be in at runtime. That is, a kernel with a low instruction cycle count can run much slower than one with a high count that organizes its data differently.
Thought I should add a caveat that should be in the manual under the clock() call: there is 1 hardware clock register per 2 MPs (as one would expect, since it is much cheaper to duplicate a counter than to bus it all over the chip and between chips). These registers do get out of sync. It seems that they often get reset at kernel launch, but that is not certain, and I have not been able to work out how to force a counter sync at launch. I have seen hundreds of consecutive launches without a counter reset.

These counters are attached to the instruction unit and run at 1/2 clock rate on G80 (there is a left shift in device code). Some operations, esp. global memory contention, seem to pause the instruction unit clock? So the results are not all that reliable.

Also, either there is a hardware crosstalk problem, or the chips on my G80 were not tested properly by Nvidia, because the clock registers can get corrupted during kernel execution. Just changing the object code changes the frequency of corruption: one bunch of kernels gets a clock register zap about once per second when the code is compiled with maxrregcount=8, and 10 times as many when compiled with maxrregcount=12.
Eric, can you prove what you are saying here by posting a kernel that exhibits this behavior?
Sure, that goes without saying. As the %clock is a u32 register that wraps silently, you cannot time more than ~3 sec (8800GTX at 1.35GHz). Also note that you cannot just take the difference between two calls without checking for the wrap around.
Peter, I will get around to it, but a bit busy - also not posting source code till Nvidia answer some more questions…
If you take %clock as signed, then subtraction of 2 values always gives you the right answer if the difference is < 2^31 (1.45 sec) - one of the features of 2's complement that is handy in these situations.
That old trick doesn’t work in CUDA, as nvopencc recognizes that the special register %clock is unsigned and therefore inserts a cvt instruction, losing the high bit (it is big endian). You will have to code in PTX if you want to exploit that trick.
The correct thing to do is something along these lines:

uint start = clock();
... compute (less than 3 sec) ...
uint end = clock();
if (end > start)
    time = end - start;
else
    time = end + (0xffffffff - start);
Now I presume mov.s32 is identical to mov.u32, so it looks OK to me (I assume ptxas deletes this instruction). Since signed and unsigned subtraction are bit-identical (ignoring over/underflow) in 2's complement, the result is correct unsigned, provided you know which arg is bigger than the other and the difference is < 2^31. Casting on the host won’t cause any conversions, so all one needs to watch is not to convert to signed on the device. I keep the device side unsigned and copy over to a signed array on the host side for calculation.
Eric
ed: You had me worried that device code might behave differently from host code here - if I store %clock into a signed I get
So the compiler seems to think it is signed anyway.
ed: Thought I should simplify: you can just subtract the results of 2 clock() calls, if you know which is bigger, and get the right answer unsigned up to 2^32 == 2.9 secs on 8800GTX. Peter's code is correct if you change 2^32-1 to 0, and then you obviously don't need the else part. On 64-bit one needs to mask with 2^32-1 or cast to a uint to get the right answer.