G80 - 14 clocks per Instruction?

Hello guys,

I'm trying to find out more about the instruction execution speed of the G80. In my measurements I found that CUDA needs about 14 clocks per instruction, which is really slow and does not match the documentation from NVIDIA. Hopefully someone can find a mistake in my code or approach.

First of all I wrote a simple loop which counts up an int value. This code was compiled with the -O0 option to avoid optimization. The inner loop is still required because even with -O0, ptxas optimizes the loop away and sets the “k” value immediately to 1000000.

Here is the initial C code:

__global__ void k_002(long * value) {

	clock_t beginning = clock();

	int i = 0;
	int j = 0;
	int k = 0;

	for (; i < 1000000; i++) {
		for (; j < 1; j++) {
			k++;
		}
		j = 0;
	}

	clock_t ending = clock();

	value[0] = ending - beginning;
	value[1] = k;
}

The corresponding .ptx code looks like this:

.entry _Z5k_002Pm

{

    .reg .u32 $r1,$r2,$r3,$r4,$r5;

    .reg .u64 $rd1,$rd2,$rd3,$rd4,$rd5;

    .reg .pred $p0,$p1;

    .param .u64 __cudaparm__Z5k_002Pl_value;

    .loc    15      130     0

$LBB1__Z5k_002Pl:

    .loc    17      68      0

    mov.u32         %rv1, %clock;           //  

    mov.s32         $r1, %rv1;              //  

    .loc    15      133     0

    cvt.s64.s32     $rd1, $r1;              //  

    mov.s32         $r2, 0;                 //  

$Lt_11_11:

//<loop> Loop body line 133, nesting depth: 1, iterations: 1000000

    mov.s32         $r3, $r2;               //  

    add.s32         $r4, $r3, 1;            //  

    mov.s32         $r2, $r4;               //  

    mov.s32         $r5, 1000000;           //  

    setp.ne.s32     $p1, $r4, $r5;          //  

    @$p1 bra        $Lt_11_11;              //  

    .loc    17      68      0

    mov.u32         %rv1, %clock;           //  

    mov.s32         $r1, %rv1;              //  

    .loc    15      148     0

    ld.param.u64    $rd2, [__cudaparm__Z5k_002Pl_value];    //  id:29 __cudaparm__Z5k_002Pl_value+0x0

    cvt.s64.s32     $rd3, $r1;              //  

    sub.s64         $rd4, $rd3, $rd1;       //  

    st.global.s64   [$rd2+0], $rd4; //  id:30

    .loc    15      149     0

    cvt.s64.s32     $rd5, $r4;              //  

    st.global.s64   [$rd2+8], $rd5; //  id:31

    exit;                           //  

    } // _Z5k_002Pl

Using decuda I could see what is really executed on my G80 GTS. Here is the code:

000000: 00000001 60004780 mov.b32 $r0, %clock

000008: 30010001 c4100780 shl.u32 $r0, $r0, 0x00000001

000010: 10000005 0403c780 mov.b32 $r1, $r0

000018: 10008009 00000003 mov.b32 $r2, 0x00000000

000020: 301f0205 ec100780 shr.s32 $r1, $r1, 0x0000001f

000028: 20018409 00000003 label0: add.b32 $r2, $r2, 0x00000001

000030: 1000800d 0000f427 mov.b32 $r3, 0x000f4240

000038: 300305fd 6c0147c8 set.ne.s32 $p0|$o127, $r2, $r3

000040: 10005003 00000280 @$p0.ne bra.label label0

000048: 0000000d 60004780 mov.b32 $r3, %clock

000050: 30010611 c4100780 shl.u32 $r4, $r3, 0x00000001

000058: d0010005 0402c780 not.b32 $r1, $r1

000060: 20400801 040007c0 sub.u32 $p0|$r0, $r4, $r0

000068: 301f0815 ec100780 shr.s32 $r5, $r4, 0x0000001f

000070: 30400a05 04004780 addc.u32 $r1, $r5, $r1

000078: 1000c80d 0423c780 mov.b32 $r3, s[0x0010]

000080: d00e0601 a0800780 mov.b64 g[$r3], $r0

000088: 301f0405 ec100780 shr.s32 $r1, $r2, 0x0000001f

000090: 10000401 0403c780 mov.b32 $r0, $r2

000098: 2108e809 00000003 add.b32 $r2, s[0x0010], 0x00000008

0000a0: d00e0401 a0800781 mov.end.b64 g[$r2], $r0

The piece of microcode below executes 1000000 times, as long as $r2 != $r3. $r2 is incremented by the ADD instruction. Then the SET instruction compares $r2 and $r3 using the NE (!=) comparison operator and sets the $p0 and $o127 predicate registers. The $o127 register does not seem to be used. One line below, the BRA instruction sets the program counter back to label0 while $p0 is set, i.e. while the two values are not equal.

[b]

000028: 20018409 00000003 label0: add.b32 $r2, $r2, 0x00000001 // 1 clock

000030: 1000800d 0000f427 mov.b32 $r3, 0x000f4240 // 1 clock

000038: 300305fd 6c0147c8 set.ne.s32 $p0|$o127, $r2, $r3 // 1, 2 or 3 clocks ? set, ne, or ?

000040: 10005003 00000280 @$p0.ne bra.label label0 // 1 clock

[/b]

I would expect the above microcode loop to execute in about 6 clocks per iteration, but I'm not sure here. The whole kernel should then be done after 1000000 x 6 clocks plus a few hundred clocks of other overhead/instructions. The nvclock output says that the SPs are clocked at 1188 MHz.

[b]

  • Shader info –

Clock: 1188.000 MHz

Stream units: 96 (01111110b)

ROP units: 20 (111110b)

– Memory info –

Amount: 640 MB

Type: 320 bit DDR3

Clock: 792.000 Mhz

[/b]

In milliseconds: 1188 MHz / 6,000,000 clocks = 198 kernel runs per second, so 1000 ms / 198 ≈ 5 ms. After about 5 ms this code should be done.

In fact I'm getting these figures from my hardware:

[b]

Module loaded…

cuFuncSetBlockShape done…

cuParamSetv done…

cuParamSetSize done…

Timer Value 74.271004 ms

cuLaunchGrid done…

kernel output ---- > 88000074 clocks , k_control_value = 1000000

Kernel Time Total - 78 ms

Done

[/b]

These values are for 1 thread and 1 block in the grid. With more threads and blocks in the grid, the time and the needed clocks increase only a bit, because of SIMD and G80 scheduling.

That is 74 ms against the 5 ms I expected. The clock count of about 88 million is also about 14 times more than I expected. These 88 million clocks do match the 74 ms: 1188 MHz / 88 million clocks = 13.5 kernel runs per second, and 1000 ms / 13.5 ≈ 74 ms, so the clock value itself is consistent. The question now is: what is the G80 doing all that time? The CUDA documentation says that 1 instruction is executed per clock.

Only global memory access should take a few hundred clocks, and in the loop code only registers are used. Using the above figures it looks like one instruction needs about 14 clocks! (88 million clocks / 6 million instructions ≈ 14)

Maybe someone more experienced can shed some light on this issue.

thanks,

jj

I can’t address your questions specifically, but I can make some general comments.

Periodically, people have been confused in the forums by the apparent poor performance of CUDA in single-block, single-thread microbenchmarks. It's clear that this is a non-linear regime for the hardware, and you can't extrapolate single-thread performance up to real-world problems. A program with many blocks and many threads per block runs more than 128 times faster (assuming your card has 128 stream processors) than the single-thread version.

There are many reasons for this, but I think in your case, you are discovering the pipelining in the stream processors. The warp size on all current cards is 32, despite all current cards having only 8 stream processors per multiprocessor. This is because of the clock rate difference between the instruction decoder and the stream processors, as well as the pipelining inside the stream processors themselves. Standard CPUs are stuck trying to analyze and pipeline a single, linear instruction stream, dealing with the dependencies between instructions through clever tricks, or inserting no-op bubbles into the pipeline.

In the CUDA architecture, the warp being the unit of instruction scheduling gives you inherent parallelism you can exploit in the pipeline. The hardware can take the 32 threads in the warp and fold them up so that each of the 8 stream processors has 4 identical instructions (each for a different thread) pipelined for execution. And, since threads are independent, you know there will be no pipeline hazards to worry about (well, until you queue up the next instruction).

I would suggest you do these tests with a full warp of 32 threads to determine the average instruction rate. With a single thread, I would expect the pipeline to be 75% empty. That doesn’t entirely explain 14 clocks per instruction, but more testing might reveal the rest.
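As a concrete starting point, here is a minimal sketch of such a test (my own variable names, untested). It keeps the nested-loop trick from your original kernel so ptxas does not fold the loop away, and it lets every thread of a 32-thread block record its own clock delta:

__global__ void k_002_warp(long * value) {

	clock_t beginning = clock();

	int j = 0;
	int k = 0;

	for (int i = 0; i < 1000000; i++) {
		for (; j < 1; j++) {
			k++;
		}
		j = 0;
	}

	clock_t ending = clock();

	// one pair of slots per thread
	value[2 * threadIdx.x]     = ending - beginning;
	value[2 * threadIdx.x + 1] = k;
}

Launched as 1 block of 32 threads (with value sized for 64 longs), each thread writes its own clock delta, so you can compare a full warp directly against your single-thread run; if the pipelining effect is real, the 32-thread run should take roughly the same number of clocks as the 1-thread run.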

It is also advisable, when measuring kernel runtime, to launch the kernel twice and to measure only the second run. The first kernel launch has some one-time overhead which might interfere with your timings.
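For example (a minimal runtime-API sketch, not the driver-API code used above; d_value is assumed to be an already allocated device buffer):

cudaEvent_t start, stop;
float ms = 0.0f;

k_002<<<1, 1>>>(d_value);          // warm-up launch, absorbs one-time overhead
cudaThreadSynchronize();

cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
k_002<<<1, 1>>>(d_value);          // measured launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);   // ms now holds the time of the second run only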

Like Seibert wrote, you're only executing 1 thread and 1 block, whereas the CUDA manual states that CUDA will always execute 32 threads in 4 clock cycles. So even when you're running 1 thread, it will use 4 clocks per instruction. I'm pretty sure that running the same program with 32 threads and 1 block will result in the same number of clocks. (Please try it and report the results.)

There's also the matter of dependent instructions: the set instruction uses the $r3 register, which is assigned by the mov instruction right before it. Smart compilers will try to interleave non-dependent instructions, but your main loop is so simple (too simple, really) that there are no independent instructions to find.

It's not unreasonable to expect that this register dependency adds a number of penalty cycles for going through the register file. (In fact, it's even possible that one independent instruction in between is not enough and that the set instruction also stalls on the add instruction!)

And finally there is the branch instruction, which can also lead to penalty cycles.

If you want to check out the theoretical instruction throughput, it would be better to rewrite this program with unrolled loops and a bit more calculations, so that the compiler can do better, non-dependent, scheduling.

Like this:

int i = 0;
int j = 0;
int k = 0;
int l = 0;
int m = 0;

for (; i < 1000000; i++) {
	for (; j < 1; j++) {
		k++;
		l++;
		m++;
		k++;
		l++;
		m++;
		k++;
		l++;
		m++;
		k++;
		l++;
		m++;
		k++;
		l++;
		m++;
	}
	j = 0;
}
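One caveat (my addition, not part of TomV's original suggestion): the l and m results are never written out, so the compiler may remove those chains as dead code and defeat the unrolling. Storing them at the end of the kernel keeps them live, for example:

value[1] = k + l + m;   // hypothetical extra store: keeps the unrolled additions from being eliminated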

hello guys,

thanks for your comments.

@TomV - yesterday I did some measurements with the same kernel code but more threads (512) within the block. I did not have time to fully validate the results, but it looks like 32 instructions per clock are executed, which matches your assumption.

Maybe we are observing two things here. First, the 4 clocks per instruction, because of the 8 SPs per MP and the execution of instructions in warps of 32. And then the “empty” pipelines that Seibert describes. Together that could explain the 14 clocks per instruction.

I will post my results as soon as I have them.

thanks & regards,
jj