Hello guys,
I'm trying to figure out more about the instruction execution speed of the G80. In my measurements
I found that CUDA needs about 14 clocks per instruction, which is really slow and does not
match the documentation from NVIDIA. Hopefully someone can find a mistake in my code or approach.
First of all I wrote a simple loop which counts up an int value. This code was compiled using the -O0 option to avoid optimization. The inner loop is still required because even
with the -O0 option, ptxas optimizes the loop away and sets the "k" value immediately to 1000000.
Here is the initial C code:
__global__ void k_002(long * value) {
    clock_t beginning = clock();
    int i = 0;
    int j = 0;
    int k = 0;
    for (; i < 1000000; i++) {
        for (; j < 1; j++) {
            k++;
        }
        j = 0;
    }
    clock_t ending = clock();
    value[0] = ending - beginning;
    value[1] = k;
}
The corresponding .ptx code looks like this:
.entry _Z5k_002Pm
{
.reg .u32 $r1,$r2,$r3,$r4,$r5;
.reg .u64 $rd1,$rd2,$rd3,$rd4,$rd5;
.reg .pred $p0,$p1;
.param .u64 __cudaparm__Z5k_002Pl_value;
.loc 15 130 0
$LBB1__Z5k_002Pl:
.loc 17 68 0
mov.u32 %rv1, %clock; //
mov.s32 $r1, %rv1; //
.loc 15 133 0
cvt.s64.s32 $rd1, $r1; //
mov.s32 $r2, 0; //
$Lt_11_11:
//<loop> Loop body line 133, nesting depth: 1, iterations: 1000000
mov.s32 $r3, $r2; //
add.s32 $r4, $r3, 1; //
mov.s32 $r2, $r4; //
mov.s32 $r5, 1000000; //
setp.ne.s32 $p1, $r4, $r5; //
@$p1 bra $Lt_11_11; //
.loc 17 68 0
mov.u32 %rv1, %clock; //
mov.s32 $r1, %rv1; //
.loc 15 148 0
ld.param.u64 $rd2, [__cudaparm__Z5k_002Pl_value]; // id:29 __cudaparm__Z5k_002Pl_value+0x0
cvt.s64.s32 $rd3, $r1; //
sub.s64 $rd4, $rd3, $rd1; //
st.global.s64 [$rd2+0], $rd4; // id:30
.loc 15 149 0
cvt.s64.s32 $rd5, $r4; //
st.global.s64 [$rd2+8], $rd5; // id:31
exit; //
} // _Z5k_002Pl
Using decuda I could see what is really executed on my G80 GTS. Here is the code:
000000: 00000001 60004780 mov.b32 $r0, %clock
000008: 30010001 c4100780 shl.u32 $r0, $r0, 0x00000001
000010: 10000005 0403c780 mov.b32 $r1, $r0
000018: 10008009 00000003 mov.b32 $r2, 0x00000000
000020: 301f0205 ec100780 shr.s32 $r1, $r1, 0x0000001f
000028: 20018409 00000003 label0: add.b32 $r2, $r2, 0x00000001
000030: 1000800d 0000f427 mov.b32 $r3, 0x000f4240
000038: 300305fd 6c0147c8 set.ne.s32 $p0|$o127, $r2, $r3
000040: 10005003 00000280 @$p0.ne bra.label label0
000048: 0000000d 60004780 mov.b32 $r3, %clock
000050: 30010611 c4100780 shl.u32 $r4, $r3, 0x00000001
000058: d0010005 0402c780 not.b32 $r1, $r1
000060: 20400801 040007c0 sub.u32 $p0|$r0, $r4, $r0
000068: 301f0815 ec100780 shr.s32 $r5, $r4, 0x0000001f
000070: 30400a05 04004780 addc.u32 $r1, $r5, $r1
000078: 1000c80d 0423c780 mov.b32 $r3, s[0x0010]
000080: d00e0601 a0800780 mov.b64 g[$r3], $r0
000088: 301f0405 ec100780 shr.s32 $r1, $r2, 0x0000001f
000090: 10000401 0403c780 mov.b32 $r0, $r2
000098: 2108e809 00000003 add.b32 $r2, s[0x0010], 0x00000008
0000a0: d00e0401 a0800781 mov.end.b64 g[$r2], $r0
The piece of microcode below executes 1000000 times, as long as
$r2 != $r3. $r2 is incremented by the ADD instruction. Then the SET instruction
compares $r2 and $r3 using the NE (!=) comparison operator and sets the $p0
and $o127 predicate registers. The $o127 register seems not to be used. One line
below, the BRA instruction jumps back to label0 as long as the $p0 predicate
is set, i.e. as long as the two values are still not equal.
[b]
000028: 20018409 00000003 label0: add.b32 $r2, $r2, 0x00000001 // 1 clock
000030: 1000800d 0000f427 mov.b32 $r3, 0x000f4240 // 1 clock
000038: 300305fd 6c0147c8 set.ne.s32 $p0|$o127, $r2, $r3 // 1, 2 or 3 clocks ? set, ne, or ?
000040: 10005003 00000280 @$p0.ne bra.label label0 // 1 clock
[/b]
I would expect the above microcode loop to execute in about 6 clocks per iteration. Not sure here.
The whole code should then be done after 1000000 x 6 clocks, plus a few hundred clocks of other overhead/instructions. The nvclock output says that the SPs are clocking at 1188 MHz.
[b]
-- Shader info --
Clock: 1188.000 MHz
Stream units: 96 (01111110b)
ROP units: 20 (111110b)
-- Memory info --
Amount: 640 MB
Type: 320 bit DDR3
Clock: 792.000 MHz
[/b]
In milliseconds: 6,000,000 clocks / 1188 MHz ≈ 5 ms. So after about 5 ms this code
should be done.
In fact I'm getting these figures from my hardware:
[b]
Module loaded…
cuFuncSetBlockShape done…
cuParamSetv done…
cuParamSetSize done…
Timer Value 74.271004 ms
cuLaunchGrid done…
kernel output ---- > 88000074 clocks , k_control_value = 1000000
Kernel Time Total - 78 ms
Done
[/b]
These values are for 1 thread and 1 block in the grid. With more threads and blocks in the grid the time and the needed clocks increase only a bit, because of SIMD and G80 scheduling.
That is 74 ms against the 5 ms I expected. Also, the clock count of about 88 million
is about 14 times more than I expected. These 88 million clocks match the 74 ms:
88,000,000 clocks / 1188 MHz ≈ 74 ms,
so the clock value is correct. The question is now: what is the G80 doing all that time? The CUDA documentation says that 1 instruction per clock is executed.
Only global memory access takes a few hundred clocks, but in the loop code only registers are used.
Using the above figures it looks like one instruction needs about 14 clocks!
(88 million clocks / 6 million instructions ≈ 14)
Maybe someone more experienced can shed some light on this issue.
thanks,
jj