Hi there
Im working on some model related problems regarding predictability of cuda code. Im trying to measure the expected time used by add calculations in the small kernel:
#define REPETITIONS 1000
#define ADDOP(name,op)
void global name##Kernel(float* a, float* b, float* c, int N,int M ) {
shared int ts;
int t = 1;
for(int i = 1; i < REPETITIONS+1; i++) { \
t = t op i;
} \
ts = t;
}
I then define multiple functions as:
AOP(add10,+i+i+i+i+i+i+i+i+i+)
AOP(add9,+i+i+i+i+i+i+i+i+)
AOP(add8,+i+i+i+i+i+i+i+)
AOP(add7,+i+i+i+i+i+i+)
AOP(add6,+i+i+i+i+i+)
AOP(add5,+i+i+i+i+)
AOP(add4,+i+i+i+)
AOP(add3,+i+i+)
AOP(add2,+i+)
AOP(add,+)
As an example the AOP(add,+) translates to:
…
$Lt_0_7:
// Loop body line 6, nesting depth: 1, iterations: 1000
.loc 15 10 0
add.s32 %r2, %r1, %r2; //
add.s32 %r1, %r1, 1; //
mov.u32 %r3, 1001; //
setp.ne.s32 %p1, %r1, %r3; //
@%p1 bra $Lt_0_7; //
.loc 15 12 0
st.shared.s32 [ts], %r2; // id:17 ts+0x0
.loc 15 13 0
exit;
…
whereas the AOP(add10,+i+i+i+i+i+i+i+i+i+) translates to:
…
$Lt_0_7:
// Loop body line 6, nesting depth: 1, iterations: 1000
.loc 15 10 0
add.s32 %r3, %r1, %r1; //
add.s32 %r4, %r1, %r1; //
add.s32 %r5, %r1, %r1; //
add.s32 %r6, %r1, %r1; //
add.s32 %r7, %r1, %r2; //
add.s32 %r8, %r1, %r7; //
add.s32 %r9, %r6, %r8; //
add.s32 %r10, %r5, %r9; //
add.s32 %r11, %r4, %r10; //
add.s32 %r2, %r3, %r11; //
add.s32 %r1, %r1, 1; //
mov.u32 %r12, 1001; //
setp.ne.s32 %p1, %r1, %r12; //
@%p1 bra $Lt_0_7; //
.loc 15 12 0
st.shared.s32 [ts], %r2; // id:26 ts+0x0
.loc 15 13 0
exit;
…
Thus the first contains 2 add instructions in the loop, whereas the second contains 11.
These 10 functions has then been executed using a block size of 1 to make it easier to reason about timings, however what it can not explain is that every other add operation seems to cause no extra time penalty, -please see the attached figure. I am sure this is some architecture thing that I have missed, so if somebody has the explaination for this behaviour, I will be thankful :)
Kind Regards Toke
test.pdf (13.7 KB)