Add performance

Hi there

I'm working on some model-related problems regarding the predictability of CUDA code. I'm trying to measure the expected time used by add calculations in the small kernel below:

#define REPETITIONS 1000
#define AOP(name,op) \
__global__ void name##Kernel(float* a, float* b, float* c, int N, int M) { \
    __shared__ int ts; \
    int t = 1; \
    for(int i = 1; i < REPETITIONS+1; i++) { \
        t = t op i; \
    } \
    ts = t; \
}
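
To make the expansion concrete, AOP(add,+) should expand to something equivalent to the kernel below (a hand expansion of the macro, not compiler output):

__global__ void addKernel(float* a, float* b, float* c, int N, int M) {
    __shared__ int ts;
    int t = 1;
    for(int i = 1; i < REPETITIONS+1; i++) {
        t = t + i;    // one add per iteration, coming from the "+" op argument
    }
    ts = t;           // written to shared memory so the loop is not optimized away
}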

I then define multiple functions as:
AOP(add10,+i+i+i+i+i+i+i+i+i+)
AOP(add9,+i+i+i+i+i+i+i+i+)
AOP(add8,+i+i+i+i+i+i+i+)
AOP(add7,+i+i+i+i+i+i+)
AOP(add6,+i+i+i+i+i+)
AOP(add5,+i+i+i+i+)
AOP(add4,+i+i+i+)
AOP(add3,+i+i+)
AOP(add2,+i+)
AOP(add,+)

As an example, AOP(add,+) translates to the following PTX:

$Lt_0_7:
// Loop body line 6, nesting depth: 1, iterations: 1000
.loc 15 10 0
add.s32 %r2, %r1, %r2; //
add.s32 %r1, %r1, 1; //
mov.u32 %r3, 1001; //
setp.ne.s32 %p1, %r1, %r3; //
@%p1 bra $Lt_0_7; //
.loc 15 12 0
st.shared.s32 [ts], %r2; // id:17 ts+0x0
.loc 15 13 0
exit;

whereas AOP(add10,+i+i+i+i+i+i+i+i+i+) translates to:

$Lt_0_7:
// Loop body line 6, nesting depth: 1, iterations: 1000
.loc 15 10 0
add.s32 %r3, %r1, %r1; //
add.s32 %r4, %r1, %r1; //
add.s32 %r5, %r1, %r1; //
add.s32 %r6, %r1, %r1; //
add.s32 %r7, %r1, %r2; //
add.s32 %r8, %r1, %r7; //
add.s32 %r9, %r6, %r8; //
add.s32 %r10, %r5, %r9; //
add.s32 %r11, %r4, %r10; //
add.s32 %r2, %r3, %r11; //
add.s32 %r1, %r1, 1; //
mov.u32 %r12, 1001; //
setp.ne.s32 %p1, %r1, %r12; //
@%p1 bra $Lt_0_7; //
.loc 15 12 0
st.shared.s32 [ts], %r2; // id:26 ts+0x0
.loc 15 13 0
exit;

Thus the first contains 2 add instructions in the loop, whereas the second contains 11.

These 10 functions have then been executed using a block size of 1 to make it easier to reason about timings. However, what I cannot explain is that every other add operation seems to cause no extra time penalty; please see the attached figure. I am sure this is some architecture detail that I have missed, so if somebody has the explanation for this behaviour, I will be thankful :)
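
For reference, the launch/timing setup I have in mind is roughly the sketch below; the cudaEvent timing, the dummy device buffers and the printf are placeholders rather than my exact harness (it assumes the AOP macro above is in scope):

// Placeholder harness: launch one generated kernel on a 4x4 grid of
// single-thread blocks and time it with CUDA events.
#include <cstdio>

AOP(add,+)   // generates addKernel

int main() {
    float *a = 0, *b = 0, *c = 0;
    cudaMalloc((void**)&a, 16 * sizeof(float));
    cudaMalloc((void**)&b, 16 * sizeof(float));
    cudaMalloc((void**)&c, 16 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    addKernel<<<dim3(4,4), 1>>>(a, b, c, 16, 16);   // block size of 1
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("addKernel: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}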

Kind Regards Toke
test.pdf (13.7 KB)

Please check with decuda; any even remotely useful optimizer should realize it is calculating the same thing four times here. The fact that it does not makes me conclude that ptxas is probably the one responsible for that.

Benchmarking with code that can be trivially optimized in probably 50 different ways is a bad idea anyway unless you want to find out how good the compiler is (and we all can answer that: good enough, but not very good, because that’s what all compilers are except the horrible ones).

Actually the problem is that the program executes too slowly. I just did another experiment using the following:

clock_t finish;
for(int i = 1; i < REPETITIONS+1; i++) {
    clock_t start = clock();
    t = t + i;
    finish = clock() - start;
}
// store last result
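
Put together, the kernel for this measurement would look something like the sketch below; the clockKernel name and the cycles output pointer are placeholders for however the last result actually gets stored:

// Sketch of the full measurement kernel (names are placeholders).
__global__ void clockKernel(int* cycles) {
    int t = 1;
    clock_t finish = 0;
    for(int i = 1; i < REPETITIONS+1; i++) {
        clock_t start = clock();    // cycle counter before the add
        t = t + i;                  // the measured add operation(s)
        finish = clock() - start;   // cycles spent on this iteration
    }
    cycles[0] = (int)finish;        // store last result
    cycles[1] = t;                  // keep t live so the adds are not optimized away
}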

The CUDA reference states that an addition takes 4 cycles, but I get the following results, where the first column is the number of add operations and the second is the value of finish:

 1   36
 2   58
 3   64
 4   82
 5   86
 6  106
 7  110
 8  130
 9  134
10  154

The relative differences show that inserting an add operation increases the number of cycles by either 4 (as the reference states) or 20 (which I cannot explain).

Kind Regards Toke

How many threads/warps are you running? It won’t give you any useful numbers if you have more than one warp, so much is sure.
Only one warp means memory latency takes a heavy toll on speed though.

I'm using a grid of 4 by 4 with a block size of 1. Thus I have more MPs than elements in my grid, so I assume that I use at most one warp per MP. Is there another way of ensuring a single warp?

Kind Regards Toke

Read after write dependencies are not fully masked until you have at least 6 warps on each MP. This likely is another part of the cause for the timings you have in addition to the other reasons mentioned already.
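
To put a number on that: 6 warps is 192 threads per block, so a launch along the lines of the sketch below (my own illustration, untested against your kernels; the 16-block grid is just an example) would give each MP enough warps to cover the latency:

// Illustration only: 192 threads per block = 6 warps of 32 threads.
void launchWithSixWarps(float* a, float* b, float* c, int N, int M) {
    dim3 block(192);                           // 6 warps per block
    dim3 grid(16);                             // example grid size
    addKernel<<<grid, block>>>(a, b, c, N, M); // addKernel as generated by AOP(add,+)
    cudaThreadSynchronize();                   // wait for completion
}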

This is a bad assumption. You cannot assume anything about scheduling on the GPU.

Yeah, this makes perfect sense, but I still want to be able to explain why my add operations alternate between 4 and 20 cycles. In the 20-cycle case, it seems like the SP stalls due to the pipeline? The add instructions should take 4 cycles, meaning that the pipeline has a depth of ~16?

This I do not believe. Maybe my assumption is wrong, but I do indeed hope that one can make assumptions regarding the scheduling policies.

So far you only have that your measurements alternate between 4 and 20 cycles. I don't know why you are using a 2D grid; while I assume it will be mapped into just a single warp, I do not know for sure.

If you have more than one warp you could measure 20 cycles even though it actually used only 4, just because some other warps were executed in-between.

Given the way the compiler seems to do the calculation (repeatedly calculating (i+i) and summing up the result of that), I’d guess that when you have an odd number of operations you have one more read-after-write dependency.
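
One way to test that guess (a rough sketch of my own, not your code) would be a variant that accumulates into independent registers, so that consecutive adds inside an iteration do not depend on each other:

// Four independent accumulators break the read-after-write chain, so the
// pipeline can issue the adds back-to-back within an iteration.
__global__ void addIndependentKernel(int* out) {
    int t0 = 1, t1 = 1, t2 = 1, t3 = 1;
    for(int i = 1; i < REPETITIONS+1; i++) {
        t0 = t0 + i;
        t1 = t1 + i;
        t2 = t2 + i;
        t3 = t3 + i;
    }
    out[0] = t0 + t1 + t2 + t3;   // keep the results live
}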

You could use decuda and count how many you actually have for different kernels.