Very weird behaviour

Basically, the following code executes f[0][0]++; f[15][2]++; for 10,000,000 iterations. I used events to time it and work out the clocks it takes per iteration. The very weird thing is: when I un-comment region I and remove region II, it takes roughly 20 clocks per iteration, but when I remove region I and put in region II, it jumps to 70+. I'm just confused. How can such a small for loop have such a huge effect on the execution time?

#define org f[0][0]++; f[15][2]++;
#define _10(o) o o o o o o o o o o

__global__ void test1(float *dev)
{
	float f[16][3];

	for(int i=0; i<16; i++)
		for(int j=0; j<3; j++)
			f[i][j] = 0;

	for(int i = 0; i<10000; i++) //10000 * 1000 = 10000000
	{_10(_10(_10(org)))}

	//*dev = f[0][0] + f[15][2]; //region I

	//below is region II
	float result=0;
	for(int i=0; i<16; i++)
		for(int j=0; j<3; j++)
			result += f[i][j];
	*dev = result;
}

Does anyone know the reason? I don't have a working disassembler here, so I have no idea what the compiler is doing…
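Roughly, the host-side timing I use looks like this (a sketch from memory; the variable names and the clockRate-based conversion are mine and may differ from the attached file):

// Host-side timing harness for the kernel above (sketch; names are mine).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void test1(float *dev);   // the kernel from the post above

int main()
{
	float *dev;
	cudaMalloc((void**)&dev, sizeof(float));

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	test1<<<1,1>>>(dev);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);

	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);
	// prop.clockRate is in kHz; 1e7 is the number of org pairs executed per launch
	double clocksPerCycle = (ms * 1e-3) * (prop.clockRate * 1e3) / 1e7;
	printf("%.2f clocks per cycle\n", clocksPerCycle);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(dev);
	return 0;
}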

I have attached the source file in case you want to build it and disassemble it.

My super-slow internet is preventing me from uploading the file, which is actually very small…

I suspect you will be able to glean 90% of the information you need directly from the PTX without disassembling anything.

Hmm… I have the PTX file, but what's the command line to get such information from it?

What happens is that with region I but not region II, only two elements of the array are ever read, so the others are completely optimized away by the compiler. The remaining two are then kept in registers, which makes the loop quite fast.

With region II (but not region I), the calculation is still done in registers, but the result is written back after every iteration.
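In other words, with only region I present the kernel effectively boils down to something like this (just an illustration of what the optimizer is left with, not actual compiler output; it reuses the _10 macro from above):

__global__ void test1_regionI(float *dev)
{
	float a = 0.0f, b = 0.0f;      // f[0][0] and f[15][2]; the other 46 elements are dead
	for (int i = 0; i < 10000; i++)
	{_10(_10(_10(a++; b++;)))}     // same 1000x unrolled body, now pure register-to-register adds
	*dev = a + b;                  // region I
}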

Weird… my nvdis works for this cubin, so yes, it is as you said. Is there a way to make sure all 48 elements stay in registers? I actually set MaxRegCount to 256 and disabled optimization.

Unroll all loops over indices:

#define org f[0][0]++; f[15][2]++;
#define _10(o) o o o o o o o o o o

__global__ void test1(float *dev)
{
        float f[16][3];

#pragma unroll
        for(int i=0; i<16; i++)
#pragma unroll
                for(int j=0; j<3; j++)
                        f[i][j] = 0;

        for(int i = 0; i<10000; i++) //10000 * 1000 = 10000000
        {_10(_10(_10(org)))}

        //*dev = f[0][0] + f[15][2]; //region I

        //below is region II
        float result=0;

#pragma unroll
        for(int i=0; i<16; i++)
#pragma unroll
                for(int j=0; j<3; j++)
                        result += f[i][j];

        *dev = result;
}

It is an ASCII file. Just open it up in an editor and you will see the code generated by the C compiler there. If there are compiler-level optimizations making a difference, they will be visible in the PTX code. There can be more subtle optimizations and instruction substitutions going on at the assembler level that only disassembly will show, but often the PTX is sufficient to see what is happening.
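If you need to (re)generate the PTX, something along these lines usually does it (assuming the source file is called test1.cu):

nvcc -ptx -arch=sm_21 test1.cu -o test1.ptx

or pass -keep to a normal build to keep all the intermediate files (.ptx, .cubin) around.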

Thanks a lot! You guys are really helpful!

Thanks for the enlightenment!

When I only do f[0][0]++; without f[15][2]++; in the loop, the time I got was actually 16.07+, yet when I add f[15][2]++; to the loop it becomes 20+. The launch configuration I used was always <<<1,1>>>. I was actually expecting the time to stay the same after a second operation was added, because I'm on a GTX 460, which has two dispatch units per warp scheduler. The second operation is independent of the first, so at every clock the first dispatch unit should issue an instruction for the first operation while the second dispatch unit issues one for the second operation. That way the two operations should be executing in parallel at all times. So why would the timing increase?

I guess what is happening here is that the instruction cache overflows when you add the second increment to the loop. I don't know how large the instruction cache is on Fermi devices, but for GT200-class devices it was 8 KB. Assuming Nvidia has kept the size the same for Fermi, this would nicely explain your finding: your loop body compiles to 1000 instructions of 8 bytes each, plus the loop instructions themselves, so it just fits into the 8 KB of L2 instruction cache. If you add a second increment to the loop body, it grows to 2000 instructions of 8 bytes each (plus loop overhead), overflowing the 8 KB cache, so all instructions need to be fetched from the L3 cache (probably with the help of some prefetching).

Try reducing the loop body to half the number of instructions and see what results you get.
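For example, something like this halves the loop body while keeping the total number of increments the same (a sketch; _5 is a new helper macro, not from your code):

#define _5(o) o o o o o

	for(int i = 0; i < 20000; i++) // 20000 * 500 = 10000000: same total work, half the code size
	{_10(_10(_5(org)))}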

Compared to, say, a GTX 470, it has twice the dispatch units but only 50% more execution cores, so doubling the amount of work will make the dispatch units compete for execution cores.

Yes, but (if it weren't for instruction cache effects) even with no additional execution cores, doubling the amount of work should not increase execution time: the second set of increments is independent, so it can execute during cycles 3 and 4 of the 16-cycle latency of the first set of increments.

I'm running on a GTX 460 with 2 dispatch units per warp scheduler… so the instructions for the two operations should be issued in parallel, right? They would both run at the same pace.

I haven't had time to decrease the number of instructions yet… my laptop is running a very, very long HDD check… I will test it, hopefully, in 2 days.

Yes, I guess they are issued in parallel. But even if they weren’t, the timing wouldn’t change for the reason I explained.

I tried that and it does not make any difference. Even shorter versions take 19 or 20 clocks to run. So the instruction cache theory is ruled out too.

So you always get 19 or 20 clocks, whether there are one or two increments in the loop body and whether it is unrolled 500 or 1000 times? That seems a lot more plausible, as 16 cycles for the add seemed too low anyway.

One increment - 16 clocks

Two increments - 19 clocks

It says somewhere in the programming guide or the PTX manual that arithmetic latency ranges between 16 and 24 cycles.
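If you want to sanity-check that number directly, a dependent add chain timed with clock() gives a rough figure (just a sketch; the kernel name, the chain length of 256 and the inc parameter are my own choices):

__global__ void addLatency(float *out, int *cycles, float inc)
{
	float x = inc;
	clock_t t0 = clock();
#pragma unroll
	for (int i = 0; i < 256; i++)
		x += inc;                       // each add depends on the previous one
	clock_t t1 = clock();
	*cycles = (int)((t1 - t0) / 256);   // approx. latency per dependent FADD
	*out = x;                           // keep the chain from being optimized away
}

// launched as: addLatency<<<1,1>>>(dev, cyc, 1.0f);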

Something even more strange has appeared… As hamster143 already showed, even when the number of instructions in the loop is reduced to 200, with 2 increments the time still stays above 16, at 19 (this is for sm_20 on my GTX 460; for sm_21 it's 23).

However, when I do 3 increments it jumps to 29 (30 when using sm_21), yet when I go to 4 it drops back to 22 (23 for sm_21).
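(For reference, the 3- and 4-increment variants just extend the org macro; the particular extra elements below are arbitrary picks of mine, any independent elements would do:)

#define org3 f[0][0]++; f[15][2]++; f[7][1]++;
#define org4 f[0][0]++; f[15][2]++; f[7][1]++; f[3][0]++;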

Is it a problem of alignment or something?
Anyway, ptxas is surely screwed up for sm_21… or maybe for sm_20 as well

Look at this: I have hundreds of independent instructions before this point, yet it still puts the add for the loop counter, the set, and the bra so close together. It just assumes that there will be a lot of threads running (I guess this assumption is appropriate most of the time, but the people at NVIDIA certainly could do better).

00000c50: 5000000010211c00 add rn f32 $r4 $r2 $r4
00000c58: 500000000c20dc00 add rn f32 $r3 $r2 $r3
00000c60: 4800c00004001c03 add b32 $r0 $r0 0x1
00000c68: 5000000018219c00 add rn f32 $r6 $r2 $r6
00000c70: 5000000014215c00 add rn f32 $r5 $r2 $r5
00000c78: 5000000010211c00 add rn f32 $r4 $r2 $r4
00000c80: 500000000c20dc00 add rn f32 $r3 $r2 $r3
00000c88: 1a8ec61a8001dc23 set $p0 ne s32 $r0 0x186a0
00000c90: 5000000018219c00 add rn f32 $r6 $r2 $r6
00000c98: 5000000014215c00 add rn f32 $r5 $r2 $r5
00000ca0: 5000000010211c00 add rn f32 $r4 $r2 $r4
00000ca8: 500000000c20dc00 add rn f32 $r3 $r2 $r3
00000cb0: 5000000018219c00 add rn f32 $r6 $r2 $r6
00000cb8: 5000000014215c00 add rn f32 $r5 $r2 $r5
00000cc0: 5000000010211c00 add rn f32 $r4 $r2 $r4
00000cc8: 500000000c20dc00 add rn f32 $r3 $r2 $r3
00000cd0: 4003ffcda00001e7 $p0 bra 0x40