Very weird behaviour

Basically, the following code executes f[0][0]++; f[15][2]++; for 10,000,000 iterations. I used events to time it and work out the clocks it takes per iteration. The very weird thing is: when I un-comment region I and remove region II, it takes roughly 20 clocks per iteration, but when I remove region I and put in region II, it jumps to 70+. I'm just confused. How can such a small for loop have such a huge effect on the execution time?

#define org f[0][0]++; f[15][2]++;
#define _10(o) o o o o o o o o o o

__global__ void test1(float *dev)
{
	float f[16][3];

	for(int i=0; i<16; i++)
		for(int j=0; j<3; j++)
			f[i][j] = 0;

	for(int i = 0; i<10000; i++) //10000 * 1000 = 10000000
	{_10(_10(_10(org)))}

	//*dev = f[0][0] + f[15][2]; //region I

	//below is region II
	float result=0;
	for(int i=0; i<16; i++)
		for(int j=0; j<3; j++)
			result += f[i][j];
	*dev = result;
}

Does anyone know the reason? I don't have a working disassembler here, so I have no idea what the compiler is doing…
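Roughly, the host-side timing I use looks like this (a sketch from memory; the variable names and the clockRate-based conversion are mine and may differ from the attached file):

// Host-side timing harness for the kernel above (sketch; names are mine).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void test1(float *dev);   // the kernel from the post above

int main()
{
	float *dev;
	cudaMalloc((void**)&dev, sizeof(float));

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	test1<<<1,1>>>(dev);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);

	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);
	// prop.clockRate is in kHz; 1e7 is the number of org pairs executed per launch
	double clocksPerCycle = (ms * 1e-3) * (prop.clockRate * 1e3) / 1e7;
	printf("%.2f clocks per cycle\n", clocksPerCycle);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(dev);
	return 0;
}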

I have attached the source file in case you want to build it and disassemble it.

My super-slow internet is preventing me from uploading the file, which is actually very small…

I suspect you will be able to glean 90% of the information you need directly from the PTX without disassembling anything.

Hmm… I have the PTX file, but what's the command line to get such information from it?

What happens is that with region I but not region II, only two elements of the array are ever read, so the others are completely optimized away by the compiler. The remaining two are then kept in registers, which makes the loop quite fast.

With region II (but not region I), the calculation is still done in registers, but the result is written back after every iteration.
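In other words, with only region I present the kernel effectively boils down to something like this (just an illustration of what the optimizer is left with, not actual compiler output; it reuses the _10 macro from above):

__global__ void test1_regionI(float *dev)
{
	float a = 0.0f, b = 0.0f;      // f[0][0] and f[15][2]; the other 46 elements are dead
	for (int i = 0; i < 10000; i++)
	{_10(_10(_10(a++; b++;)))}     // same 1000x unrolled body, now pure register-to-register adds
	*dev = a + b;                  // region I
}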

Weird… my nvdis works for this cubin, so yes, it is as you said. Is there a way to make sure all 48 elements stay in registers? I actually set MaxRegCount to 256 and disabled optimization.

Unroll all loops over indices:

#define org f[0][0]++; f[15][2]++;
#define _10(o) o o o o o o o o o o

__global__ void test1(float *dev)
{
        float f[16][3];

#pragma unroll
        for(int i=0; i<16; i++)
#pragma unroll
                for(int j=0; j<3; j++)
                        f[i][j] = 0;

        for(int i = 0; i<10000; i++) //10000 * 1000 = 10000000
        {_10(_10(_10(org)))}

        //*dev = f[0][0] + f[15][2]; //region I

        //below is region II
        float result=0;

#pragma unroll
        for(int i=0; i<16; i++)
#pragma unroll
                for(int j=0; j<3; j++)
                        result += f[i][j];

        *dev = result;
}

It is an ASCII file. Just open it up in an editor and you will see the code generated by the C compiler there. If there are compiler-level optimizations making a difference, they will be visible in the PTX code. There can be more subtle optimizations and instruction substitutions going on at the assembler level that only disassembly will show, but often the PTX is sufficient to see what is happening.
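If you need to (re)generate the PTX, something along these lines usually does it (assuming the source file is called test1.cu):

nvcc -ptx -arch=sm_21 test1.cu -o test1.ptx

or pass -keep to a normal build to keep all the intermediate files (.ptx, .cubin) around.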

Thanks a lot! You guys are really helpful!

Thanks for the enlightenment!

When I only do f[0][0]++; without f[15][2]++; in the loop, the time I got was actually 16.07+, yet when I add f[15][2]++; to the loop it becomes 20+. The launch configuration I used was always <<<1,1>>>. I was actually expecting the time to stay the same after a second operation was added, because I'm on a GTX 460, which has two dispatch units per warp scheduler. The second operation is independent of the first, so at every clock the first dispatch unit should issue an instruction for the first operation while the second dispatch unit issues one for the second operation. That way the two operations should be executing in parallel at all times. So why would the timing increase?

I guess what is happening here is that the instruction cache overflows when you add the second increment to the loop. I don't know how large the instruction cache is on Fermi devices, but for GT200-class devices it was 8 KB. Assuming Nvidia has kept the size the same for Fermi, this would nicely explain your finding: your loop body compiles to 1000 instructions of 8 bytes each, plus the loop instructions themselves, so it just fits into the 8 KB of L2 instruction cache. If you add a second increment to the loop body, it grows to 2000 instructions of 8 bytes each (plus loop overhead), overflowing the 8 KB cache, so all instructions need to be fetched from the L3 cache (probably with the help of some prefetching).

Try reducing the loop body to half the number of instructions and see what results you get.
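For example, something like this halves the loop body while keeping the total number of increments the same (a sketch; _5 is a new helper macro, not from your code):

#define _5(o) o o o o o

	for(int i = 0; i < 20000; i++) // 20000 * 500 = 10000000: same total work, half the code size
	{_10(_10(_5(org)))}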

Compared to, say, a GTX 470, it has twice the dispatch units but only 50% more execution cores, so doubling the amount of work will make the dispatch units compete for execution cores.

Yes, but (if it weren't for instruction cache effects) even with no additional execution cores, doubling the amount of work should not increase execution time: the second set of increments is independent, so it can execute during cycles 3 and 4 of the 16-cycle latency of the first set of increments.

I'm running on a GTX 460 with 2 dispatch units per warp scheduler… so the instructions for the two operations should be issued in parallel, right? They would both run at the same pace.

I haven't had time to decrease the number of instructions yet… my laptop is running a very, very long HDD check… I will test it, hopefully, in 2 days.

Yes, I guess they are issued in parallel. But even if they weren’t, the timing wouldn’t change for the reason I explained.

I tried that and it does not make any difference. Even shorter versions take 19 or 20 clocks to run. So the instruction cache theory is ruled out too.

So you always get 19 or 20 clocks, whether there are one or two increments in the loop body and whether it is unrolled 500 or 1000 times? That seems a lot more plausible, as 16 cycles for the add seemed too low anyway.

One increment - 16 clocks

Two increments - 19 clocks

It says somewhere in the programming guide or the PTX manual that arithmetic latency ranges between 16 and 24 cycles.
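If you want to sanity-check that number directly, a dependent add chain timed with clock() gives a rough figure (just a sketch; the kernel name, the chain length of 256 and the inc parameter are my own choices):

__global__ void addLatency(float *out, int *cycles, float inc)
{
	float x = inc;
	clock_t t0 = clock();
#pragma unroll
	for (int i = 0; i < 256; i++)
		x += inc;                       // each add depends on the previous one
	clock_t t1 = clock();
	*cycles = (int)((t1 - t0) / 256);   // approx. latency per dependent FADD
	*out = x;                           // keep the chain from being optimized away
}

// launched as: addLatency<<<1,1>>>(dev, cyc, 1.0f);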

Something even more strange has appeared… As hamster143 already showed, even when the number of instructions in the loop is reduced to 200, with 2 increments the time still stays above 16, at 19 (this is for sm_20 on my GTX 460; for sm_21 it's 23).

However, when I do 3 increments it jumps to 29 (30 when using sm_21), yet when I go to 4 it drops back to 22 (23 for sm_21).
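(For reference, the 3- and 4-increment variants just extend the org macro; the particular extra elements below are arbitrary picks of mine, any independent elements would do:)

#define org3 f[0][0]++; f[15][2]++; f[7][1]++;
#define org4 f[0][0]++; f[15][2]++; f[7][1]++; f[3][0]++;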

Is it a problem of alignment or something?
Anyway, ptxas is surely screwed up for sm_21… or maybe for sm_20 as well

Look at this: I have hundreds of independent instructions before this point, yet it still puts the add for the loop counter, the set, and the bra so close together. It just assumes that there will be a lot of threads running (I guess this assumption is appropriate most of the time, but the people at NVIDIA certainly could do better).

00000c50: 5000000010211c00 add rn f32 $r4 $r2 $r4
00000c58: 500000000c20dc00 add rn f32 $r3 $r2 $r3
00000c60: 4800c00004001c03 add b32 $r0 $r0 0x1
00000c68: 5000000018219c00 add rn f32 $r6 $r2 $r6
00000c70: 5000000014215c00 add rn f32 $r5 $r2 $r5
00000c78: 5000000010211c00 add rn f32 $r4 $r2 $r4
00000c80: 500000000c20dc00 add rn f32 $r3 $r2 $r3
00000c88: 1a8ec61a8001dc23 set $p0 ne s32 $r0 0x186a0
00000c90: 5000000018219c00 add rn f32 $r6 $r2 $r6
00000c98: 5000000014215c00 add rn f32 $r5 $r2 $r5
00000ca0: 5000000010211c00 add rn f32 $r4 $r2 $r4
00000ca8: 500000000c20dc00 add rn f32 $r3 $r2 $r3
00000cb0: 5000000018219c00 add rn f32 $r6 $r2 $r6
00000cb8: 5000000014215c00 add rn f32 $r5 $r2 $r5
00000cc0: 5000000010211c00 add rn f32 $r4 $r2 $r4
00000cc8: 500000000c20dc00 add rn f32 $r3 $r2 $r3
00000cd0: 4003ffcda00001e7 $p0 bra 0x40