CUDA pipeline and unrolled loops

Hi all,
recently I've been playing with automatic loop unrolling, like this:

#define UNROLL 8
	float V = A[tid];
	for (j = 0; j < C/UNROLL; j++) {
		for (i = 0; i < UNROLL; i++) {
			V = V*0.1f + 0.9f;
		}
	}
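
For context, the whole kernel looks roughly like this (the kernel name, the N parameter, the bounds check and the final write-back are my reconstruction of the missing context, not the exact code I benchmarked; the write-back is only there so the compiler cannot optimize the loop away):

	__global__ void relax(float *A, int N, int C)
	{
		int tid = blockIdx.x*blockDim.x + threadIdx.x;
		int i, j;
		if (tid >= N) return;
		float V = A[tid];
		for (j = 0; j < C/UNROLL; j++)
			for (i = 0; i < UNROLL; i++)
				V = V*0.1f + 0.9f;	// nvcc unrolls this inner loop
		A[tid] = V;			// write back so the work is not eliminated
	}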

I can see that the inner loop is unrolled as follows:

        mov.f32         %f3, 0f3f666666;        // 0.9
        mov.f32         %f4, 0f3dcccccd;        // 0.1
        mov.f32         %f5, 0f3f666666;        // 0.9
        mov.f32         %f6, 0f3dcccccd;        // 0.1
        mov.f32         %f7, 0f3f666666;        // 0.9
        mov.f32         %f8, 0f3dcccccd;        // 0.1
        mov.f32         %f9, 0f3f666666;        // 0.9
        mov.f32         %f10, 0f3dcccccd;       // 0.1
        mov.f32         %f11, 0f3f666666;       // 0.9
        mov.f32         %f12, 0f3dcccccd;       // 0.1
        mov.f32         %f13, 0f3f666666;       // 0.9
        mov.f32         %f14, 0f3dcccccd;       // 0.1
        mov.f32         %f15, 0f3f666666;       // 0.9
        mov.f32         %f16, 0f3dcccccd;       // 0.1
        mov.f32         %f17, 0f3f666666;       // 0.9
        mov.f32         %f18, 0f3dcccccd;       // 0.1
        mad.f32         %f19, %f18, %f2, %f17;
        mad.f32         %f20, %f16, %f19, %f15;
        mad.f32         %f21, %f14, %f20, %f13;
        mad.f32         %f22, %f12, %f21, %f11;
        mad.f32         %f23, %f10, %f22, %f9;
        mad.f32         %f24, %f8, %f23, %f7;
        mad.f32         %f25, %f6, %f24, %f5;
        mad.f32         %f2, %f4, %f25, %f3;

The run time was 0.15 arb. units. Then I decided to pass the 0.1 into the kernel as an argument dt:

#define UNROLL 8
	float V = A[tid];
	for (j = 0; j < C/UNROLL; j++) {
		for (i = 0; i < UNROLL; i++) {
			V = V*dt + 0.9f;
		}
	}

The loop was still unrolled, but in a different way:

        mad.f32         %f5, %f3, %f2, %f4;
        mov.f32         %f6, 0f3f666666;        // 0.9
        mad.f32         %f7, %f3, %f5, %f6;
        mov.f32         %f8, 0f3f666666;        // 0.9
        mad.f32         %f9, %f3, %f7, %f8;
        mov.f32         %f10, 0f3f666666;       // 0.9
        mad.f32         %f11, %f3, %f9, %f10;
        mov.f32         %f12, 0f3f666666;       // 0.9
        mad.f32         %f13, %f3, %f11, %f12;
        mov.f32         %f14, 0f3f666666;       // 0.9
        mad.f32         %f15, %f3, %f13, %f14;
        mov.f32         %f16, 0f3f666666;       // 0.9
        mad.f32         %f17, %f3, %f15, %f16;
        mov.f32         %f18, 0f3f666666;       // 0.9
        mad.f32         %f2, %f3, %f17, %f18;

Performance dropped to 0.18 arb. units! The only difference is the shared register %f3, which I suppose is the bottleneck for the CUDA core pipeline. So I decided to write something like this:

#define UNROLL 8
	register FL_DBL a = dt;
	register FL_DBL aa[UNROLL];
	for (i = 0; i < UNROLL; i++) aa[i] = 0.0f;	// initialize every element, not just aa[0]
	for (i = 0; i < UNROLL; i++) aa[i] += a;	// each aa[i] now holds dt, hopefully in its own register
	for (j = 0; j < C*D/UNROLL; j++) {
		for (i = 0; i < UNROLL; i++) {
			V = V*aa[i] + 0.9f;
		}
	}

It is different code, but the operations are the same. In the PTX output I see:

        mov.f32         %f35, 0f3f666666;       // 0.9
        mad.f32         %f36, %f34, %f2, %f35;
        mov.f32         %f37, 0f3f666666;       // 0.9
        mad.f32         %f38, %f33, %f36, %f37;
        mov.f32         %f39, 0f3f666666;       // 0.9
        mad.f32         %f40, %f32, %f38, %f39;
        mov.f32         %f41, 0f3f666666;       // 0.9
        mad.f32         %f42, %f31, %f40, %f41;
        mov.f32         %f43, 0f3f666666;       // 0.9
        mad.f32         %f44, %f30, %f42, %f43;
        mov.f32         %f45, 0f3f666666;       // 0.9
        mad.f32         %f46, %f29, %f44, %f45;
        mov.f32         %f47, 0f3f666666;       // 0.9
        mad.f32         %f48, %f28, %f46, %f47;
        mov.f32         %f49, 0f3f666666;       // 0.9
        mad.f32         %f2, %f27, %f48, %f49;

No shared register! And performance increased again, to 0.16 arb. units!
With UNROLL = 32 the difference is even larger! I think it is a weakness of nvcc that for the second code it uses only one register, while for the first code it uses UNROLL registers for what is logically one value. My question is: how can I make nvcc use UNROLL registers for the dt parameter?
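
In other words, what I would like nvcc to emit is roughly this hand-written version (a sketch with UNROLL = 4 for brevity; of course the compiler may merge the copies back into a single register, which is exactly my problem):

	float dt0 = dt, dt1 = dt, dt2 = dt, dt3 = dt;	// one copy of dt per unrolled step
	for (j = 0; j < C/4; j++) {
		V = V*dt0 + 0.9f;
		V = V*dt1 + 0.9f;
		V = V*dt2 + 0.9f;
		V = V*dt3 + 0.9f;
	}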

Actually, I address my question to the nvcc developers. I have also looked through the code that multiplies V*V and the code that multiplies V*0.9f, and found that the first runs slower because of mul.f32 %f2, %f1, %f1, while the second generates a series of mov.f32 %f2, 0f3f666666 (0.9) followed by mul.f32 %f3, %f2, %f1. I suppose the second is faster because the pipeline logic can't efficiently read both operands from a single register, and the second code never reuses the same register. Am I right? If so, why not generate code with different registers (via mov.f32) for the V*V C++ expression as well?
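
For reference, the two loop bodies I compared were simply (a sketch of the two variants):

	for (i = 0; i < UNROLL; i++) V = V*V;		// becomes mul.f32 with the same register for both operands
	for (i = 0; i < UNROLL; i++) V = V*0.9f;	// becomes mov.f32 of the constant plus mul.f32 with two different registers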

One thing I still can't understand: are pipelining and ILP the same thing? Why do I need lots of warps, with far more threads in total than there are cores on my SM? Even if I unroll my loops to 16, 32, 128, etc. operations, I still need TLP, lots of warps, etc.
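
Just to make clear what I mean by ILP within a single thread, here is a sketch with two independent chains (the second index is only a placeholder for "some other element"):

	float V1 = A[tid], V2 = A[tid + N/2];	// N/2 offset is just a placeholder
	for (j = 0; j < C; j++) {
		V1 = V1*0.1f + 0.9f;	// these two mads do not depend on each other,
		V2 = V2*0.1f + 0.9f;	// so they can overlap in the arithmetic pipeline
	}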

You appear to be looking at PTX code which is an intermediate representation and not the code that actually runs on the machine. PTX code generated by nvcc follows SSA conventions, which loosely speaking means that each result written by an instruction is assigned to a new virtual register. PTX code gets compiled down to machine language by PTXAS, which controls register allocation and instruction scheduling. Note that PTXAS is a compiler, not an assembler, so it is able to apply a fairly wide range of code transformations, including loop unrolling.

To get a handle on the performance differences, you would have to look at the machine language (SASS) generated for the various cases. You can do so by disassembling the binary with cuobjdump --dump-sass.
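
For example (file names and the -arch value are placeholders):

	nvcc -arch=sm_20 -cubin -o mykernel.cubin mykernel.cu
	cuobjdump --dump-sass mykernel.cubin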

Thanks a lot, njuffa. But as far as I can see, the unrolling happens at the nvcc stage.
My guess had been that ptxas optimizes the code. Moreover, I can't see ld.global.cg instructions when compiling with -dlcm=cg.
The disassembly is really helpful and the code seems quite clear.

----------------------------------6 hours later

cuobjdump is a really great tool! One thing I can't understand is how its --dump-ptx output, recovered from the binary, contains exactly the same instruction sequence as the .ptx output of nvcc (apart from the references to lines of the .cu source). The SASS output was very helpful for my problem with -dlcm=cg.

By the way, does anybody know a useful link on nvcc intrinsics? I want to issue ld.global.cg (PTX) / LD.E.CG (SASS) for only one transaction, whereas -dlcm=cg makes all global transactions non-caching.
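
One possibility that comes to mind is inline PTX for that single load; a minimal sketch (assuming a 64-bit build, hence the "l" pointer constraint; load_cg is just a name I made up):

	__device__ float load_cg(const float *p)
	{
		float v;
		// .cg caches in L2 only and bypasses L1, for just this one load
		asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
		return v;
	}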