cuda pipeline and unrolled loops

fadeyda · September 9, 2013, 3:59pm

Hi, all
recently i’ve tried to play with auto unroll like that:

#define UNROLL 8
                float V=A[tid];
                for(j=0;j<C/UNROLL;j++){
			for(i=0;i<UNROLL;i++){
				V=V*0.1f+0.9f;
			}
		}

I see that inner loop is unrolled as follows:

mov.f32         %f3, 0f3f666666;        // 0.9
        mov.f32         %f4, 0f3dcccccd;        // 0.1
        mov.f32         %f5, 0f3f666666;        // 0.9
        mov.f32         %f6, 0f3dcccccd;        // 0.1
        mov.f32         %f7, 0f3f666666;        // 0.9
        mov.f32         %f8, 0f3dcccccd;        // 0.1
        mov.f32         %f9, 0f3f666666;        // 0.9
        mov.f32         %f10, 0f3dcccccd;       // 0.1
        mov.f32         %f11, 0f3f666666;       // 0.9
        mov.f32         %f12, 0f3dcccccd;       // 0.1
        mov.f32         %f13, 0f3f666666;       // 0.9
        mov.f32         %f14, 0f3dcccccd;       // 0.1
        mov.f32         %f15, 0f3f666666;       // 0.9
        mov.f32         %f16, 0f3dcccccd;       // 0.1
        mov.f32         %f17, 0f3f666666;       // 0.9
        mov.f32         %f18, 0f3dcccccd;       // 0.1
        mad.f32         %f19, %f18, %f2, %f17;
        mad.f32         %f20, %f16, %f19, %f15;
        mad.f32         %f21, %f14, %f20, %f13;
        mad.f32         %f22, %f12, %f21, %f11;
        mad.f32         %f23, %f10, %f22, %f9;
        mad.f32         %f24, %f8, %f23, %f7;
        mad.f32         %f25, %f6, %f24, %f5;
        mad.f32         %f2, %f4, %f25, %f3;

and run time was 0.15 arb. units. Then i decided to pass 0.1 via argument dt in kernel call:

#define UNROLL 8
                float V=A[tid];
                for(j=0;j<C/UNROLL;j++){
			for(i=0;i<UNROLL;i++){
				V=V*dt+0.9f;
			}
		}

loop was unrolled but in some other way:

mad.f32         %f5, %f3, %f2, %f4;
        mov.f32         %f6, 0f3f666666;        // 0.9
        mad.f32         %f7, %f3, %f5, %f6;
        mov.f32         %f8, 0f3f666666;        // 0.9
        mad.f32         %f9, %f3, %f7, %f8;
        mov.f32         %f10, 0f3f666666;       // 0.9
        mad.f32         %f11, %f3, %f9, %f10;
        mov.f32         %f12, 0f3f666666;       // 0.9
        mad.f32         %f13, %f3, %f11, %f12;
        mov.f32         %f14, 0f3f666666;       // 0.9
        mad.f32         %f15, %f3, %f13, %f14;
        mov.f32         %f16, 0f3f666666;       // 0.9
        mad.f32         %f17, %f3, %f15, %f16;
        mov.f32         %f18, 0f3f666666;       // 0.9
        mad.f32         %f2, %f3, %f17, %f18;

performance dropped to 0.18 arb. units! The only difference is common register %f3 which i suppose is the bottleneck for cuda core pipline. So i decided to write something like:

#define UNROLL 8
		register FL_DBL a=dt;
		register FL_DBL aa[16];
		aa[0]=0.0f;
		for(i=0;i<UNROLL;i++) aa[i]+=a;
		for(j=0;j<C*D/UNROLL;j++){
			for(i=0;i<UNROLL;i++){
				V=V*aa[i]+0.9f;
			}
		}

It is different code but operations are the same. In ptx output I see:

mov.f32         %f35, 0f3f666666;       // 0.9
        mad.f32         %f36, %f34, %f2, %f35;
        mov.f32         %f37, 0f3f666666;       // 0.9
        mad.f32         %f38, %f33, %f36, %f37;
        mov.f32         %f39, 0f3f666666;       // 0.9
        mad.f32         %f40, %f32, %f38, %f39;
        mov.f32         %f41, 0f3f666666;       // 0.9
        mad.f32         %f42, %f31, %f40, %f41;
        mov.f32         %f43, 0f3f666666;       // 0.9
        mad.f32         %f44, %f30, %f42, %f43;
        mov.f32         %f45, 0f3f666666;       // 0.9
        mad.f32         %f46, %f29, %f44, %f45;
        mov.f32         %f47, 0f3f666666;       // 0.9
        mad.f32         %f48, %f28, %f46, %f47;
        mov.f32         %f49, 0f3f666666;       // 0.9
        mad.f32         %f2, %f27, %f48, %f49;

No shared register! And performance increased again to 0.16 arb. units!
If UNROLL = 32 the performance differs greater! I think it is a problem of nvcc that for second code it uses only one register while for first code it uses UNROLL number of registers for one value. My question is how to make nvcc to use UNROLL number of registers for dt parameter?

Actually I address my question to nvcc developers. I’ve also locked through code that multiplies VV and code that multiplies V0.9f and found that first works slower due to mul.f32 %f2,$f1,%f1 while second code generates series of mov.32f %f2,0.9f and mul.32f %f3,%f2,%f1. I suppose second works faster because pipeline logic can’t efficiently load value from the one register. The second code doesn’t not uses same registers. Am I right? If so why not to create the code with different registers in mov.f32 for V*V C++ instruction.

The one thing that I still can’t understand if pipeline and ILP are the same? Why I do I need lots of warps with more overall threads that actually exists on my SM. Even if I will unroll my loops to 16, 32, 128 etc. operations i still need TLP, lots of warps etc.

njuffa · September 9, 2013, 7:27pm

You appear to be looking at PTX code which is an intermediate representation and not the code that actually runs on the machine. PTX code generated by nvcc follows SSA conventions, which loosely speaking means that each result written by an instruction is assigned to a new virtual register. PTX code gets compiled down to machine language by PTXAS, which controls register allocation and instruction scheduling. Note that PTXAS is a compiler, not an assembler, so it is able to apply a fairly wide range of code transformations, including loop unrolling.

To get a handle on the performance differences, you would have to look at the machine language (SASS) generated for the various cases. You can do so by disassembling the binary with cuobjdump --dump-sass.

fadeyda · September 10, 2013, 7:27am

Thanx a lot njuffa. But as i see unrolling happens on nvcc stage.
I had a guess that xptxas optimizes the code. More over a cant see ld.cg.global instructions when compiling with -dlcm=cg.
This disassembly is really helpful and code seems to quite clear.

----------------------------------6 hours later

cuobjdump is a really great thing! One thing I cant understand how it creates --dump-ptx from binary with exactly the same command sequence as in .ptx output of nvcc (except links to lines of .cu source). sass output was very helpful for my problem with --dlcm=cg

btw, does anybody know a useful link to nvcc intrinsics? I want to invoke ld.cg.global(ptx)/LD.E.CG(sass) for only one transaction and -dlcm=cg makes all global transaction to be non-caching.

Topic		Replies	Views
auto unrolled loops and shared registers Teaching & Curriculum Support	0	1034	September 8, 2013
Different output of code when not unrolling loop CUDA Programming and Performance	16	1253	August 22, 2022
Why is this loop using so many registers? CUDA Programming and Performance	7	1185	March 3, 2023
NVCC efficiency just do it... yourself CUDA Programming and Performance	2	1841	August 5, 2008
Cuda compiler loop unroll bug? CUDA Programming and Performance	14	2582	October 25, 2017
CUDA #pragma unroll is slower than unrolled code CUDA Programming and Performance	2	2065	February 7, 2018
Problem with unrolling loops CUDA Programming and Performance	9	8728	November 24, 2011
Register usage problem after static unroll(code generator) CUDA Programming and Performance	6	1506	July 2, 2009
#pragma unroll not behaving as expected CUDA Programming and Performance	1	537	September 10, 2022
#Pragma unroll doesn't work? CUDA Programming and Performance	8	6138	September 19, 2008

cuda pipeline and unrolled loops

Related topics