Redundant MOVs?

rs277 · January 23, 2023, 6:40pm

Thanks to you both for the interest. I agree that "I doubt these extra MOVs have a significant impact on performance"; I was simply curious whether the extra MOVs are actually needed.

The code I posted was an effort to see if I could improve on what the compiler produces with:

    for(i = 0; i < 32; i++){
    	sh_LUT[threadIdx.x][i] = LUT[threadIdx.x];
    }

The compiler's output for that loop was:

        /*00a8*/                   LDG.E.CI.U16 R13, [R4] ;                      /* 0xeed2a0000007040d */
        /*00b0*/                   ISCADD R38, R38, R29, 0x7 ;                   /* 0x5c18038001d72626 */
        /*00b8*/                   SHL R12, R29, 0x6 ;                           /* 0x3848000000671d0c */
                                                                                 /* 0x201fc400fe6007ed */
        /*00c8*/                   ISETP.GT.AND P0, PT, R38, c[0x2][0x0], PT ;   /* 0x4b69038800072607 */
        /*00d0*/                   DEPBAR.LE SB5, 0x1 ;                          /* 0xf0f0000034170000 */
        /*00d8*/                   XMAD.PSL.CLO R8, R8.H1, 0x1, R0.reuse ;       /* 0x3624001000170808 */
                                                                                 /* 0x001f8501fe2807f1 */
        /*00e8*/                   XMAD.PSL.CLO R9, R9.H1, 0x1, R0.reuse ;       /* 0x3624001000170909 */
        /*00f0*/                   XMAD.PSL.CLO R4, R4.H1, 0x1, R0.reuse ;       /* 0x3624001000170404 */
        /*00f8*/                   XMAD.PSL.CLO R5, R5.H1, 0x1, R0 ;             /* 0x3624001000170505 */
                                                                                 /* 0x081fc400fe2807f1 */
        /*0108*/                   XMAD.PSL.CLO R6, R6.H1, 0x1, R0.reuse ;       /* 0x3624001000170606 */
        /*0110*/                   XMAD.PSL.CLO R7, R7.H1, 0x1, R0 ;             /* 0x3624001000170707 */
        /*0118*/                   XMAD.PSL.CLO R8, R0.reuse, 0x1, R8 ;          /* 0x3604041000170008 */
                                                                                 /* 0x0003c440fe0207f4 */
        /*0128*/                   XMAD.PSL.CLO R9, R0.reuse, 0x1, R9 ;          /* 0x3604049000170009 */
        /*0130*/         {         XMAD.PSL.CLO R4, R0.reuse, 0x1, R4 ;          /* 0x3604021000170004 */
        /*0138*/                   STS.64 [R12], R8         }
                                                                                 /* 0xef5d000000070c08 */
                                                                                 /* 0x0007c440fe0007e4 */
        /*0148*/                   XMAD.PSL.CLO R5, R0, 0x1, R5 ;                /* 0x3604029000170005 */
        /*0150*/         {         XMAD.PSL.CLO R6, R0.reuse, 0x1, R6 ;          /* 0x3604031000170006 */
        /*0158*/                   STS.64 [R12+0x10], R4         }
                                                                                 /* 0xef5d000001070c04 */
                                                                                 /* 0x000bc500fe0007f4 */
        /*0168*/                   XMAD.PSL.CLO R7, R0, 0x1, R7 ;                /* 0x3604039000170007 */
        /*0170*/         {         XMAD.PSL.CLO R2, R2.H1, 0x1, R0.reuse ;       /* 0x3624001000170202 */
        /*0178*/                   STS.64 [R12+0x18], R6         }
                                                                                 /* 0xef5d000001870c06 */
                                                                                 /* 0x001f8500fe2807f1 */
        /*0188*/                   XMAD.PSL.CLO R3, R3.H1, 0x1, R0.reuse ;       /* 0x3624001000170303 */
        /*0190*/                   XMAD.PSL.CLO R10, R10.H1, 0x1, R0.reuse ;     /* 0x3624001000170a0a */
        /*0198*/                   XMAD.PSL.CLO R11, R11.H1, 0x1, R0 ;           /* 0x3624001000170b0b */

Also, this sequence is repeated a second time, as I’m loading 256 values using 128 threads.
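
For anyone who wants to reproduce the SASS above, here is a minimal sketch of a complete program around that loop. Everything beyond the loop itself is an assumption rather than the original code: the 16-bit element type (guessed from the `LDG.E.CI.U16` load), the `[256][32]` shape of `sh_LUT`, the strided outer loop that lets 128 threads cover 256 entries, the kernel name `fillSharedLut`, and the mismatch counter, which exists only so the shared-memory stores are not optimized away.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // assumed block size ("128 threads")
constexpr int kEntries = 256;  // assumed table size ("256 values")
constexpr int kCopies  = 32;   // inner dimension taken from the posted loop

// Hypothetical kernel: each thread broadcasts two LUT entries across
// one row of shared memory each, then verifies what it wrote.
__global__ void fillSharedLut(const unsigned short* __restrict__ LUT,
                              unsigned int* mismatches)
{
    __shared__ unsigned short sh_LUT[kEntries][kCopies];  // 16 KB

    // 256 entries / 128 threads: every thread runs this loop body twice.
    for (int row = threadIdx.x; row < kEntries; row += kThreads) {
        unsigned short v = LUT[row];  // one global load per row
        #pragma unroll
        for (int i = 0; i < kCopies; i++) {
            sh_LUT[row][i] = v;
        }
    }
    __syncthreads();

    // Read the table back so the compiler has to keep the stores.
    unsigned int bad = 0;
    for (int row = threadIdx.x; row < kEntries; row += kThreads) {
        for (int i = 0; i < kCopies; i++) {
            bad += (sh_LUT[row][i] != LUT[row]);
        }
    }
    if (bad != 0) {
        atomicAdd(mismatches, bad);
    }
}

int main()
{
    unsigned short hostLut[kEntries];
    for (int i = 0; i < kEntries; i++) {
        hostLut[i] = static_cast<unsigned short>(3 * i + 1);  // arbitrary test data
    }

    unsigned short* devLut = nullptr;
    unsigned int* devBad = nullptr;
    cudaMalloc(&devLut, sizeof(hostLut));
    cudaMalloc(&devBad, sizeof(unsigned int));
    cudaMemcpy(devLut, hostLut, sizeof(hostLut), cudaMemcpyHostToDevice);
    cudaMemset(devBad, 0, sizeof(unsigned int));

    fillSharedLut<<<1, kThreads>>>(devLut, devBad);

    // This copy waits for the kernel and reports any launch or execution error.
    unsigned int bad = 0;
    cudaError_t err = cudaMemcpy(&bad, devBad, sizeof(bad), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("mismatches: %u\n", bad);

    cudaFree(devLut);
    cudaFree(devBad);
    return bad != 0;
}
```

Compiling this with `nvcc` and dumping the result with `cuobjdump -sass` should give a listing that can be compared against the one above, though the exact instruction mix will depend on the target architecture and toolkit version.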

It turns out I misremembered Sylvain Collange’s response here:

“In the power measurements that we did some time ago (paper here), global memory accesses accounted for most of the power consumption. The kind of arithmetic instruction executed did not matter as much as their throughput (so register-register MOVs were burning more power than MADs as they can run on both execution pipelines.)”

Unfortunately, the link to the paper is dead.