Array Addressing

When I create an array:

__device__ __constant__ const uint16_t array[256] = {…};

and I look at the PTX, I see:

.const .align 2 .b8 array[512] = {…};

Is there a way to modify the declaration so that this is generated instead?

.const .align 1 .b16 array[256] = {…};

I ask, as I believe that it will eliminate a SHL R65, R65, 0x1; instruction that occurs each time the array is accessed.

Regards,

Richard

Your code is presumably written so that it uses array indexes, while the machine code operates on addresses. Since each data item here comprises two bytes, an array index needs to be multiplied by two to generate the corresponding address offset. This is the function of the SHL.
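To make the index-to-address scaling concrete, here is a minimal sketch in plain host-side C++ (so it runs without a GPU). The helper names byte_offset and load_by_offset are made up for illustration and are not part of any CUDA API. It only shows that reading element i of a uint16_t array means reading two bytes at byte offset i << 1 from the base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: byte offset of element `index` in a uint16_t array.
// index * sizeof(uint16_t) is the same as index << 1; this is the SHL.
inline std::size_t byte_offset(std::uint32_t index) {
    return static_cast<std::size_t>(index) << 1;
}

// Hypothetical helper: load element `index` using explicit byte addressing,
// which is roughly what the generated machine code has to do.
inline std::uint16_t load_by_offset(const std::uint16_t* base,
                                    std::uint32_t index) {
    const unsigned char* bytes =
        reinterpret_cast<const unsigned char*>(base);
    std::uint16_t value;
    std::memcpy(&value, bytes + byte_offset(index), sizeof value);
    return value;
}
```

For any valid index, load_by_offset(data, i) returns the same value as data[i]. How the PTX declares the array (.b8 with 512 entries or .b16 with 256) does not change the fact that the hardware needs a byte address.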

That said, in optimized release builds this multiplication often disappears, as strength reduction and induction variables are applied to loops that traverse arrays, for example. In other instances, the necessary multiplication can be subsumed into ISCADD or LEA instructions.
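As a rough illustration of strength reduction on a loop that traverses an array, the two functions below compute the same sum. This is a hand-written sketch in host-side C++ with made-up function names, not actual compiler output: the first form conceptually scales the index on every iteration, while the second keeps a pointer as the induction variable and advances it by one element (two bytes), so no per-iteration multiply or shift is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// As typically written in source: each access is conceptually
// base + i * sizeof(uint16_t).
inline std::uint32_t sum_indexed(const std::uint16_t* a, std::size_t n) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i];
    }
    return sum;
}

// Hand-written equivalent of the strength-reduced form: the address itself
// is the induction variable and is bumped by one element per iteration.
inline std::uint32_t sum_strength_reduced(const std::uint16_t* a,
                                          std::size_t n) {
    std::uint32_t sum = 0;
    for (const std::uint16_t* p = a, *end = a + n; p != end; ++p) {
        sum += *p;
    }
    return sum;
}
```

This transformation relies on the index advancing in a regular arithmetic pattern from one iteration to the next.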

Make sure you are looking at the SASS (machine) code generated from a release build, not a debug build. The compiler disables all optimizations for debug builds.

Thanks, it was a bit of a long shot, as this is occurring in a fully optimised release build, so if it could be removed, it would have been.

sm_61, CUDA 10.2

PTX:
      xor.b32 %r1815, %r48, %r1814; //Array index
      shl.b32 %r1816, %r1815, 1;
      add.s32 %r1817, %r5221, %r1816;
      ld.shared.u16 %r1818, [%r1817];

SASS:
        /*3268*/                   LOP.XOR R60, R8, R65 ;                      /* 0x5c4704000417083c */
        /*3270*/                   SHL R60, R60, 0x1 ;                         /* 0x3848000000173c3c */
        /*3278*/                   LDS.U.U16 R60, [R60] ;                      /* 0xef4a100000073c3c */

The inner loop is 7 instructions, of which this SHL (and a SHR) have half the throughput of the others, so removing it would probably be quite beneficial.

What likely inhibits the absorption of the SHL into surrounding computation is the use of the XOR operation on the array index. Use of XOR is unusual in indexing computations. Can you replace that with arithmetic operations?
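One case where XOR can be swapped for an arithmetic operation, sketched in host-side C++ with made-up helper names: when the two operands share no set bits, a ^ b equals a + b. Whether that condition holds for the index computation in this kernel is an assumption that would have to be checked against the real code, and there is no guarantee the compiler then folds the shift into the address computation.

```cpp
#include <cassert>
#include <cstdint>

// True when a and b have no set bits in common. Only in that case does
// a ^ b equal a + b (no carries can occur).
inline bool bits_disjoint(std::uint32_t a, std::uint32_t b) {
    return (a & b) == 0;
}

// Index computed with XOR, as in the kernel discussed above.
inline std::uint32_t index_xor(std::uint32_t base, std::uint32_t k) {
    return base ^ k;
}

// Arithmetic alternative; equivalent to index_xor only when
// bits_disjoint(base, k) is true.
inline std::uint32_t index_add(std::uint32_t base, std::uint32_t k) {
    return base + k;
}
```

For example, a base of 0x40 and an offset of 0x0F have disjoint bits, so both functions return the same index. With 0x41 and 0x01 they overlap and the results differ, so the substitution would be wrong there.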

What does the CUDA profiler say about performance bottlenecks in the kernel at hand? While it is not impossible, I am doubtful the SHL constitutes a bottleneck here.

I’ll see what I can do.

Unfortunately I can’t profile. The app is an old 32-bit one, and the last CUDA version that could profile 32-bit apps can’t generate sm_61 code. My C / CUDA is not yet at a level where I’ve been able to successfully convert it to 64-bit. :(