I’m trying to use float4 vector loads from shared memory to reduce addressing-computation overhead compared to individual loads from four parallel arrays, but I’m running into an issue where the compiler generates superfluous local memory (lmem) accesses.
At the top of an inner loop in one of my kernels, I have a pairwise distance calculation that looks like this:
float4 refatom = refmol.atoms[refatomi];  // one float4 load from shared memory
float temp, Rij2 = 0.0f;
temp = fitmol_x[fitatom] - refatom.x;     // x component of the pairwise distance
Rij2 += temp*temp;
temp = fitmol_y[fitatom] - refatom.y;     // y component
Rij2 += temp*temp;
temp = fitmol_z[fitatom] - refatom.z;     // z component
Rij2 += temp*temp;
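fitmol_x, fitmol_y, and fitmol_z are float* pointers into shared memory, and refmol.atoms is a float4* into shared memory. Roughly, the surrounding declarations look like this (a simplified sketch; the struct layout, buffer names, and array sizes are placeholders, not the real kernel's):

struct RefMol {
    float4 *atoms;                        // points into shared memory
    // ... other members elided
};

__global__ void kernel(/* ... */)         // hypothetical wrapper for the sketch
{
    __shared__ float4 ref_atoms_buf[64];  // placeholder sizes
    __shared__ float  fit_x_buf[64], fit_y_buf[64], fit_z_buf[64];

    RefMol refmol;
    refmol.atoms = ref_atoms_buf;         // shared float4*
    float *fitmol_x = fit_x_buf;          // shared float*
    float *fitmol_y = fit_y_buf;
    float *fitmol_z = fit_z_buf;
    // ... inner loop over refatomi/fitatom as in the snippet above
}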
Compiling this code with nvcc --ptx, or compiling with nvcc --cubin and disassembling with decuda, shows the following instruction sequence at the top of the loop:
000a58: 2000221d 04050780 label16: add.u32 $r7, $r17, $r20
000a60: 00000e05 c0000780 movsh.b32 $ofs1, $r7, 0x00000000
000a68: 1400001d 4400c780 mov.b32 $r7, s[$ofs1+0x0000]
000a70: d000001d 60c00780 mov.u32 l[$r0], $r7
000a78: 1400021d 4400c780 mov.b32 $r7, s[$ofs1+0x0004]
000a80: d000081d 60c00780 mov.u32 l[$r4], $r7
000a88: 1400041d 4400c780 mov.b32 $r7, s[$ofs1+0x0008]
000a90: d000101d 60c00780 mov.u32 l[$r8], $r7
000a98: 1400061d 4400c780 mov.b32 $r7, s[$ofs1+0x000c]
000aa0: 00001e05 c0000780 movsh.b32 $ofs1, $r15, 0x00000000
000aa8: d000181d 60c00780 mov.u32 l[$r12], $r7
000ab0: d0000061 40c00780 mov.u32 $r24, l[$r0]
000ab8: d000085d 40c00780 mov.u32 $r23, l[$r4]
000ac0: d0001059 40c00780 mov.u32 $r22, l[$r8]
000ac8: d000181d 40c00780 mov.u32 $r7, l[$r12]
000ad0: d4010005 20000780 add.b32 $ofs1, $ofs1, 0x00000080
000ad8: 20001e65 0404c780 add.u32 $r25, $r15, $r19
000ae0: 00003209 c0000780 movsh.b32 $ofs2, $r25, 0x00000000
000ae8: b5586060 add.half.rn.f32 $r24, s[$ofs1+0x0000], -$r24
000aec: b957605c add.half.rn.f32 $r23, s[$ofs2+0x0000], -$r23
000af0: c0183061 00000780 mul.rn.f32 $r24, $r24, $r24
000af8: e0172e61 00060780 mad.rn.f32 $r24, $r23, $r23, $r24
000b00: 20001e5d 04048780 add.u32 $r23, $r15, $r18
000b08: 00002e05 c0000780 movsh.b32 $ofs1, $r23, 0x00000000
000b10: 20159e5c add.half.b32 $r23, $r15, $r21
000b14: b5566058 add.half.rn.f32 $r22, s[$ofs1+0x0000], -$r22
000b18: 00002e05 c0000780 movsh.b32 $ofs1, $r23, 0x00000000
000b20: e0162c59 00060780 mad.rn.f32 $r22, $r22, $r22, $r24
In particular, note that the local store instructions at 0xA70, 0xA80, 0xA90, and 0xAA8 are completely redundant: they are followed by local loads at 0xAB0 through 0xAC8 from the exact same addresses. There is clearly no register pressure here, since the destination registers are free at the time of the local loads. The use of local memory is confirmed by -Xptxas -v, which reports 16 bytes of lmem; this holds for -arch sm_11, sm_13, and sm_20 (I did not try other architectures).
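For concreteness, the check looks like this (the file name is a placeholder, the register/smem counts below are stand-ins, and the exact output format varies slightly across toolkit versions; the 16 bytes of lmem is the relevant part):

nvcc -arch=sm_13 -cubin -Xptxas -v kernel.cu
ptxas info    : Used NN registers, 16+0 bytes lmem, NN+NN bytes smem, ...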
Slightly restructuring the CUDA code as follows eliminates the local loads/stores, but at the cost of extra address-calculation overhead (more add/movsh instructions):
float temp, Rij2 = 0.0f;
temp = fitmol_x[fitatom] - refmol.atoms[refatomi].x;  // re-indexes the shared float4 for each component
Rij2 += temp*temp;
temp = fitmol_y[fitatom] - refmol.atoms[refatomi].y;
Rij2 += temp*temp;
temp = fitmol_z[fitatom] - refmol.atoms[refatomi].z;
Rij2 += temp*temp;
(Note that the w component of the reference atom is also used further down in the code, in a section whose disassembly is identical between the two versions.)
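A third variant I considered (a sketch only; I have not checked its disassembly) hoists each component into an independent scalar once, hoping to avoid both the lmem spill and the repeated re-indexing:

// Hypothetical rewrite: load each component once into plain scalars,
// so the compiler never has to keep an addressable float4 around.
float refx = refmol.atoms[refatomi].x;
float refy = refmol.atoms[refatomi].y;
float refz = refmol.atoms[refatomi].z;
float refw = refmol.atoms[refatomi].w;  // w is needed further down

float temp, Rij2 = 0.0f;
temp = fitmol_x[fitatom] - refx;
Rij2 += temp*temp;
temp = fitmol_y[fitatom] - refy;
Rij2 += temp*temp;
temp = fitmol_z[fitatom] - refz;
Rij2 += temp*temp;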
Is there any particular reason the compiler/assembler are generating these (apparently useless) local memory accesses?
(EDIT: changed topic description to highlight that this is a compiler bug)