nvcc/ptxas generates unnecessary lmem loads/stores: bug in nvcc alias analysis/PRE stages

I’m trying to use float4 vector loads from shared memory to reduce the addressing-computation overhead relative to individual loads from 4 parallel arrays, but I am running into issues where the compiler generates superfluous lmem accesses.

At the top of an inner loop in one of my kernels, I have a pairwise distance calculation that looks like this:

float4 refatom = refmol.atoms[refatomi];

float temp,Rij2=0.0f;

temp = fitmol_x[fitatom] - refatom.x;

Rij2 += temp*temp;

temp = fitmol_y[fitatom] - refatom.y;

Rij2 += temp*temp;

temp = fitmol_z[fitatom] - refatom.z;

Rij2 += temp*temp;

fitmol_{xyz} are declared as shared float*, and refmol.atoms is a shared float4*. Compiling this code with nvcc --ptx, or compiling with nvcc --cubin and disassembling with decuda, shows the following instruction sequence at the top of the loop:

000a58: 2000221d 04050780 label16: add.u32 $r7, $r17, $r20

000a60: 00000e05 c0000780 movsh.b32 $ofs1, $r7, 0x00000000

000a68: 1400001d 4400c780 mov.b32 $r7, s[$ofs1+0x0000]

000a70: d000001d 60c00780 mov.u32 l[$r0], $r7

000a78: 1400021d 4400c780 mov.b32 $r7, s[$ofs1+0x0004]

000a80: d000081d 60c00780 mov.u32 l[$r4], $r7

000a88: 1400041d 4400c780 mov.b32 $r7, s[$ofs1+0x0008]

000a90: d000101d 60c00780 mov.u32 l[$r8], $r7

000a98: 1400061d 4400c780 mov.b32 $r7, s[$ofs1+0x000c]

000aa0: 00001e05 c0000780 movsh.b32 $ofs1, $r15, 0x00000000

000aa8: d000181d 60c00780 mov.u32 l[$r12], $r7

000ab0: d0000061 40c00780 mov.u32 $r24, l[$r0]

000ab8: d000085d 40c00780 mov.u32 $r23, l[$r4]

000ac0: d0001059 40c00780 mov.u32 $r22, l[$r8]

000ac8: d000181d 40c00780 mov.u32 $r7, l[$r12]

000ad0: d4010005 20000780 add.b32 $ofs1, $ofs1, 0x00000080

000ad8: 20001e65 0404c780 add.u32 $r25, $r15, $r19

000ae0: 00003209 c0000780 movsh.b32 $ofs2, $r25, 0x00000000

000ae8: b5586060		  add.half.rn.f32 $r24, s[$ofs1+0x0000], -$r24

000aec: b957605c		  add.half.rn.f32 $r23, s[$ofs2+0x0000], -$r23

000af0: c0183061 00000780 mul.rn.f32 $r24, $r24, $r24

000af8: e0172e61 00060780 mad.rn.f32 $r24, $r23, $r23, $r24

000b00: 20001e5d 04048780 add.u32 $r23, $r15, $r18

000b08: 00002e05 c0000780 movsh.b32 $ofs1, $r23, 0x00000000

000b10: 20159e5c		  add.half.b32 $r23, $r15, $r21

000b14: b5566058		  add.half.rn.f32 $r22, s[$ofs1+0x0000], -$r22

000b18: 00002e05 c0000780 movsh.b32 $ofs1, $r23, 0x00000000

000b20: e0162c59 00060780 mad.rn.f32 $r22, $r22, $r22, $r24

In particular, note that the local store instructions from 0xA68 to 0xA98 are completely redundant, since they are followed by local loads at 0xAB0 to 0xAC8 from the exact same addresses. There is clearly no register-pressure problem here, as the registers are available for the local loads. The use of local memory is confirmed by -Xptxas -v, which shows 16 bytes of lmem usage; this holds for -arch sm_11, sm_13, and sm_20 (I did not try others).

Slightly restructuring the CUDA code to the following eliminates the local loads/stores, but at the cost of additional addressing calculation overhead (more add/movsh):

float temp,Rij2=0.0f;

temp = fitmol_x[fitatom] - refmol.atoms[refatomi].x;

Rij2 += temp*temp;

temp = fitmol_y[fitatom] - refmol.atoms[refatomi].y;

Rij2 += temp*temp;

temp = fitmol_z[fitatom] - refmol.atoms[refatomi].z;

Rij2 += temp*temp;

(Note that refatom.w is used further down in the code, in a section with identical disassembly between the versions.)
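For context, here is a hypothetical standalone kernel that captures the computation above; the names, the MAXATOMS bound, the staging loop, and the use of statically sized shared arrays are all my assumptions for illustration (the original allocates its shared arrays dynamically via casts from extern __shared__, which is where the trouble appears):

```cuda
// Hypothetical repro sketch; names and sizes are assumptions,
// not taken from the original kernel.
#define MAXATOMS 64

__global__ void pairwiseDist(const float* gx, const float* gy, const float* gz,
                             const float4* gref, int nfit, int nref, float* out)
{
    __shared__ float fitmol_x[MAXATOMS], fitmol_y[MAXATOMS], fitmol_z[MAXATOMS];
    __shared__ float4 refatoms[MAXATOMS];

    // Stage atom data into shared memory.
    for (int i = threadIdx.x; i < nfit; i += blockDim.x) {
        fitmol_x[i] = gx[i];
        fitmol_y[i] = gy[i];
        fitmol_z[i] = gz[i];
    }
    for (int i = threadIdx.x; i < nref; i += blockDim.x)
        refatoms[i] = gref[i];
    __syncthreads();

    float acc = 0.0f;
    for (int refatomi = 0; refatomi < nref; refatomi++) {
        for (int fitatom = threadIdx.x; fitatom < nfit; fitatom += blockDim.x) {
            // The aggregate float4 copy is what produced the lmem
            // round-trip in the original (dynamically allocated) version.
            float4 refatom = refatoms[refatomi];
            float temp, Rij2 = 0.0f;
            temp = fitmol_x[fitatom] - refatom.x; Rij2 += temp * temp;
            temp = fitmol_y[fitatom] - refatom.y; Rij2 += temp * temp;
            temp = fitmol_z[fitatom] - refatom.z; Rij2 += temp * temp;
            acc += Rij2 * refatom.w;  // keep .w live, as in the original
        }
    }
    out[threadIdx.x] = acc;
}
```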

Is there any particular reason the compiler/assembler are generating these (apparently useless) local memory accesses?

(EDIT: changed topic description to highlight that this is a compiler bug)

Just out of curiosity, does it help to declare

volatile float4 refatom = refmol.atoms[refatomi];

?

Nope. That makes it worse, actually:

000ab0: d0000069 40c00780 mov.u32 $r26, l[$r0]

000ab8: d0000831 40c00780 mov.u32 $r12, l[$r4]

000ac0: d0001061 40c00780 mov.u32 $r24, l[$r8]

000ac8: d0001859 40c00780 mov.u32 $r22, l[$r12]

000ad0: d000185d 40c00780 mov.u32 $r23, l[$r12]

Note that l[$r12] is loaded twice in a row. Changing “volatile” to “const” doesn’t help either.

PS - This behavior occurs under both nvcc 2.3 and nvcc 3.0.

refmol.atoms was originally allocated using a pointer offset into dynamic shared memory:

extern __shared__ float shmem[];

// ... Allocate 4 fitmol float arrays of length fitmol.natoms ...

refmol.atoms = (float4*)(shmem + 4*fitmol.natoms);

To see whether this is a problem with the compiler inferring misalignment (even though alignment should not be an issue in shared memory), I tried aligning refmol.atoms to the start of the dynamic shared region, first with a float declaration and then with a float4 declaration:

extern __shared__ float shmem[];

refmol.atoms = (float4*)(shmem);

// ... Allocate 4 fitmol float arrays of length fitmol.natoms ...

extern __shared__ float4 shmem[];

refmol.atoms = shmem;

float* fshmem = (float*)(shmem+refmol.natoms);

// ... Allocate 4 fitmol float arrays of length fitmol.natoms ...

Both of these options also produce the same pattern of local loads and stores, so it doesn’t seem to be an alignment issue.

I think this breaks the compiler’s alias analysis, so it now has to reload the variable after each write to shared memory. You could check whether the redundant loads go away if you use a static allocation. However, I don’t know of a way to allocate a variable-size object in shared memory that would still allow alias analysis to work.
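A static allocation along the lines suggested here might look like the following sketch; the MAXREF/MAXFIT bounds and the array names are assumptions, since the original sizes these arrays at launch time via dynamic shared memory:

```cuda
// Sketch of a statically sized shared allocation (hypothetical bounds).
// Each array is a distinct named object with no pointer casts, so the
// alias analyzer can prove the float4 array never overlaps the float arrays.
#define MAXREF 64
#define MAXFIT 64

__shared__ float4 refatoms_s[MAXREF];
__shared__ float  fitmol_x_s[MAXFIT];
__shared__ float  fitmol_y_s[MAXFIT];
__shared__ float  fitmol_z_s[MAXFIT];
```

The trade-off is exactly the one noted above: the sizes must be compile-time constants, so this does not help when the atom counts are only known at launch.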

Fair enough, but it’s not clear to me why alias analysis should trigger that code sequence. Why an intermediate dump to local memory?

That’s right.

I’ve once seen gcc do the same thing when using vectors that won’t fit into one SSE register. Maybe it helps to declare four floats instead of a float4? I do agree this is annoying, though.
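The four-scalar variant suggested here would look roughly like this sketch, reusing the names from the original snippet:

```cuda
// Sketch: read the components into four scalars instead of copying a
// float4 aggregate, so no 16-byte temporary has to be materialized.
float rx = refmol.atoms[refatomi].x;
float ry = refmol.atoms[refatomi].y;
float rz = refmol.atoms[refatomi].z;
float rw = refmol.atoms[refatomi].w;  // .w is used further down

float temp, Rij2 = 0.0f;
temp = fitmol_x[fitatom] - rx; Rij2 += temp * temp;
temp = fitmol_y[fitatom] - ry; Rij2 += temp * temp;
temp = fitmol_z[fitatom] - rz; Rij2 += temp * temp;
```

Note that this may forfeit the single vectorized 16-byte shared load the float4 copy was meant to obtain, unless ptxas re-merges the component reads.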

Well, it’s still not clear to me why this works, but it worked. I added the following definition:

float4* __restrict__ refatoms = refmol.atoms;

// ...

float4 refatom = refatoms[refatomi];

And the lmem usage is gone. However, I have to be judicious in my use of restrict. Certain combinations of restrict-tagged pointers (specifically, transform together with any fitmol_XXX pointer) cause compiler crashes:

__device__ float devInternalOverlapVolume(float* __restrict__ fitmol_x,float* fitmol_y,float* fitmol_z,float* fitmol_a,uint ourfitcount,

										  float4_mol& refmol,

										  float* __restrict__ transform,float* shmem_temp,clock_t* shTimers)

Leads to the following crash under both nvcc 2.3 and nvcc 3.0:

Signal: Segmentation fault in Global Optimization -- LPRE: Var phi placement phase.

<input>(0): Error: Signal Segmentation fault in phase Global Optimization -- LPRE: Var phi placement -- processing aborted

*** Internal stack backtrace:

	/usr/local/cuda/open64/lib//be [0x6d1c6f]

	/usr/local/cuda/open64/lib//be [0x6d28b9]

	/usr/local/cuda/open64/lib//be [0x6d200d]

	/usr/local/cuda/open64/lib//be [0x6d3256]

	/lib/libc.so.6 [0x2b0b3e959040]

	/usr/local/cuda/open64/lib//be [0x4f1f2a]

	/usr/local/cuda/open64/lib//be [0x537e6b]

	/usr/local/cuda/open64/lib//be [0x538798]

	/usr/local/cuda/open64/lib//be [0x525397]

	/usr/local/cuda/open64/lib//be [0x5259d2]

	/usr/local/cuda/open64/lib//be [0x422f86]

	/usr/local/cuda/open64/lib//be [0x47a4dd]

	/usr/local/cuda/open64/lib//be [0x4044d2]

	/usr/local/cuda/open64/lib//be [0x40515e]

	/usr/local/cuda/open64/lib//be [0x4061f1]

	/usr/local/cuda/open64/lib//be [0x40752d]

	/lib/libc.so.6(__libc_start_main+0xe6) [0x2b0b3e9445a6]

	/usr/local/cuda/open64/lib//be [0x4038da]

nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//be died due to signal 4

Hey, great find! So it looks like it’s connected to alias analysis, even though that doesn’t seem to make sense. Maybe the crashes also indicate something’s not quite right in there.
I should have thought of restrict: instead of avoiding casts to help the compiler, just tell it that the pointer does not alias.

Bump.

There’s clearly a bug in the compiler’s alias analysis or PRE routines (it shouldn’t be using lmem and it certainly shouldn’t crash!). I’m having a hard time coming up with a properly minimal test case; is there anything else anyone at NV would like from me to help debug this issue?

I don’t have an answer to your question, but I wanted to ask: how do you disassemble using decuda?

Compile your kernel to a .cubin and run decuda on that file. I haven’t had any success doing this with CUDA 3.x: decuda can’t handle the new ELF format, and I haven’t been able to get ptxas to emit the old format. For disassembly I’ve had to compile with CUDA 2.3.

You can force the compiler to compile to the old cubin format; see

http://wiki.github.com/laanwj/decuda/cuda-30-cubin-format

I’m working on supporting the ELF-based format; I’ve already partially reverse-engineered it, but due to the lack of good ELF-parsing libraries for Python, the implementation is proving to be quite a hassle.

No dice; is this a Windows-only workaround? I tried it under Linux with CUDA 3.1 (FWIW, I had seen that same page before and tried it with CUDA 3.0, with the same result):

ihaque@dev:~/paper/branches/color$ nvcc -O3 -arch sm_11 -o deviceOverlay.ptx -ptx deviceOverlay.cu

ihaque@dev:~/paper/branches/color$ ptxas -arch sm_11 deviceOverlay.ptx -o deviceOverlay.cubin

ihaque@dev:~/paper/branches/color$ python /home/ihaque/packages/decuda/decuda.py deviceOverlay.cubin

Traceback (most recent call last):

  File "/home/ihaque/packages/decuda/decuda.py", line 92, in <module>

	main()

  File "/home/ihaque/packages/decuda/decuda.py", line 55, in main

	cu = load(args[0])

  File "/home/ihaque/packages/decuda/CubinFile.py", line 258, in load

	inst = [int(x,0) for x in inst]

ValueError: invalid literal for int() with base 0: '\x7fELF\x02\x01\x013\x02\x00\x00\x00\x00\x00\x00\x00\x02\x00\xbe\x00\x01\x00\x00\x00...' [raw ELF bytes truncated]

Setting the CUBINS_ARE_ELF flag in /usr/local/cuda/nvcc.profile to 1 or 0, or deleting it entirely, makes no difference; it always outputs ELF. Any ideas?

I don’t have experience with that workaround myself except in the 3.0 beta. They might have removed it after that.

In Linux you could try extracting the contents of the .text section using the ‘readelf’ utility and feeding that to decuda manually (using ‘raw’ mode). I might implement such a workaround, though it would only work on Linux/Unix and only when the utility is present.

Thanks for the help. I managed to run decuda, but my decuda output does not contain any load/store operations, even though my PTX code does. Why is that?

Sorry for my naivety, but I also wanted to know: what is cudaasm used for, and how do I run it?

Thanks again.

Legacy cubin support was removed in 3.0 final.

Loads/stores in decuda output are represented as mov instructions with one operand being a c (constant), g (global), or s (shared) reference. cudaasm is an unofficial assembler written by the decuda folks. I’ve never used it myself, so I can’t help you there…

Ah, thanks for the tip. I’ve written a quick script to parse the ELF data (using objdump) from new-style CUBINs and output something that looks like an old CUBIN for decuda: http://forums.nvidia.com/index.php?showtopic=172577

tmurray: Anything y’all need from me to handle the compiler/optimizer bug?

Be careful using that. ;) (that’s actually for other folks who find this thread, you guys probably know this)

There’s a crapload of stuff in 3.0 that decuda currently can’t handle.