NVCC potentially missing a memory optimization

Can NVCC safely remove the first store to a[0] in this code? There are no loads from a after this store, only a store at the end that overwrites the previous value.

__global__ void kernel(int* a, int N) {
    a[0] = 0;
    int temp = 0;
    for (int j = 0; j < N; j++) {
        temp += 3;
    }
    a[0] = temp;
}

Compiled with NVCC 12.0 using nvcc -O3 -arch=sm_80, I get the following PTX and SASS (also viewable on Godbolt):

.version 8.0
.target sm_80
.address_size 64

	// .globl	_Z6kernelPii

.visible .entry _Z6kernelPii(
	.param .u64 _Z6kernelPii_param_0,
	.param .u32 _Z6kernelPii_param_1
)
{
	.reg .pred 	%p<6>;
	.reg .b32 	%r<27>;
	.reg .b64 	%rd<3>;

	ld.param.u64 	%rd2, [_Z6kernelPii_param_0];
	ld.param.u32 	%r14, [_Z6kernelPii_param_1];
	cvta.to.global.u64 	%rd1, %rd2;
	mov.u32 	%r26, 0;
	st.global.u32 	[%rd1], %r26;
	setp.lt.s32 	%p1, %r14, 1;
	@%p1 bra 	$L__BB0_7;

	add.s32 	%r18, %r14, -1;
	and.b32  	%r25, %r14, 3;
	setp.lt.u32 	%p2, %r18, 3;
	mov.u32 	%r26, 0;
	@%p2 bra 	$L__BB0_4;

	sub.s32 	%r21, %r14, %r25;

$L__BB0_3:
	add.s32 	%r26, %r26, 12;
	add.s32 	%r21, %r21, -4;
	setp.ne.s32 	%p3, %r21, 0;
	@%p3 bra 	$L__BB0_3;

$L__BB0_4:
	setp.eq.s32 	%p4, %r25, 0;
	@%p4 bra 	$L__BB0_7;

	.pragma "nounroll";
$L__BB0_6:
	add.s32 	%r26, %r26, 3;
	add.s32 	%r25, %r25, -1;
	setp.ne.s32 	%p5, %r25, 0;
	@%p5 bra 	$L__BB0_6;

$L__BB0_7:
	st.global.u32 	[%rd1], %r26;
	ret;

}

kernel(int*, int):
 IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] 
 IMAD.MOV.U32 R0, RZ, RZ, c[0x0][0x168] 
 ULDC.64 UR4, c[0x0][0x118] 
 IMAD.MOV.U32 R2, RZ, RZ, c[0x0][0x160] 
 IMAD.MOV.U32 R3, RZ, RZ, c[0x0][0x164] 
 ISETP.GE.AND P0, PT, R0, 0x1, PT 
 STG.E [R2.64], RZ 
 @!P0 BRA `(.L_x_0) 
 IADD3 R4, R0.reuse, -0x1, RZ 
 LOP3.LUT R0, R0, 0x3, RZ, 0xc0, !PT 
 ISETP.GE.U32.AND P0, PT, R4, 0x3, PT 
 @!P0 BRA `(.L_x_1) 
 IADD3 R4, -R0, c[0x0][0x168], RZ 
 @!P0 BRA `(.L_x_2) 
 ISETP.GT.AND P1, PT, R4, 0xc, PT 
 PLOP3.LUT P0, PT, PT, PT, PT, 0x80, 0x0 
 @!P1 BRA `(.L_x_3) 
 PLOP3.LUT P0, PT, PT, PT, PT, 0x8, 0x0 
 IADD3 R4, R4, -0x10, RZ 
 IADD3 R5, R5, 0x30, RZ 
 ISETP.GT.AND P1, PT, R4, 0xc, PT 
 @P1 BRA `(.L_x_4) 
 ISETP.GT.AND P1, PT, R4, 0x4, PT 
 @P1 PLOP3.LUT P0, PT, PT, PT, PT, 0x8, 0x0 
 @P1 IADD3 R4, R4, -0x8, RZ 
 @P1 IADD3 R5, R5, 0x18, RZ 
 ISETP.NE.OR P0, PT, R4, RZ, P0 
 @!P0 BRA `(.L_x_1) 
 IADD3 R4, R4, -0x4, RZ 
 IADD3 R5, R5, 0xc, RZ 
 @P0 BRA `(.L_x_2) 
 @!P0 BRA `(.L_x_0) 
 IADD3 R0, R0, -0x1, RZ 
 IADD3 R5, R5, 0x3, RZ 
 @P0 BRA `(.L_x_5) 
 STG.E [R2.64], R5 
 BRA `(.L_x_6)

There are two store instructions and no loads, so could the first store be removed? If I replace the loop bound N with a constant, NVCC optimizes the loop away and stores only the final value of temp to a[0].
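For reference, this is the constant-bound variant mentioned above (a hypothetical sketch; the specific bound 100 is arbitrary, and the comments reflect the compiler behavior observed above):

```cuda
// Hypothetical variant with a compile-time trip count.
// With a constant bound, nvcc -O3 folds the loop to a compile-time
// constant (3 * 100) and, per the observation above, emits only the
// final store -- the initial dead store is eliminated.
__global__ void kernel_const(int* a) {
    a[0] = 0;                      // eliminated by the compiler in this variant
    int temp = 0;
    for (int j = 0; j < 100; j++) {
        temp += 3;
    }
    a[0] = temp;                   // single remaining store of the constant
}
```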

I cannot think of a reason why that initial store has to be kept and cannot be eliminated by the compiler as redundant. a[] is not a volatile data object, there is no potential for aliasing, and execution is guaranteed to reach the second store by construction. But I am not a compiler engineer and may have overlooked something. I checked other versions of CUDA and do not see the initial store eliminated in any of them.

Given the above, why does that first store appear in the source code in the first place? If you were to file an enhancement request with NVIDIA on this, my expectation is that it would be assigned low priority, as there is a simple workaround and the performance impact in real-life contexts appears to be low.

The code above is a minimal reproducer of the issue. The actual code is a kernel that is fused from two individual kernels, the first of which stores to a and the second reads from and writes to it. The value stored in the original first kernel will always be read in the second one, so I had hoped that the compiler would recognize that and eliminate the first store automatically (I made sure there was no aliasing with other arrays using __restrict__).
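The shape described above might be sketched as follows (the kernel names and loop bodies are hypothetical stand-ins; only the store/load pattern matches the description):

```cuda
// Hypothetical shape of the original pair of kernels.
__global__ void producer(int* __restrict__ a, int N) {
    int v = 0;
    for (int j = 0; j < N; j++) v += 3;
    a[0] = v;              // this store feeds the second kernel
}

__global__ void consumer(int* __restrict__ a) {
    a[0] = a[0] * 2;       // reads the produced value, writes back
}

// Hand-fused version: the bodies are simply concatenated. The
// intermediate store is now redundant (the value is still live in
// the register v), but the compiler does not remove it.
__global__ void fused(int* __restrict__ a, int N) {
    int v = 0;
    for (int j = 0; j < N; j++) v += 3;
    a[0] = v;              // redundant after fusion
    a[0] = a[0] * 2;       // could be rewritten by hand as: a[0] = v * 2;
}
```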

Fused how, i.e. by which mechanism? I am wondering whether the superfluous operation could be removed at the source level as part of the fusing process. I have addressed similar problems in the past by creating processing cores that are called by a thin wrapper that just does some loading and storing. The cores can thus be textually merged without impacting the loading/storing, and since the cores get inlined into the wrapper function there is no overhead from function calls.
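A minimal sketch of that wrapper/core pattern, assuming the same hypothetical producer/consumer computation (names and bodies are illustrative, not the actual code):

```cuda
// Hypothetical illustration of the wrapper/core pattern. Each core is
// pure register computation; the wrapper owns all loads and stores, so
// cores can be merged without touching any memory operations.
__device__ __forceinline__ int core1(int N) {
    int v = 0;
    for (int j = 0; j < N; j++) v += 3;
    return v;
}

__device__ __forceinline__ int core2(int v) {
    return v * 2;
}

// Thin wrapper with a single load/store boundary. Fusing two kernels
// amounts to chaining their cores here; the intermediate value stays
// in a register and never touches global memory.
__global__ void fused_wrapper(int* __restrict__ a, int N) {
    a[0] = core2(core1(N));    // one store, no redundant traffic
}
```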

How much of a performance difference do you observe when you manually remove the superfluous operation from the fused kernel?

I fused the kernels by hand by just concatenating the bodies of the two kernels (which is still functionally correct in this case). I’m studying the feasibility of automating this process and this case occurs frequently in the code I’m looking at. It might be possible to automatically remove the extra store at the source level but I would prefer that the compiler do this as I believe a proper implementation requires dataflow analysis.

On my A100, the fused kernel is around 15% slower than calling the two separately.

I realize I should have phrased my question more clearly. Let me try again:

With the fused kernel, what is the performance with and without the “missed memory optimization”?

Clearly the fusing of kernels can have performance benefits, as it potentially allows the compiler to optimize more aggressively, e.g. by adding degrees of freedom and/or supplying additional information that it can use.

You are obviously free to file an enhancement request with NVIDIA on anything you wish at any time. I am just asking what the compiler engineers would likely ask when you do.

I see, I misunderstood your question. In this case, the fused kernel without this optimization is also around 15% slower than with it.

In that case, I would recommend filing an enhancement request / performance bug with NVIDIA. In my experience, turnaround on those can take a while, so you might want to employ a mitigation strategy in the meantime (or live with the lower performance for now).