Is mixing pointers to two different memory types (global memory and a register) in the same predicated selection allowed?

In an attempt to write vector addition without a branch, I tried predicated operations:

__global__ void add(const float* a, const float* b, float* c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const float aTmp = 0.0f;   // dummy load source for out-of-range threads
    const float bTmp = 0.0f;   // dummy load source for out-of-range threads
    __shared__ float tmp;      // dummy store target for out-of-range threads
    const bool condition = idx < 10000;
    const float* aPtr = condition ? &a[idx] : &aTmp;
    const float* bPtr = condition ? &b[idx] : &bTmp;
    float* cPtr = condition ? &c[idx] : &tmp;
    *cPtr = *aPtr + *bPtr;
}

PTX output on godbolt:

.visible .entry add(float const*, float const*, float*)(
	.param .u64 add(float const*, float const*, float*)_param_0,
	.param .u64 add(float const*, float const*, float*)_param_1,
	.param .u64 add(float const*, float const*, float*)_param_2
)
{
	mov.u64 	%SPL, __local_depot0;
	cvta.local.u64 	%SP, %SPL;
	ld.param.u64 	%rd8, [add(float const*, float const*, float*)_param_0];
	ld.param.u64 	%rd9, [add(float const*, float const*, float*)_param_1];
	ld.param.u64 	%rd6, [add(float const*, float const*, float*)_param_2];
	add.u64 	%rd10, %SP, 0;
	add.u64 	%rd11, %SPL, 0;
	add.u64 	%rd12, %SP, 4;
	add.u64 	%rd13, %SPL, 4;
	mov.u32 	%r1, %ntid.x;
	mov.u32 	%r2, %ctaid.x;
	mov.u32 	%r3, %tid.x;
	mad.lo.s32 	%r4, %r2, %r1, %r3;
	mov.u32 	%r5, 0;
	st.local.u32 	[%rd11], %r5;
	st.local.u32 	[%rd13], %r5;
	setp.gt.s32 	%p1, %r4, 9999;
	setp.lt.s32 	%p2, %r4, 10000;
	cvt.s64.s32 	%rd1, %r4;
	mul.wide.s32 	%rd14, %r4, 4;
	add.s64 	%rd15, %rd8, %rd14;
	selp.b64 	%rd2, %rd15, %rd10, %p2;
	add.s64 	%rd16, %rd9, %rd14;
	selp.b64 	%rd3, %rd16, %rd12, %p2;
	mov.u32 	%r6, add(float const*, float const*, float*)::tmp;
	{ .reg .b64 %tmp;
		cvt.u64.u32 	%tmp, %r6;
		cvta.shared.u64 	%rd18, %tmp; }
		@%p1 bra 	$L__BB0_2;
		shl.b64 	%rd17, %rd1, 2;
		add.s64 	%rd18, %rd6, %rd17;
$L__BB0_2:
		ld.f32 	%f1, [%rd3];
		ld.f32 	%f2, [%rd2];
		add.f32 	%f3, %f2, %f1;
		st.f32 	[%rd18], %f3;
		ret;
	}

But when I change the fallback store target of the output predication to a register instead of shared memory:

__global__ void add(const float* a, const float* b, float* c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const float aTmp = 0.0f;
    const float bTmp = 0.0f;
    float tmp; // now a register
    const bool condition = idx < 10000;
    const float* aPtr = condition ? &a[idx] : &aTmp; 
    const float* bPtr = condition ? &b[idx] : &bTmp;
    float* cPtr = condition ? &c[idx] : &tmp;
    *cPtr = *aPtr + *bPtr;
}

the PTX output becomes just a ret:

.visible .entry add(float const*, float const*, float*)(
	.param .u64 add(float const*, float const*, float*)_param_0,
	.param .u64 add(float const*, float const*, float*)_param_1,
	.param .u64 add(float const*, float const*, float*)_param_2
)
{
	ret;
}

and its SASS is this:

add(float const*, float const*, float*):
 MOV R1, c[0x0][0x20] 
 NOP 
 NOP 
 NOP 
 EXIT 
.L_x_0:
 BRA `(.L_x_0) 
.L_x_1:

Is this a bug in NVCC 12.5.1 (the version godbolt uses)?

Or perhaps mixing global memory and private memory in the same predicated selection is not legal?

Link to the compiler explorer: Compiler Explorer

godbolt can use at least up to 12.9.1 currently, and many other versions below that. It’s a setting/choice you make. Godbolt can also remember settings you have made previously (I’m not sure how, perhaps by browser signature or cookie). If you want to change the nvcc version godbolt uses in your example, click on the tab at the top left titled “NVCC 12.5.1 (Editor #1)”, and then in the selection box that appears, change “NVCC 12.5.1” to some other value.

As an additional curious observation: in your failing case, if I change the type of the idx variable to unsigned, I observe the presumably “desired” behavior.
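
For reference, the change amounts to this (just a sketch; the only difference from your failing case is the type of idx):

__global__ void add(const float* a, const float* b, float* c) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x; // unsigned instead of int
    const float aTmp = 0.0f;
    const float bTmp = 0.0f;
    float tmp; // still a register/local variable
    const bool condition = idx < 10000;
    const float* aPtr = condition ? &a[idx] : &aTmp;
    const float* bPtr = condition ? &b[idx] : &bTmp;
    float* cPtr = condition ? &c[idx] : &tmp;
    *cPtr = *aPtr + *bPtr;
}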

I can’t explain it.

Yes, it produces this:

add(float const*, float const*, float*):
 MOV R1, c[0x0][0x20] 
 MOV R9, RZ 
 S2R R0, SR_CTAID.X         
 IADD32I R1, R1, -0x8 
 S2R R2, SR_TID.X         
 XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ 
 XMAD R2, R0.reuse, c[0x0] [0x8], R2 
 XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2 
 ISETP.GT.U32.AND P0, PT, R0, 0x270f, PT 
 @!P0 SHL R8, R0.reuse, 0x2 
 @!P0 SHR.U32 R6, R0, 0x1e 
 MOV R0, RZ 
 @!P0 IADD R2.CC, R8.reuse, c[0x0][0x148] 
 @!P0 IADD.X R3, R6.reuse, c[0x0][0x14c] 
 @!P0 IADD R4.CC, R8, c[0x0][0x140] 
 @!P0 LDG.E R0, [R2]         
 @!P0 IADD.X R5, R6, c[0x0][0x144] 
 @!P0 LDG.E R9, [R4] 
 LOP.OR R7, R1, c[0x0][0x4] 
 @!P0 IADD R7.CC, R8, c[0x0][0x150] 
 MOV R8, RZ 
 @!P0 IADD.X R8, R6, c[0x0][0x154] 
 LEA R6.CC, R7.reuse, RZ 
 LEA.HI.X P0, R7, R7, RZ, R8 
 FADD R0, R9, R0 
 ST.E [R6], R0, P0 
 EXIT 
.L_x_0:
 BRA `(.L_x_0) 
 NOP
 NOP
.L_x_1:

I’ll use unsigned int rather than int for indexing when there are no negative values and no comparisons against negative numbers.

With your code as-is, the best description (guess) I could offer is that the compiler seems to assume that the condition assignment always results in condition being false. That would explain things, and you can try it easily enough with your code/test case: just set condition to false or true, and observe the result.
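
A minimal sketch of that experiment (your failing kernel, unchanged except for the hard-coded condition):

__global__ void add(const float* a, const float* b, float* c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const float aTmp = 0.0f;
    const float bTmp = 0.0f;
    float tmp;
    const bool condition = false; // hard-coded; try true as well
    const float* aPtr = condition ? &a[idx] : &aTmp;
    const float* bPtr = condition ? &b[idx] : &bTmp;
    float* cPtr = condition ? &c[idx] : &tmp;
    *cPtr = *aPtr + *bPtr;
}

If I reason it through: with condition hard-coded to false, every pointer targets thread-private data that is never read back, so the whole kernel is dead code and an empty function body would be a legitimate result. That would match the empty PTX you are seeing.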

But I can’t imagine why the compiler would believe that. However, I’m not an expert on these things. I know that signed/unsigned comparisons can create trouble, but this doesn’t appear to be that. Also, if there is UB, then the compiler is allowed to do unusual stuff, but that doesn’t appear to be the case here either, as far as I can tell.

If no one else comments on it, and this were my code, I would file a bug. I think either it is an actual compiler bug, or else there is some arcane/non-obvious signedness artifact or UB artifact happening. Just guesswork here; I can’t explain it.


Thank you. I filed a bug. Bug 5515847: Predicated pointer selection from global mem and register address causes empty PTX function body. | NVIDIA Developer

(No worries. Sometimes I ask too many questions and may cause some irritation by not always following the guidelines; I’m sorry for this.)

Your basic question seems to be whether pointers to registers, pointers to global memory and pointers to shared memory can be stored in the same format.

If the compiler cannot resolve (by reasoning) a pointer to registers at compile time, it will use local memory instead.

Apart from that, I know there are generic unified pointers, but it could be that the state space they point to has to be known at some point? Although it is something basic, I never had the need to mix them.
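
For illustration, a minimal sketch (a hypothetical kernel, not from this thread) of a generic pointer whose state space is only resolved at run time:

__global__ void pickSource(const float* g, float* out, int useShared) {
    __shared__ float s[32];       // assumes blockDim.x <= 32
    s[threadIdx.x] = 1.0f;
    __syncthreads();
    // Generic pointer: may refer to shared or global memory depending on useShared.
    // The compiler cannot prove the state space, so it has to emit a generic load.
    const float* p = useShared ? &s[threadIdx.x] : &g[threadIdx.x];
    out[threadIdx.x] = *p;
}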


For example, if I try to implement a cache inside a kernel like this:

Level 1: in registers
Level 2: in shared memory
Level 3: in device memory
Level 4: in pinned host mapped memory or unified memory


auto level12 = condition1 ? &r_Data : &s_Data;  // L1 (register) vs. L2 (shared)
auto level3  = condition2 ? &d_Data : level12;  // L3 (device) vs. L1/L2
auto ptr     = condition3 ? &u_Data : level3;   // L4 (pinned/unified) vs. lower levels
auto result  = compute(ptr);
updateCache(ptr, result);

it should be possible to use only the selected memory region for the same operation.

My best guess is that the compiler considers something in the code to invoke undefined behavior, which then leads to the generation of an empty function body. Maybe the compiler is in error, but I am not even going to attempt language lawyering this case, as my experience is that I do not hold a candle to compiler engineers in a dispute over what is and is not UB.

When you get a diagnosis back in response to your bug report, it would be helpful if you could update this thread with that information so we all can learn.


A pointer to a register will not work (in a performant way). Registers cannot be indexed at run time (and still stay registers).

It is better to select between the register and something else by value.
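
A minimal sketch of what selecting by value could look like for the original kernel (an illustration under the assumption of n elements, not code from this thread):

__global__ void add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const bool inRange = idx < n;
    // Select the values, not the addresses, so the temporaries can stay in registers.
    const float av = inRange ? a[idx] : 0.0f;
    const float bv = inRange ? b[idx] : 0.0f;
    const float sum = av + bv;
    if (inRange) c[idx] = sum; // single guarded store, typically compiled to a predicated ST
}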


The way it was explained to me years ago: thread-local data objects are assigned to local memory (which is thread-local storage). As an optimization, the compiler places such data objects into a register whenever possible.

If the address of a thread-local data object is taken and the resulting pointer is manipulated, that data object must remain in local memory, I would think, since there is no runtime indexing into the register file (as @Curefab already stated).
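
As an illustration (a hypothetical kernel; the expected outcome is my assumption): once the address of a thread-local variable escapes into a runtime-selected pointer, the variable has to stay addressable, so I would expect it to live in local memory (st.local/ld.local in the PTX) rather than in a register:

__global__ void spillDemo(float* out, int sel) {
    float tmp = 1.0f;                   // would normally live in a register
    float* p = (sel != 0) ? out : &tmp; // address taken, not resolvable at compile time
    *p += 2.0f;                         // generic store: global or local, decided at run time
    out[0] = tmp;                       // keeps tmp observable so it is not optimized away
}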


There’s an update from Nvinfo:

We can replicate the behavior in house on the latest CUDA 13.0 as well.
Our compiler engineering team will investigate this further;
we will keep you updated here as well as bring the conclusion back to the public forum when it is done.

So within days we will know. But as said previously, it is probably just that taking the address of a register is not allowed (so it takes the address of local memory instead).


Hi dear developers,

We can confirm this is a bug. NVBUG 5515847 is fixed and verified against the reproducer originally described in this thread. The fix is aimed at the CUDA 13.2 release.

[Edited] The fix has been moved up to CUDA 13.1 due to its severity. FYI.


Thank you for fixing this bug.

Was it about taking the address of a private register value (maybe not doable?), or about local pointers (fewer bits required, etc.) and global pointers (48-64 bits used) having different sizes? Was the signedness of the index variable part of the bug?

There are probably architectural difficulties in combining such pointers, but those should only lead to performance degradation while still preserving the correctness of the program.

