Is mixing two different memory type pointers in same predication allowed?(global mem and register)

tugrul_192bit · September 12, 2025, 3:51pm

In an attempt to write vector-addition without a branch, I tried predicated operations:

__global__ void add(const float* a, const float* b, float* c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const float aTmp = 0.0f;
    const float bTmp = 0.0f;
    __shared__ float tmp;
    const bool condition = idx < 10000;
    const float* aPtr = condition ? &a[idx] : &aTmp; 
    const float* bPtr = condition ? &b[idx] : &bTmp;
    float* cPtr = condition ? &c[idx] : &tmp;
    *cPtr = *aPtr + *bPtr;
}

output on godbolt for PTX:

.visible .entry add(float const*, float const*, float*)(
	.param .u64 add(float const*, float const*, float*)_param_0,
	.param .u64 add(float const*, float const*, float*)_param_1,
	.param .u64 add(float const*, float const*, float*)_param_2
)
{
	mov.u64 	%SPL, __local_depot0;
	cvta.local.u64 	%SP, %SPL;
	ld.param.u64 	%rd8, [add(float const*, float const*, float*)_param_0];
	ld.param.u64 	%rd9, [add(float const*, float const*, float*)_param_1];
	ld.param.u64 	%rd6, [add(float const*, float const*, float*)_param_2];
	add.u64 	%rd10, %SP, 0;
	add.u64 	%rd11, %SPL, 0;
	add.u64 	%rd12, %SP, 4;
	add.u64 	%rd13, %SPL, 4;
	mov.u32 	%r1, %ntid.x;
	mov.u32 	%r2, %ctaid.x;
	mov.u32 	%r3, %tid.x;
	mad.lo.s32 	%r4, %r2, %r1, %r3;
	mov.u32 	%r5, 0;
	st.local.u32 	[%rd11], %r5;
	st.local.u32 	[%rd13], %r5;
	setp.gt.s32 	%p1, %r4, 9999;
	setp.lt.s32 	%p2, %r4, 10000;
	cvt.s64.s32 	%rd1, %r4;
	mul.wide.s32 	%rd14, %r4, 4;
	add.s64 	%rd15, %rd8, %rd14;
	selp.b64 	%rd2, %rd15, %rd10, %p2;
	add.s64 	%rd16, %rd9, %rd14;
	selp.b64 	%rd3, %rd16, %rd12, %p2;
	mov.u32 	%r6, add(float const*, float const*, float*)::tmp;
	{ .reg .b64 %tmp;
		cvt.u64.u32 	%tmp, %r6;
		cvta.shared.u64 	%rd18, %tmp; }
		@%p1 bra 	$L__BB0_2;
		shl.b64 	%rd17, %rd1, 2;
		add.s64 	%rd18, %rd6, %rd17;
$L__BB0_2:
		ld.f32 	%f1, [%rd3];
		ld.f32 	%f2, [%rd2];
		add.f32 	%f3, %f2, %f1;
		st.f32 	[%rd18], %f3;
		ret;
	}

but when I change the output predication to a register instead of shared memory:

__global__ void add(const float* a, const float* b, float* c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const float aTmp = 0.0f;
    const float bTmp = 0.0f;
    float tmp; // now a register
    const bool condition = idx < 10000;
    const float* aPtr = condition ? &a[idx] : &aTmp; 
    const float* bPtr = condition ? &b[idx] : &bTmp;
    float* cPtr = condition ? &c[idx] : &tmp;
    *cPtr = *aPtr + *bPtr;
}

PTX output becomes a ret

.visible .entry add(float const*, float const*, float*)(
	.param .u64 add(float const*, float const*, float*)_param_0,
	.param .u64 add(float const*, float const*, float*)_param_1,
	.param .u64 add(float const*, float const*, float*)_param_2
)
{
	ret;
}

and its SASS is this:

add(float const*, float const*, float*):
 MOV R1, c[0x0][0x20] 
 NOP 
 NOP 
 NOP 
 EXIT 
.L_x_0:
 BRA `(.L_x_0) 
.L_x_1:

is this a bug in NVCC 12.5.1 (godbolt uses this)?

Perhaps mixing global memory and private memory in same predicate is not legal?

Link to the compiler explorer: Compiler Explorer

Robert_Crovella · September 12, 2025, 4:42pm

godbolt can use at least up to 12.9.1 currently, and many other versions below that. It’s a setting/choice you make. Godbolt can also remember settings you have made previously (i’m not sure how, perhaps by browser signature or cookie). If you want to change the nvcc version in use by godbolt in your example, click on the tab at the top left titled “NVCC 12.5.1 (Editor #1)” and then in the selection box that appears, change “NVCC 12.5.1” to some other value.

As an additional curious observation: in your failing case, if I change the type of the idx variable to unsigned, I observe the presumably “desired” behavior.

I can’t explain it.

tugrul_192bit · September 12, 2025, 4:47pm

yes, it produces this:

add(float const*, float const*, float*):
 MOV R1, c[0x0][0x20] 
 MOV R9, RZ 
 S2R R0, SR_CTAID.X         
 IADD32I R1, R1, -0x8 
 S2R R2, SR_TID.X         
 XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ 
 XMAD R2, R0.reuse, c[0x0] [0x8], R2 
 XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2 
 ISETP.GT.U32.AND P0, PT, R0, 0x270f, PT 
 @!P0 SHL R8, R0.reuse, 0x2 
 @!P0 SHR.U32 R6, R0, 0x1e 
 MOV R0, RZ 
 @!P0 IADD R2.CC, R8.reuse, c[0x0][0x148] 
 @!P0 IADD.X R3, R6.reuse, c[0x0][0x14c] 
 @!P0 IADD R4.CC, R8, c[0x0][0x140] 
 @!P0 LDG.E R0, [R2]         
 @!P0 IADD.X R5, R6, c[0x0][0x144] 
 @!P0 LDG.E R9, [R4] 
 LOP.OR R7, R1, c[0x0][0x4] 
 @!P0 IADD R7.CC, R8, c[0x0][0x150] 
 MOV R8, RZ 
 @!P0 IADD.X R8, R6, c[0x0][0x154] 
 LEA R6.CC, R7.reuse, RZ 
 LEA.HI.X P0, R7, R7, RZ, R8 
 FADD R0, R9, R0 
 ST.E [R6], R0, P0 
 EXIT 
.L_x_0:
 BRA `(.L_x_0) 
 NOP
 NOP
.L_x_1:

I’ll use unsigned int rather than int for indexing without negative values or comparisons against negativity.

Robert_Crovella · September 12, 2025, 5:03pm

With your code as-is, the best description (guess) I could offer is that the compiler seems to be imagining that the result of the condition assignment is that condition is always false. That would explain things, and you can try it easily enough with your code/test case. Just set condition to false or true, and observe the result.

But I can’t imagine why the compiler believes that. However I’m not an expert on these things. I know that signed/unsigned comparisons can create trouble, but this doesn’t appear to be that. Also if there is UB, then the compiler is allowed to do unusual stuff, but it doesn’t appear to be that, to me, either.

If no one else comments on it, then if this were my code I would file a bug. I think either it is an actual compiler bug, or else there is some arcane/not obvious signeded-ness artifact or UB-ness artifact happening. Just guesswork here. I can’t explain it.

tugrul_192bit · September 12, 2025, 5:13pm

Thank you. I filed a bug. Bug 5515847: Predicated pointer selection from global mem and register address causes empty PTX function body. | NVIDIA Developer

(No worries. Sometimes I ask too many questions, and may cause some irritation due to not following guidelines sometimes, I’m sorry for this.)

Curefab · September 13, 2025, 11:23am

Your basic question seems to be whether pointers to registers, pointers to global memory and pointers to shared memory can be stored in the same format.

If the compiler cannot resolve (by reasoning) a pointer to registers at compile-time, it will use local memory.

Apart from that, I know there are generic unified pointers, but it could be that it has to be known at some point, in what state space they point to? Although, it is something basic, I never had the need to mix.

tugrul_192bit · September 13, 2025, 12:53pm

For example, if I try to implement a cache inside a kernel like this:

Level 1: in registers
Level 2: in shared memory
Level 3: in device memory
Level 4: in pinned host mapped memory or unified memory


auto level12 = condition1 ? &r_Data : &s_Data;
auto level3 = condition2 ? &d_Data : level12;
auto ptr = condition3 ? &u_Data : level3;
auto result = compute(ptr);
updateCache(ptr, result);

it should be able to use only the selected memory region for the same operation.

njuffa · September 13, 2025, 11:04pm

My best guess is that the compiler considers something in the code to invoke undefined behavior, which then leads to the generation of an empty function body. Maybe the compiler is in error, but I am not even going to attempt language lawyering this case, as my experience is that I do not hold a candle to compiler engineers in a dispute over what is and is not UB.

When you get a diagnosis back in response to your bug report, it would be helpful if you could update this thread with that information so we all can learn.

Curefab · September 14, 2025, 1:40pm

A pointer to registers will not work (in a performant way) with registers. Registers cannot be indexed during runtime (and still stay registers).

Better select the register vs. something else by value.

njuffa · September 14, 2025, 6:22pm

The way it was explained to me years ago: Thread local data objects are assigned to local memory (which is thread-local storage). As an optimization, the compiler places such data objects into a register whenever possible.

If the address of a thread-local data object is taken and the resulting pointer is manipulated, that data object must remain in local memory, I would think, since there is no runtime indexing into the register file (as @Curefab already stated).

tugrul_192bit · September 17, 2025, 3:17pm

There’s an update from Nvinfo:

We can replicate the behavior in house on latest CUDA 13.0 as well . 
Our compiler engineering team will investigate this further , 
we will keep you updated here as well as bring the conclusion back to public Forum when it is done.

So within days, we will know. But as said previously, probably just taking address of register is not allowed (so it takes address of local mem instead).

Yuki_Ni · September 26, 2025, 3:32am

Hi dear developers ,

We can confirm this is a bug , NVBUG 5515847 is fixed and verified on the original described reproducer in this thread . The fix will aim release CUDA 13.2 .

[Edited] The fix is decided to get earlier to CUDA 13.1 due to its severity . FYI .

tugrul_192bit · September 26, 2025, 7:11am

Thank you for fixing this bug.

Was it about taking address of private register value (not doable maybe?) or having a different size for local pointer (less required bits, etc) and global pointer (48-64bits of pointer used)? Was the signedness of index variable a part of bug?

Curefab · September 26, 2025, 10:28am

Probably there are architectural difficulties of combining pointers, but those should just lead to performance degradations, while still keeping up correctness of the program.

system · October 10, 2025, 10:28am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Possible bug: global variables, predication nvcc-generated PTX contains race condition on store to g CUDA Programming and Performance	0	929	September 17, 2009
pointer to shared memory compiler problems CUDA Programming and Performance	19	14795	June 7, 2008
Failure of NVCC to recognize memory type CUDA Programming and Performance	4	5692	August 22, 2007
predication in inline PTX CUDA Programming and Performance	2	2637	February 2, 2015
Error: "unaligned memory accesses not supported"? CUDA Programming and Performance	6	11196	April 2, 2009
Strange memory gremlins Getting pwned by pointers CUDA Programming and Performance	9	12274	July 1, 2009
Qualifying Pointers referring to const/shared data CUDA Programming and Performance	6	3863	December 17, 2007
Compiler error: Predicate variable must be in register state space CUDA Programming and Performance	2	2466	March 25, 2010
Pointer arithmetic with shared memory CUDA Programming and Performance	1	1460	February 7, 2010
How to tell a pointer points to global memory CUDA Programming and Performance	7	4406	February 1, 2008

Is mixing two different memory type pointers in same predication allowed?(global mem and register)

Related topics