Half2 atomics generate unused code

I’m trying to understand how various flavours of atomics (add) are implemented at the SASS level, so I’ve written a few code samples. I’m mostly interested in compute architectures 8.6 and 8.9.

From my tests, it seems that float and int atomic adds are handled with a single dedicated RED instruction. For instance:

__global__ void global_atomicAdd_float(float* __restrict__ a) {
    const float value = 1.0f;
    atomicAdd(a, value);
}

compiles to the following:

global_atomicAdd_float(float*):
IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
IMAD.MOV.U32 R5, RZ, RZ, 0x3f800000
MOV R2, c[0x0][0x160]
ULDC.64 UR4, c[0x0][0x118]
MOV R3, c[0x0][0x164]
RED.E.ADD.F32.FTZ.RN.STRONG.GPU [R2.64], R5
EXIT

I observe very similar SASS code for ints, doubles and uint64s, with various overloads of the RED instruction. However, things get much weirder with the half2 type:

__global__ void global_atomicAdd_half2(__half2* __restrict__ a) {
    const __half2 value = __float2half2_rn(1.0f);
    atomicAdd(a, value);
}

Which gets compiled to the following:

global_atomicAdd_half2(__half2*):
IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
MOV R5, 0x3c003c00
IMAD.MOV.U32 R2, RZ, RZ, c[0x0][0x160]
MOV R3, c[0x0][0x164]
ULDC.64 UR4, c[0x0][0x118]
ATOM.E.ADD.F16x2.RN.STRONG.GPU P0, RZ, [R2.64], R5
@P0 EXIT
QSPC.E.S P0, RZ, [R2]
@!P0 BRA (.L_x_4)
ULDC UR4, c[0x0][0x160]
ULOP3.LUT UR4, UR4, 0xffffff, URZ, 0xc0, !UPT
.L_x_5:
LDS R2, [UR4]
IMAD.U32 R0, RZ, RZ, UR4
HADD2 R3, R2, 1, 1
ATOMS.CAST.SPIN R3, [R0], R2, R3
ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT
@!P0 BRA (.L_x_5)
EXIT
.L_x_4:
LD.E R0, [R2.64]
IADD3 R5, R0, 0x3c003c00, RZ
ST.E [R2.64], R5
EXIT

As we can see, the RED instruction is replaced with an ATOM instruction which returns a predicate P0.
Based on the predicate, the threads either exit immediately or perform the addition with a compare-and-swap idiom in shared memory. I’m guessing that the QSPC.E.S instruction tests whether the pointer is in shared memory or not. But it’s weird to do so, because the pointer is a global memory pointer since it’s a kernel argument. Running the kernel with Nsight Compute shows that none of the instructions after the first EXIT are executed, so the added code is entirely useless.

I also tried using “__builtin_assume(__isGlobal(a));” before the atomic add but it has no effect on the generated code.
What’s going on? Is it a bug or is it expected behavior? Is there any way to tell the compiler to remove the extraneous code?
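
For reference, a sketch of the hinted variant described above (the function name is mine):

#include <cuda_fp16.h>

__global__ void global_atomicAdd_half2_hint(__half2* __restrict__ a) {
    // hint that the generic pointer actually refers to global memory;
    // as noted, this does not change the generated SASS
    __builtin_assume(__isGlobal(a));
    const __half2 value = __float2half2_rn(1.0f);
    atomicAdd(a, value);
}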

Only NVIDIA can answer this authoritatively, but what this looks like is that there is no native hardware support for half2 atomics, which are therefore emulated.

The “code that never gets executed” is presumably required for functional correctness but is executed infrequently in normal usage, or at least in this particular context. In other words, it may be the so-called “slow path” of this particular emulation sequence. I don’t recognize the encoding 0x3c00 off the top of my head, but I think that might be 1.0 in FP16.
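
For what it’s worth, a quick device-side check (a sketch using the cuda_fp16.h reinterpret intrinsics) can confirm the encoding:

#include <cuda_fp16.h>
#include <cstdio>

__global__ void check_fp16_one() {
    // reinterpret the bits of __half(1.0f) as an unsigned short
    unsigned short bits = __half_as_ushort(__float2half(1.0f));
    printf("1.0 in FP16 = 0x%04x\n", bits); // prints 0x3c00
}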

If you find that this emulation code causes a significant performance reduction compared to atomics for other data types, you could always file an enhancement request with NVIDIA (via the bug reporting form).

Correct: QSPC stands for Query Space, and .S for shared memory.

Could it be related to Flush to zero?

The float version in your SASS assembly uses FTZ.

According to PTX ISA 8.5

atom.add.f16, atom.add.f16x2, atom.add.bf16 and atom.add.bf16x2 operation requires the .noftz qualifier; it preserves subnormal inputs and results, and does not flush them to zero.

Perhaps you could also try inline assembly with those PTX instructions to see if they are directly translated.
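
For instance, something along these lines (an untested sketch; it uses the generic-address form of the instruction and discards the returned old value via the "_" sink):

#include <cuda_fp16.h>

__global__ void global_atomicAdd_half2_ptx(__half2* __restrict__ a) {
    __half2 value = __float2half2_rn(1.0f);
    unsigned int bits = *reinterpret_cast<unsigned int*>(&value);
    // atom.add.noftz.f16x2 on a generic address, old value discarded
    asm volatile("atom.add.noftz.f16x2 _, [%0], %1;" :: "l"(a), "r"(bits) : "memory");
}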

It is.

Just speculating. I have not attempted to confirm anything. Just using my brainpower and what I believe is public information.

I think the denormal flushing comments are interesting, but at the CUDA C++ level, I don’t know that flushing behavior for atomics is specified. Perhaps someone will tell me if I am wrong. Alternatively a test case could be constructed… My speculation is that the flushing behavior might be different between different overloads of atomicAdd, and perhaps even for state-space differences.
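
A minimal test case along those lines might look like this (a sketch; it atomically adds the smallest positive subnormal float to zero and checks whether the result was flushed):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void ftz_probe(float* a) {
    // 2^-149, the smallest positive subnormal float; if the atomic flushes
    // subnormal inputs/results to zero, *a stays 0.0f
    atomicAdd(a, 1.401298464e-45f);
}

int main() {
    float* d_a = nullptr;
    cudaMalloc(&d_a, sizeof(float));
    cudaMemset(d_a, 0, sizeof(float));
    ftz_probe<<<1, 1>>>(d_a);
    float h_a = 0.0f;
    cudaMemcpy(&h_a, d_a, sizeof(float), cudaMemcpyDeviceToHost);
    printf("atomic result %s flushed to zero\n", h_a == 0.0f ? "was" : "was not");
    cudaFree(d_a);
    return 0;
}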

This threw me for a loop:

LD.E R0, [R2.64]
IADD3 R5, R0, 0x3c003c00, RZ
ST.E [R2.64], R5

For two reasons:

  1. That doesn’t look atomic to me.
  2. On what planet can you do an integer add of a floating point bit pattern with another floating point bit pattern and expect any sensible results?

I don’t have a speculative response to the second one. Perhaps some genius will point out something I don’t know about equivalences between integer and floating point arithmetic.

Regarding the first one, I puzzled over the question of when could a non-atomic add possibly be acceptable? I speculate that the answer may be when a non-_system variant is used on a system-space address (which, AFAIK, is technically a sub-space within the global logical state-space).

It’s also intriguing to me to ask the question, “what does it mean to assign the result of an atomic to a predicate variable”? One idea was the obvious one: if the result is zero, the predicate gets zero; for any other result, the predicate gets 1. But in order for my speculation to hold up, I am proposing that the predicate gets a 1 if the operation was “successfully” performed on a global but non-system space address, and gets zero otherwise. (FWIW, I couldn’t find any explicit mention of predicate usage - e.g. as a destination operand - with atomics in the PTX manual. I may have missed it.)

For atomics, for the “otherwise” case, the two remaining spaces are system and shared. We can use the QSPC instruction to sort that out, and if it is a shared address then we get the shared atomic variant. If it is not a shared state-space address, then we get the only thing left, a system space address. In that case, the atomic need not be atomic.

I got some clues by going back to earlier architectures and playing with different atomic routines (both __global__ and __device__) on godbolt. A few observations:

  • Even on sm_89, I can generate the 3-prong variant for the atomic handler by switching from a kernel/__global__ test case to a __device__ test case. It seems that in the kernel/global test case, the compiler is making a different decision about what is acceptable.

  • On sm_70, we get the same 3-pronged handler, and in this case the non-atomic prong has .SYS markers.

So I have at least 2 unanswered questions (and of course the entire idea here may be bogus).

  1. It’s understood that in the __device__ test case, the compiler must view the incoming pointer as “generic” or ambiguous. It must handle any necessary cases (3, I guess). However, in the __global__ kernel test case, the compiler can know that the incoming pointer is not __shared__. It seems to make this determination in e.g. the float test case, by giving a very simple routine for the global atomic, but not in the __half2 test case.

  2. In the sm_70 godbolt example I linked above, the compiler creates a “reasonable” fallback case for the non-atomic update of the system space, using FADD for the float atomicAdd variant. In the __half2 case, why does the compiler use the odd construct of IADD3 rather than using some construct involving HADD-like instructions?

There are other things I haven’t explained as well. I found support for these going all the way back to sm_52, but in some cases we get a 2-pronged handler rather than 3-pronged. It might be that the global atomic (ATOM/RED) in that case behaves differently for system state space or is not hardwired to be able to feed a predicate register.

It seems like it should be easy to do a first-level test of this using a profiler and targeting a system space address (i.e. host pinned memory), to see if the execution hotspots/sampling changes at the SASS level.
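
A sketch of such a test (hypothetical host code; it assumes the global_atomicAdd_half2 kernel from the first post and a device that supports mapped pinned memory):

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstring>

int main() {
    __half2* h_a = nullptr;
    // host-pinned, device-mapped allocation, i.e. a "system space" target
    cudaHostAlloc(&h_a, sizeof(__half2), cudaHostAllocMapped);
    std::memset(h_a, 0, sizeof(__half2));
    __half2* d_a = nullptr;
    cudaHostGetDevicePointer(&d_a, h_a, 0);
    // profile this launch and compare the SASS-level hotspots/sampling
    global_atomicAdd_half2<<<1, 1>>>(d_a);
    cudaDeviceSynchronize();
    cudaFreeHost(h_a);
    return 0;
}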

Notes:

  1. system space is not formally an actual PTX state space. Please excuse me, I’m using made-up terminology to refer to a subset of the global state space that targets host memory.
  2. “system space” as I have defined it, and “atomic system scope” are related ideas but not the same thing

This kind of pattern comes about naturally when one manipulates the exponent field of floating-point numbers, in particular for the purpose of scaling by powers of two. What function it could have here I do not know.
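
As an illustration of that idiom (not what this SASS is doing, just the general pattern): an integer add of 1 to the FP32 exponent field scales a normal float by two.

#include <cstring>

float times_two_via_exponent(float x) {  // assumes x is normal (not 0, subnormal, inf or NaN)
    unsigned int bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits += 0x00800000u;                 // +1 in the exponent field == multiply by 2
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}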

Things are even weirder now that I look at the generated SASS more closely; here is a pseudo-code interpretation. Feel free to correct me if I’m wrong.

__global__ void global_atomicAdd_half2(__half2* __restrict__ a) {
    const __half2 value = __float2half2_rn(1.0f);
    if (atomicAdd(a, value)) { // test the predicate P0
        return; // this is the only code path that gets executed, even with heavy contention
    }

    if (__isShared(a)) { // a is obviously not a shared memory address
        while (true) { // compare-and-swap idiom
            __half2 v_old = *a; // a should be volatile for correctness but this is just pseudo-code
            __half2 v_new = v_old + 1; // HADD2 R3, R2, 1, 1 ← perhaps there’s an implicit conversion to fp16?
            if (atomicCAS(a, v_old, v_new)) break;
        }
    } else {
        // this path could be taken in case a points to a host address?
        // Could make sense if it was an atomicAdd_system, but that is not the case here.
        // This part makes no sense to me: it’s not atomic and the integer add won’t be correct in many cases.
        int v = *(int*)a;   // load as int
        v += 0x3c003c00;    // integer add of the bit pattern
        *(int*)a = v;       // store as int
    }
}

In any case, I don’t think the flushing behavior is relevant here, especially since there is no control over it for atomic operations.
I hope someone from NVIDIA can clear that up, because it really looks like a bug in my opinion.

Edit: atomicAdd_system works for the float variant (it adds a .SYS on the RED instruction) but does not exist for the __half2 type.
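
For reference, a sketch of the float system-scope variant mentioned in the edit:

__global__ void global_atomicAdd_float_system(float* __restrict__ a) {
    // system-scope atomic (sm_60+); for float this emits a RED ... .SYS,
    // but no __half2 overload of atomicAdd_system is provided
    atomicAdd_system(a, 1.0f);
}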

--ftz=true or --ftz=false does not change code generation for that branch. So it probably is not related.

(On the other hand FP32 exists with both variants - FTZ and no-FTZ; FP16, BF16 and their vector variants and FP64 only without FTZ.)

This would only be correct if *a was 0x00000000, i.e. { 0.f16, 0.f16 }, before.
Then it would (still non-atomically) just replace it with { 1.f16, 1.f16 }.

The formula does not work in subnormal or other special cases either (I’m just trying to find a reason for when it would make sense).
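
A quick worked check of why the bit-pattern add goes wrong in general (a sketch using the cuda_fp16.h reinterpret intrinsics):

#include <cuda_fp16.h>
#include <cstdio>

__global__ void bit_pattern_add_check() {
    unsigned short one = 0x3c00;    // 1.0 in FP16
    unsigned short sum = one + one; // integer add of the bit patterns
    // prints 0x7800 -> 32768.000000, not 2.0 (which would be 0x4000);
    // only when the previous contents were 0x0000 does the add happen to be right
    printf("0x%04x -> %f\n", sum, __half2float(__ushort_as_half(sum)));
}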

And the if (atomicAdd) could return result/value information in the predicate for only one of the two fp16 values anyway.

I guess it is some (wrong) code left over from short2 processing? But even then there could be an overflow between bits 15 and 16?

The actual emulation template is likely more complicated than what we see here.

The CUDA compiler propagates compile-time constants aggressively, which can leave some strange atrophied code branches behind. Such remnants may not be cleaned up completely due to phase ordering issues (or other reasons that I have no insight into). I have seen that effect with calls to sqrtf(), for example, and the remnants of the slow-path handling left behind in that case looked very nonsensical as well.

Historically, NVIDIA has not been in the habit of explaining implementation artifacts like internal emulation templates to the public at large.

If you think the code is functionally incorrect, you would want to prepare a small, self-contained reproducer program that demonstrates the incorrect behavior reliably, then submit a bug report.

I had also considered the possibility that the 3rd prong is for handling a local-space address. That seems less likely to me, however, and local space is explicitly not supported by atomics. But I haven’t looked into the behavior if you pass a generic address to an atomic that happens to be in the local space.


I’ve submitted a bug report linking to this post.


Thanks for filing a bug ticket. This maps to NVBug ID 4783892. We will interact with the requester in the bug ticket and bring the conclusion back here once it is completed.

The following info was shared by the engineering team:

atomicAdd() expects a generic pointer, and hence its implementation considers several cases for completeness. In some cases, the compiler may not be able to determine the state space pointed to by a pointer and eliminate some of these paths as dead code. In this particular case, a similar thing is happening, given how atomicAdd() is implemented.

To get the desired code for this specific case, you can use atom.global directly via inline PTX, which should produce the desired SASS.
Do note: inline asm expects all pointers passed in to be generic pointers, so you need to add a cvta to convert to a state-space-specific pointer.

Rewriting the kernel in the following way will give the desired SASS:

#define __HALF2_TO_CUI(var) *(reinterpret_cast<const unsigned int *>(&(var)))

__global__ void global_atomicAdd_half2(__half2 *a) {
    const __half2 value = __float2half2_rn(1.0f); // atomicAdd(a, value);
    asm volatile("cvta.global.u64 %0, %0;\n atom.global.add.noftz.f16x2 _, [%0], %1;"
                 :: "l"(a), "r"(__HALF2_TO_CUI(value)) : "memory");
}
