CUDA docs of Atomic Functions have code examples with Undefined Behavior

Here:

    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

Type punning is Undefined Behavior as per the C/C++ Standard.

Would it be possible to update the example using compliant code?

Since the quoted code snippet appears inside a __device__ function, this is CUDA code, not C++ code. Which means NVIDIA gets to decide whether this constitutes UB or not. I’d say the best way to address this is to file a bug with NVIDIA so they can either (1) clarify the documentation or (2) change the example code.

Fair enough, will report a bug. Thanks!

I have found this in the docs:

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).
If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved access patterns that prevent these instructions from fully coalescing. It is therefore recommended to use types that meet this requirement for data that resides in global memory.

My interpretation of the above snippet is:

  • If you read/write global memory via a pointer of correct size and alignment, a single r/w instruction will be generated.
  • Otherwise, multiple instructions will be generated and it will therefore be suboptimal.

Do you share the same interpretation? Would you say this is confirmation that CUDA does define the behavior for what would otherwise be Undefined Behavior in regular C/C++, and it’s therefore safe to use?

I acknowledge that type punning is undefined behavior in C++.

If you want to take a strict view, we don’t need to go any farther.

But if you want to now ask if the code in the programming guide has an actual hazard in it, I think you would first need to explain in what respect you think type punning implies a possible hazard.

I would agree that at the machine level, loading a 64 byte type should not be any different whether that 64 bit type is a 64-bit float or a 64-bit integer.

Regarding your interpretation, I don’t share the same interpretation. Specifically, at the machine level, you should not ever attempt to load a non-naturally-aligned item. It is illegal.

Yes, the compiler in some circumstances will convert e.g. a structure load to multiple loads, but in most cases the compiler will not convert a misaligned pointer to multiple loads. The compiler does not know the pointer is misaligned to begin with.

So for POD types, CUDA programmers, even at the C++ source level, must ensure their pointers and load operations are properly naturally aligned. In the general case, the compiler cannot fix a misaligned load.

And this certainly applies to atomics. The compiler cannot convert a misaligned atomic to an aligned one (nor does it analogously convert a single non-aligned atomic to multiple aligned atomics). The pointer used for a 64-bit atomic must be properly naturally aligned to a 64 bit type. Whether that 64-bit type is actually a 64-bit float or a 64-bit int is a separate issue. For operations that do arithmetic, it matters. for atomic operations that don’t do arithmetic (ike atomicCAS, it doesn’t.

Your original objection is fair, in so far as that sort of strict construction goes.

Your concern about the functional correctness of the actual example is not grounded in anything legitimate, as far as I know, and I wouldn’t use the excerpt you found to completely support it. First it applies to ordinary loads, and not atomics. Second, the statements about converting loads to multiple loads is misleading.

Thanks for the detailed answer!

To provide a bit more background, we are writing safety-critical C++ and CUDA code. There, it’s important that the code not only works, but it’s also correct (conformant to standards and other relevant documentation) and reliable. As such, we take great care about Undefined Behavior and this caught our attention.

I realize my question started with atomics but is more general than that. Let me know if I should rather open a new post that leaves atomics aside and is more general.

The question is: “is reinterpret_cast in device code guaranteed by Nvidia/CUDA to have a well-defined behavior, provided that the documented type/alignment requirements are satisfied?

This is not guaranteed by the ISO C++ Standard (even if type/alignment requirements are satisfied), but I acknowledge that the CUDA implementation might be allowed to deviate from it. Thus I’m asking if there’s documentation that explicitly documents these deviations from the ISO C++ Standard.

in what respect you think type punning implies a possible hazard.

In general, when Undefined Behavior is invoked, the compiler can aggressively optimize code to the point where the assembly code no longer corresponds to what the user intended to do. The hazard is therefore having a resulting assembly that does not function as designed. I trust the examples I picked from the docs are fully functional and harmless, but I’m interested in a more general case.

but in most cases the compiler will not convert a misaligned pointer to multiple loads

The documentation provided by Nvidia states the opposite:

If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions

“compiles” in present tense is expressing unconditional behavior. It doesn’t say “may compile”, “might compile”, “can compile”.

The documentation further states:

It is therefore recommended to use types that meet this requirement for data that resides in global memory.

“Recommended” is not the same as “compulsory”. It certainly does not state “illegal” or “undefined” (which is stated in other parts of the documentation). Would it make sense to update the documentation in that regard?

I fully agree with your reasoning about the machine-level instructions, and that developers must be careful about data types and alignment when doing loads. My question is more oriented at a higher level: given that requirements are satisfied, can a developer, in general, get a guarantee that type punning will do what they expect it to do?

Thanks!

I have probably been too sloppy in my use of english language. The two statements in my view are not opposing to each other.

Let’s consider a few examples.

First let’s consider a struct, with or without explicit alignment:

typedef struct /* __align__(16) */  {

  int3 val1;
  int val2;
}  f;

__global__ void k(f *a){
  f t = a[2];
}

As written, the compiler will surely break the load of the value t into at least 2 loads under the hood, probably 3 or 4. This doesn’t have anything to do with the alignment of the base pointer a. (that pointer must have certain alignment characteristics, but even with those requirements satisfed, the compiler will break the above load into ~4 loads at the machine level, i.e. 4 LDG instructions in SASS.)

That is the context in which

is to be interpreted. We’re not particularly interested in the numerical value of the pointer here (although it matters), we are interested in the size and alignment of the structure, ie. the type in question.

Now suppose we remove the comments around the section in the struct definition. I would say that at that point, we have now met the size (16 bytes or less) and alignment (properly aligned to a 16 byte boundary) of the structure, to allow the compiler to do something different. Now the compiler is free to perform the load of t using a single LDG.128 instruction at the SASS level. That is the context in which this statement is to be interpreted:

(emphasis added)

None of that has very much to do with pointers, including a. (Yes, there are requirements for a, but even if you circumvent the requirement somehow, the compiler may still generate a LDG.128 instruction which might then be illegal.)

Second, considering pointers now, when we are talking about the pointer itself, we can certainly break things if we use a pointer that implies/has a certain alignment, in a usage that requires/expects a different alignment. Here is an example:

T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;

(yes, I am aware this example is probably also type-punning. Maybe add it to your list.)

If pitch used above doesn’t meet some requirements, the compiler may attempt to load pElement in such a way (e.g. a single instruction) that causes a machine fault. The compiler will make decisions about code generation that have implicit assumptions about the underlying pointer usage. So if the pointer is misaligned, the compiler does not “fix” that by somehow generating multiple loads, and that is simply an error you made when you wrote your program. That is the context which I had intended for this statement:

I don’t know anywhere in the documentation where that is stated. If that is your need, I suggest filing a bug.

Thanks, that was very enlightening! I understand the docs much better now.

I will then wait for an answer on my bug report :)