Alignment requirements, shared memory

alexeifinski337 · August 30, 2024, 10:04pm

The above CUDA example in https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/ raises questions:

Is the start of the dynamically sized array located at the strictest alignment requirement, including for vectorized int4, or at the alignment requirement of int?
Is the conversion of int* to float* a UB, because the alignment requirement of float may be stricter on a system? This would be a UB even at a suitable location in memory.
Would the dereferencing of float* violate the strict aliasing rules, because the effective type is int. If the array object has no effective type, why is there “int”?

Robert_Crovella · August 30, 2024, 11:13pm

I don’t know that there is a documented guarantee anywhere that the dynamically allocated shared memory pointer will be aligned to anything larger than the type you specify. In that case, you could do this:

extern __shared__ float4 s[];

That then demonstrably has the necessary alignment, or CUDA is fundamentally broken.

A similar question is here.

I imagine various CUDA examples violate the strict aliasing rule. I’m not a language expert most regards. However I believe you can find more examples in CUDA documentation or blogs of this type of casting. For example I imagine that another one is here. Another example is discussed here. I’ve never worried about it, as long as you respect the requirement for natural alignment on access, as well as of course allocating enough space to begin with.

alexeifinski337 · August 30, 2024, 11:51pm

Thanks. Is there a guarantee that i) double4 has the strictest alignment requirement or ii) 32 is the strictest alignment requirement, such that a subarray of any other type can start at this boundary? Having consistency in code at the expense of not using a few bytes may be a meaningful tradeoff.

Robert_Crovella · August 31, 2024, 12:07am

I may be misinterpreting what you are asking. You might want to become familiar with the device memory access section of the programming guide, and also note the alignment requirements for vector types.

From the device memory access section:

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes.

(and I don’t believe shared memory accesses can exceed 16 bytes per transaction/“wavefront” - per thread.)

Therefore, for me, anyway, this informs as to why the alignment requirement for double4 can be 16, not 32.

Also note:

The alignment requirement is automatically fulfilled for the Built-in Vector Types.

However, that statement should not be abused. “Forcefully” misaligning a pointer, then casting that pointer to a vector type, is not covered by the above stated “requirement is automatically fulfilled”.

Based on the above treatment, I personally would have no concerns in CUDA about taking the base dynamically allocated shared pointer that had a declared type of e.g. int4, and using that starting address as the starting address for any of the other types listed in that table of built-in vector types, or the starting address of an array of one of those types. That’s my view of the CUDA world, I’m not making a language-lawyer statement or a C++ statement.

This should also not be construed as “you can’t load a double4 per thread”. You can. But the compiler, under the hood, will split that into (at least) 2 machine instructions. Thus we end up at 16 as the “strictest” alignment requirement for the CUDA GPU.

alexeifinski337 · August 31, 2024, 1:00am

Apologies. Yes, the alignment requirement for double4 is 16.

According to the statement “Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.”, double4 in a double4 array is aligned at 32 boundary, because there is no padding in an array, but has alignment requirement 16.

Based on our discussion, there appears to be no guarantee that the start of a dynamically sized/allocated array in shared memory is aligned at >= 256 bytes.

And the maximum transaction size would set the strictest alignment requirement for basic and vector types to 16 in shared memory.

Robert_Crovella · August 31, 2024, 1:03am

By my read, that doesn’t cover dynamically allocated shared memory.

Not sure why that would be needed, ever. An alignment of greater than 32 bytes could be needed or useful in global memory in some situations, perhaps. I’m not aware of any concept like that for shared memory usage.

In my comments here, I’ve tried to stay close to what can be deduced from documentation and written sources. It would not surprise me at all if the actual dynamic shared allocation mechanism had an alignment that is larger than e.g. 16 bytes. But I don’t know how to deduce that from documentation, and I’m unable to identify a situation where it would be needed.

alexeifinski337 · August 31, 2024, 1:32am

Having an alignment requirement higher than 16 may not be currently useful in shared memory.

However, knowing that every dynamically allocated array in shared memory, regardless of the declared pointer type, starts at the strictest practical alignment requirement is useful.

Curefab · August 31, 2024, 2:52pm

In those cases I would prefer to create a struct type.

extern __shared__ struct {
    int integerData[nI];
    float floatData[nF];
    char charData[nC];
} s;

Or are nI, nF and nC no compile-time constants?

Then you could do an array of unions of int, float and char4, which should all have the same size and alignment.

typedef union {
    int integerData;
    float floatData;
    char4 charData;
} ifc4;

extern __shared__ ifc4 s[]; // size: nI + nF + (nC + 3) / 4
int* integerData = &s[0].integerData;
float* floatData = &s[nI].floatData;
char* charData = &s[nI + nF].charData;

This could also have UB, because just pointing to the first floatData element does not make all floatData elements the active union elements. Also, if e.g. nC is 0, we could point behind the length of the array.

There are some types, which you can legally cast from in C++ in any case: char, unsigned char and byte (signed char is not part of the general list, although on many platforms char may be defined as signed char).
So making on of those the array type together with alignas(4) should also work.

extern __shared__ alignas(4) std::byte s[]; // size: 4 * nI + 4 * nF + nC
int* integerData = reinterpret_cast<int*>(&s[0]);
float* floatData = reinterpret_cast<float*>(&s[4 * nI]);
char* charData = reinterpret_cast<char*>(&s[4 * nI + 4 * nF]);

This probably is the cleanest solution in regards to UB.

alexeifinski337 · August 31, 2024, 5:38pm

Yes. There are workarounds. For example one can have a struct with different member types, which are the types of the to-be-created subarrays, and then dynamically allocate an array in shared memory according to the alignment requirement of the struct, which would be the strictest alignment requirement of any member types of the struct. One can then consistently use the sizeof size of the struct, which would be padded to meet the alignment requirement of the member type with the strictest alignment requirement, for selecting the locations of the subarrays of each type in the allocated array. This solution only requires knowing the types of the sub-arrays at compile-time.

Having a guarantee is better. Memory alignment is important and complete information is needed. The 256 byte statement was provided with respect to global memory. A similar statement is needed with respect to shared memory.

In the following post basic type arrays in global memory were accessed through vectorized types. Does this always work in shared memory? Can I always access the start of a dynamically allocated array of a basic type in shared memory with a corresponding vectorized type with a stricter alignment requirement?

alexeifinski337 · September 2, 2024, 4:35am

In the below public repository, an assumption was made that the start of a dynamically allocated array in shared memory was aligned at >= 16, although the pointer type was char*.

Are all dynamic allocations in shared memory aligned at >= 16?

Curefab · September 2, 2024, 9:36am

You probably should not assume that.
In the mentioned case it could be that the dynamic shared memory started at offset 0.

Just use alignas.

alexeifinski337 · September 2, 2024, 3:45pm

Hardcoding alignments is not a general solution. It may not be portable and is error-prone. Alignment requirements can change with new architectures.

The above struct solution does not depend on knowing alignment requirements or hardcoding alignments. It always satisfies the alignment requirements for the types chosen at compile-time, even if such alignment requirements change. However, it creates an asymmetry between the use of the vectorized types in global memory and shared memory.

If there was a stated guarantee that the start of a dynamic allocation in shared memory starts at the strictest alignment requirement, then there would not be a need to deal with alignments at all. The alignment requirement of type T is always less or equal to sizeof(T) due to no padding between the elements in an array. One could completely rely on type sizes in computing the positions of subarrays (with a possibility of over-alignment and not using a few bytes).

The mentioned repository example does not look correct, unless there exists an alignment guarantee for a dynamic allocation in shared memory that is not stated in the documentation.

Topic		Replies	Views
Memory coalescing and multiple arrays CUDA Programming and Performance	23	11640	March 20, 2009
Using Shared Memory in CUDA C/C++ Technical Blog	36	1922	October 8, 2020
Structure Alignment? CUDA Structure Alignment differs? CUDA Programming and Performance	12	49183	December 11, 2008
Reading from float4 corrupts memory CUDA Programming and Performance cuda	18	34	October 3, 2024
CUDA Memory Consistency CUDA Programming and Performance	23	55438	March 8, 2007
Question about extern __shared__ CUDA Programming and Performance	4	1920	March 10, 2013
CUDA docs of Atomic Functions have code examples with Undefined Behavior CUDA Programming and Performance	7	737	June 17, 2022
Coalesced access to global memory for double4 CUDA Programming and Performance	8	3361	September 8, 2015
CUDA Pro Tip: Increase Performance with Vectorized Memory Access Technical Blog	23	2313	December 11, 2024
Constant memory allocation and initialization CUDA Programming and Performance	12	82419	November 20, 2010

Alignment requirements, shared memory

Related topics