Basic Question

Dear All

A basic question:

Does __align__(n) on an array in CUDA mean that the first element of the array is aligned to n bytes, or that each element of the array is aligned to n bytes?


Luis Gonçalves

In CUDA, the __align__ directive is documented for structures.

I’m not aware of any other documented usage. njuffa or others may know otherwise.

The starting point of an array of basic types should always be aligned by the compiler as required by the element type (assuming you are not doing illegal pointer arithmetic to offset the start of the array). Thereafter, each element of the array of basic types will be naturally aligned, by definition.

The starting point and structure element size of an array of structures will be chosen by the compiler, but you can modify them by applying the __align__ directive to the structure type. Used this way, the __align__ directive essentially forces each structure (element) in the array to fall on the specified boundary, which also pads the structure size up to a multiple of that boundary.
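To make the struct case concrete, here is a minimal host-side sketch. It is not taken from the CUDA documentation: the names Plain, Padded, stride_bytes and check_strides are made up for illustration, and the fallback macro that maps __align__ to alignas is only an assumption so the code also builds without nvcc. The sizes in the static_asserts assume a typical platform with 4-byte floats.

```cpp
#include <cassert>
#include <cstddef>

#ifndef __CUDACC__
// Assumption: stand-in so this host-only sketch also builds without nvcc.
#define __align__(n) alignas(n)
#endif

// Natural layout: two floats, 8 bytes, 4-byte alignment.
struct Plain { float r; float i; };

// Hypothetical over-aligned variant: every object starts on a 16-byte boundary.
struct __align__(16) Padded { float r; float i; };

static_assert(sizeof(Plain) == 8 && alignof(Plain) == 4, "natural layout");
static_assert(sizeof(Padded) == 16 && alignof(Padded) == 16, "padded layout");

// Distance in bytes between consecutive elements of an array of T.
template <typename T>
std::size_t stride_bytes() {
    T arr[2] = {};
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&arr[1]) -
        reinterpret_cast<const char*>(&arr[0]));
}

// Over-aligning the type doubles the per-element footprint here.
inline void check_strides() {
    assert(stride_bytes<Plain>() == 8);
    assert(stride_bytes<Padded>() == 16);
}
```

The point of the sketch is that alignment applied to the type changes sizeof, so every element of an array of that type lands on the boundary, at the cost of padding.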

The purpose of the __align__ directive is to over-align data, i.e. align it to a larger boundary than would normally be picked by the compiler. For any aggregate type, the compiler will choose the alignment necessary to read all components of that type successfully under the restrictions imposed by the architecture.

It is curious that the CUDA documentation only mentions __align__ in conjunction with structs. Alignment storage class modifiers exist in pretty much all toolchains, and they can normally be applied to any aggregate object, e.g.:

__align__(8) uint8_t buffer[256];

I have never tried this in CUDA, so I do not know whether the above actually works (and more importantly, is designed to work) in a CUDA program. Alignment storage class modifiers are top-level modifiers, so the semantics of the above array declaration would be that the start of the byte-buffer is aligned to an address that is a multiple of 8, not that every array element is placed at an address divisible by 8.
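The top-level semantics described above can be checked on the host with a small sketch. This only exercises the host compiler and says nothing about whether device code is designed to behave the same way. The helper names is_aligned and buffer_stride are hypothetical, and the fallback macro is again just an assumption so the code builds without nvcc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#ifndef __CUDACC__
// Assumption: stand-in so this host-only sketch also builds without nvcc.
#define __align__(n) alignas(n)
#endif

// True when p sits on an n-byte boundary.
inline bool is_aligned(const void* p, std::size_t n) {
    return reinterpret_cast<std::uintptr_t>(p) % n == 0;
}

// Declares the byte buffer from the post and returns its element stride.
inline std::size_t buffer_stride() {
    __align__(8) std::uint8_t buffer[256] = {};

    // The modifier applies to the array object, so its start is 8-byte aligned.
    assert(is_aligned(&buffer[0], 8));
    // The array itself is not padded: still 256 one-byte elements.
    static_assert(sizeof(buffer) == 256, "array size is unchanged");
    // The second element follows immediately, so it cannot be 8-byte aligned.
    assert(!is_aligned(&buffer[1], 8));

    return static_cast<std::size_t>(&buffer[1] - &buffer[0]);
}
```

If the same holds for an array of structs in a kernel, the array size stays at the element count times sizeof of the element, and only the base address moves to the requested boundary.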

Is the original question of a philosophical nature, or is there an actual programming issue that needs to be addressed?

Dear All

I have something like this in a kernel:

typedef struct {
    float r;
    float i;
} complex1;

__global__ void kernel(xxxxxxxx)
{
    __align__(64) complex1 a[xxx];
    __align__(64) complex1 b[xxx];
    __align__(64) complex1 c[xxx];
    __align__(64) complex1 d[xxx];
    // rest of the kernel
}


With the above __align__ I obtained runtime gains, and I wanted to know what was happening (I tried 8, 16, 32 and 64; 64 works best). Is only the first element aligned, or are all elements aligned, in which case I would be using a lot of memory?


Luis Gonçalves