Alignment Requirement for Single Load Instructions

Hi all,

I have some questions regarding the right access pattern to get maximum memory bandwidth.

On page 44, the programming guide says that the device is capable of reading 32-bit, 64-bit, and 128-bit words from global memory into registers in a single load instruction for assignments such as

__device__ type device[32];

type data = device[tid];

when sizeof(type) is equal to 4, 8, or 16 bytes and when variables of type type are aligned to 4, 8, or 16 bytes (that is, have the 2, 3, or 4 least significant bits of their address equal to zero).

  1. I understand the first constraint, but I do not know how to ensure the second. For instance, if type is float, the first requirement is met, but I cannot tell whether the second one is fulfilled. How do I meet the second requirement, and if that is not possible with float, how do I convert the data?

  2. Does this also apply when loading data from global memory into shared memory?

Thanks in advance for the answer!


  1. Memory allocated with cudaMalloc is always at least 256-byte aligned (i.e., the lowest 8 bits of the address are zero). This means you can store an array of any type whose size is 1, 2, 4, …, 64, or 128 bytes there and every element will still be aligned to its size. A float array therefore already meets the second requirement.
    If you use structures, you should also make sure their size is a power of two, or use the __align__(n) specifier.

  2. The alignment requirement applies to both shared and global memory.
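
To make the two points above concrete, here is a minimal sketch in CUDA. The names Packed3, Padded4, isAligned, and copyKernel are made up for illustration and do not come from the programming guide. It checks the low address bits of a cudaMalloc pointer and contrasts a plain 12-byte struct with a 16-byte struct declared with __align__(16). Whether the compiler really emits a single 128-bit load for the copy depends on the compiler and architecture, so treat this as an illustration rather than a guarantee.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example types, not taken from the thread.
struct Packed3 { float x, y, z; };                   // 12 bytes, 4-byte aligned
struct __align__(16) Padded4 { float x, y, z, w; };  // 16 bytes, 16-byte aligned

static_assert(sizeof(float) == 4, "float meets the size requirement");
static_assert(sizeof(Packed3) == 12, "12 bytes is not 4, 8, or 16");
static_assert(sizeof(Padded4) == 16, "size requirement met");
static_assert(alignof(Padded4) == 16, "alignment requirement met");

// True when p is a multiple of n (n must be a power of two),
// i.e. when the low log2(n) bits of the address are zero.
inline bool isAligned(const void* p, std::uintptr_t n) {
    return (reinterpret_cast<std::uintptr_t>(p) & (n - 1)) == 0;
}

// Each thread copies one 16-byte element; size and alignment
// allow the compiler to use one load per element.
__global__ void copyKernel(const Padded4* in, Padded4* out, int count) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < count) out[tid] = in[tid];
}

int main() {
    const int count = 32;
    Padded4* in = nullptr;
    Padded4* out = nullptr;
    if (cudaMalloc((void**)&in, count * sizeof(Padded4)) != cudaSuccess ||
        cudaMalloc((void**)&out, count * sizeof(Padded4)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, count * sizeof(Padded4));

    // Only pointer arithmetic on the host here, no dereference.
    std::printf("base pointer 256-byte aligned: %d\n", isAligned(in, 256));
    std::printf("element 5 16-byte aligned:     %d\n", isAligned(in + 5, 16));

    copyKernel<<<1, count>>>(in, out, count);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

For a plain float array nothing extra is needed: the base address from cudaMalloc is a multiple of 256 and each element sits at a multiple of 4 bytes from it. Packed3, by contrast, fails the size requirement, so padding it to 16 bytes and adding __align__(16) is what gives the compiler the chance to use a single load.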