Assumptions - for all the questions below I assume a 32-bit word size (data bus size) and data elements of either 16 bits or 32 bits.
Memory alignment structure - 32 bytes or 64 bytes.
Question 1 -
Assumption for Question 1 - the data being operated upon is 16 bits wide, the word size (data bus size) is 32 bits, and the memory alignment structure is 64 bytes.
In the CUDA Handbook, it is mentioned: “Reading or writing 16-bit words is always uncoalesced.” I am not able to figure out why.
I believe data of size 16 bits can achieve memory coalescing.
My thought process -
16 bits is equivalent to 2 bytes. Suppose the memory alignment structure in the GPU is 64 bytes. Then 32 variables of size 2 bytes (16 bits) fit exactly into one aligned 64-byte block, so successive threads can read data from successive memory locations. For example, thread_0 reads from the nth memory location, thread_1 reads from the (n + sizeof(variable_being_operated))th memory location, and so on, where sizeof(variable_being_operated) = 16 bits or 2 bytes. So it seems to me that with 64-byte memory alignment, a 32-bit word size and a 16-bit data size, coalescing can be achieved.
Obviously I am wrong, because the CUDA Handbook says that coalescing cannot be achieved for a 16-bit data size. What am I missing in my understanding?
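To make the arithmetic in my reasoning concrete, here is a rough sketch in plain Python (not CUDA). It only models which bytes each thread would touch under my assumptions: 32 threads, 2-byte elements, a 64-byte block, and a base address that is itself 64-byte aligned. The function names are made up by me for illustration, and the sketch says nothing about what the hardware actually does with these accesses, which is exactly what I am asking about.

```python
def thread_addresses(base, elem_size, num_threads):
    """Byte address read by each thread when thread i reads element i."""
    return [base + i * elem_size for i in range(num_threads)]


def segments_touched(addresses, elem_size, segment_size):
    """Indices of the aligned segments that cover every byte the threads read."""
    segments = set()
    for addr in addresses:
        first = addr // segment_size                    # segment of the first byte
        last = (addr + elem_size - 1) // segment_size   # segment of the last byte
        segments.update(range(first, last + 1))
    return segments


# My assumptions: 32 threads, 16-bit (2-byte) elements, 64-byte segments,
# base address aligned to 64 bytes (0 here).
addrs = thread_addresses(base=0, elem_size=2, num_threads=32)
print(addrs[:4])                                                     # [0, 2, 4, 6]
print(len(segments_touched(addrs, elem_size=2, segment_size=64)))    # 1
```

If I shift the base address to 2, the same 32 reads spill into a second 64-byte segment, which is why I stated the alignment assumption explicitly.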
Question 2 -
Assumption for Question 2 - the data being operated upon is 32 bits wide, the word size (data bus size) is 32 bits, and the memory alignment structure is 32 bytes.
In the CUDA Handbook, under the topic COALESCING constraints, they mention "for successful coalescing, for a 32 bit word, the Memory Alignment criteria should be 64 bytes." I am not able to figure out why.
But I believe that for a 32-bit data size, coalescing can be achieved with 32-byte memory alignment.
My thought process - assume the memory alignment structure is 32 bytes and the data being operated upon is 32 bits (4 bytes) wide. Eight 4-byte variables fit exactly within one aligned 32-byte block, so successive threads can read data from successive memory locations. For example, thread_0 reads from the nth memory location, thread_1 reads from the (n + sizeof(variable_being_operated))th location, and so on, where sizeof(variable_being_operated) = 32 bits or 4 bytes. So it seems to me that with 32-byte memory alignment, a 32-bit word size and a 32-bit data size, coalescing can be achieved.
Obviously I am wrong, because the CUDA Handbook says that for a 32-bit variable size the memory alignment needs to be 64 bytes to achieve coalescing. What am I missing in my understanding?
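Again, to show the arithmetic I have in mind, here is a rough sketch in plain Python (not CUDA). It groups thread ids by the aligned block that holds their 4-byte element, assuming 32 threads, a base address of 0, and elements that never straddle a block boundary. The function name and parameters are my own, purely for illustration, and this is only address bookkeeping, not a model of how the GPU issues memory transactions.

```python
from collections import defaultdict


def threads_per_block(base, elem_size, num_threads, block_size):
    """Group thread ids by the aligned block containing their element.

    Assumes each element lies entirely inside one block, i.e. base and
    block_size are both multiples of elem_size.
    """
    groups = defaultdict(list)
    for tid in range(num_threads):
        addr = base + tid * elem_size          # thread tid reads element tid
        groups[addr // block_size].append(tid)
    return dict(groups)


# My assumptions: 32 threads, 32-bit (4-byte) elements, 32-byte blocks.
groups = threads_per_block(base=0, elem_size=4, num_threads=32, block_size=32)
print(len(groups))   # 4
print(groups[0])     # [0, 1, 2, 3, 4, 5, 6, 7]
```

With block_size=64 the same call gives 2 groups of 16 threads each, so the only thing that changes in my model is how many consecutive threads share one aligned block.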