Load/store a register array from/to global memory

I have an image in NHWC layout. The images I am processing have 3 channels (RGB), and because the layout is NHWC, the three RGB values of each pixel are stored contiguously. The kernel is written such that each thread processes the 3 RGB values of one pixel. Let’s assume each element is a float (4 bytes). If each thread reads one value at a time (all threads first read Red, then Green, then Blue), neighbouring threads access addresses 12 bytes apart, which leads to uncoalesced memory access for both reads and writes. So I am wondering whether there is a way to issue a load for all 3 elements (12 contiguous bytes of memory) in one transaction, so that the reads and writes to global memory become coalesced.

To demonstrate in code, simply doing this would lead to uncoalesced reads:

float R = in[idx];      // idx is the offset of this thread's pixel (a multiple of 3)
float G = in[idx + 1];
float B = in[idx + 2];

So is there a way to issue a read/write of 3 float elements (or any number of contiguous bytes) from/to global memory?

No. A device memory access by a single thread can be 1, 2, 4, 8, or 16 bytes wide, and the address must be naturally aligned to that size. This is documented in the “Device Memory Accesses” section of the programming guide. You cannot do 12 bytes per thread in one access.
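Since 12-byte accesses are not available, one common workaround is to let the block cooperatively move its pixels with 16-byte (float4) accesses and stage them in shared memory, after which each thread picks up its own R, G, B values. The following is only a minimal sketch of that idea, not something taken from the programming guide: the kernel name scale_rgb, the gain parameter, the helper run_demo, and the block size kThreads are all hypothetical, and it assumes the pixel count is a multiple of kThreads and that the buffers are 16-byte aligned (which holds for cudaMalloc allocations).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;                 // pixels per block (assumed block size)
constexpr int kFloats  = kThreads * 3;        // floats per block: 384
constexpr int kVec4    = kFloats / 4;         // float4 elements per block: 96

// Hypothetical kernel: multiplies every RGB value by `gain`.
// Assumes num_pixels % kThreads == 0 and 16-byte aligned `in` / `out`.
__global__ void scale_rgb(const float* in, float* out, int num_pixels, float gain)
{
    __shared__ __align__(16) float tile[kFloats];
    float4* tile4 = reinterpret_cast<float4*>(tile);

    // Uniform per-block guard, so every thread of a block takes the same path.
    if ((blockIdx.x + 1) * kThreads > num_pixels) return;

    // Each block starts at a float offset that is a multiple of 384,
    // so the byte offset is a multiple of 16 and the float4 cast is aligned.
    const int base = blockIdx.x * kFloats;
    const float4* in4  = reinterpret_cast<const float4*>(in + base);
    float4*       out4 = reinterpret_cast<float4*>(out + base);

    // Stage 1: consecutive threads load consecutive 16-byte chunks.
    for (int i = threadIdx.x; i < kVec4; i += blockDim.x)
        tile4[i] = in4[i];
    __syncthreads();

    // Stage 2: each thread processes the 3 values of its own pixel.
    const int p = 3 * threadIdx.x;
    tile[p + 0] *= gain;   // R
    tile[p + 1] *= gain;   // G
    tile[p + 2] *= gain;   // B
    __syncthreads();

    // Stage 3: consecutive threads store consecutive 16-byte chunks.
    for (int i = threadIdx.x; i < kVec4; i += blockDim.x)
        out4[i] = tile4[i];
}

// Runs the kernel on synthetic data and returns the number of mismatches.
int run_demo()
{
    const int num_pixels = 8 * kThreads;
    const int n = num_pixels * 3;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale_rgb<<<num_pixels / kThreads, kThreads>>>(d_in, d_out, num_pixels, 2.0f);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++mismatches;

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_demo());
    return 0;
}
```

In this sketch the global-memory traffic consists only of aligned 16-byte accesses by consecutive threads, while the stride-3 accesses happen in shared memory, where a stride of 3 words does not cause bank conflicts on 32-bank hardware. Whether this is actually faster than the three plain 4-byte loads depends on the GPU and the kernel, so it is worth profiling both versions.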