I have an image in NHWC layout. The images I am processing have 3 channels (RGB), so the three values of each pixel are stored contiguously in memory. The kernel is written such that each thread processes the 3 RGB values of one pixel. Assume each element is a float (4 bytes). If each thread reads one value at a time, so that all threads first read their red value, then green, then blue, then consecutive threads in a warp access addresses 12 bytes apart on every load, which leads to uncoalesced (strided) global memory access for both reads and writes. So I am wondering whether there is a way to issue a load for all 3 elements (12 contiguous bytes) in one transaction, so that the reads and writes to global memory become coalesced.
To demonstrate in code, simply doing this would lead to uncoalesced reads:
// idx: index of the first (R) float of this thread's pixel
float R = in[idx];
float G = in[idx + 1];
float B = in[idx + 2];
So is there a way to read or write 3 float elements (or any number of contiguous bytes) from or to global memory in a single transaction?
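To put rough numbers on the strided pattern above, here is a small host-side sketch (plain C++, no GPU needed) that counts how many memory sectors one warp-wide load touches. It assumes a 32-thread warp, 32-byte sectors, and a buffer that starts on a sector boundary. The function name `sectors_touched` and the constants are made up for illustration and are not part of my actual kernel.

```cpp
#include <cstddef>
#include <set>

// Assumptions for this sketch (illustrative values):
constexpr int kWarpSize = 32;              // threads issuing the load together
constexpr std::size_t kSectorBytes = 32;   // assumed transaction granularity
constexpr std::size_t kElemBytes = 4;      // one float

// Counts the distinct sectors touched by one warp-wide load in which
// thread t reads the float at element index (t * stride_elems + offset_elems).
// The buffer is assumed to begin on a sector boundary.
int sectors_touched(std::size_t stride_elems, std::size_t offset_elems) {
    std::set<std::size_t> sectors;
    for (int t = 0; t < kWarpSize; ++t) {
        const std::size_t first_byte =
            (static_cast<std::size_t>(t) * stride_elems + offset_elems) * kElemBytes;
        const std::size_t last_byte = first_byte + kElemBytes - 1;
        sectors.insert(first_byte / kSectorBytes);
        sectors.insert(last_byte / kSectorBytes);
    }
    return static_cast<int>(sectors.size());
}
```

Under these assumptions, `sectors_touched(3, 0)` (the R load in my snippet) gives 12 sectors, and the G and B loads (`offset_elems` of 1 and 2) give 12 each as well, whereas a fully contiguous load, `sectors_touched(1, 0)`, gives 4. So each of the three loads fetches 384 bytes to use 128 of them, which is the inefficiency I would like to avoid.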