I have a beginner’s question. I need to set up an application in which each thread in a warp reads 32 bytes from global memory. From what I have read, coalescing lets the reads of a half-warp be served through the cache as a single 128-byte transaction, so it works fine if each thread reads 8 bytes (16 threads × 8 bytes = 128 bytes).
Does that mean that I need 4 such memory transactions per half-warp (16 threads × 32 bytes = 512 bytes)?
Will that significantly degrade performance?
Many thanks in advance