I have a beginner’s question. I need to set up an application in which each thread in a warp has to read 32 bytes from global memory. From what I have read, coalescing allows a block of 128 bytes to be transferred through the cache and read by half a warp, so it works fine if each thread in the warp reads 8 bytes.
Does that mean that I need 4 such memory transactions?
Will that significantly degrade performance?
Many thanks in advance.
A thread can read up to 16 bytes in a single load request (for example, one float4 or int4 value).
For best possible performance, you would want to arrange your data such that the data read by adjacent threads is adjacent.
If you need 32 bytes per thread, the best load pattern would be two reads of 16 bytes each. To make this fully optimal, you would need to arrange your data accordingly: the first 16 bytes for every thread are stored adjacently in one array, and the second 16 bytes for every thread are stored adjacently in a separate array.
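As a rough illustration of the two-array arrangement, here is a minimal sketch. It assumes the 32 bytes per thread can be treated as eight floats, which is only a guess about your data. The names sum32Bytes, runSoaExample, lo and hi are made up for this example, and the summation is just a stand-in for whatever work your kernel really does.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread needs 32 bytes, split into two float4
// values that live in two separate arrays (lo and hi).
__global__ void sum32Bytes(const float4* __restrict__ lo,
                           const float4* __restrict__ hi,
                           float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 a = lo[i];  // first 16-byte load; thread i+1 reads the next 16 bytes
        float4 b = hi[i];  // second 16-byte load, same adjacency in the other array
        out[i] = a.x + a.y + a.z + a.w + b.x + b.y + b.z + b.w;
    }
}

// Hypothetical host helper: fills the arrays, launches the kernel, checks results.
bool runSoaExample(int n)
{
    float4 *lo = nullptr, *hi = nullptr;
    float *out = nullptr;
    // Managed memory keeps the sketch short; cudaMalloc + cudaMemcpy works too.
    if (cudaMallocManaged(&lo, n * sizeof(float4)) != cudaSuccess) return false;
    if (cudaMallocManaged(&hi, n * sizeof(float4)) != cudaSuccess) return false;
    if (cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        lo[i] = make_float4(1.0f, 1.0f, 1.0f, 1.0f);
        hi[i] = make_float4(2.0f, 2.0f, 2.0f, 2.0f);
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    sum32Bytes<<<blocks, threads>>>(lo, hi, out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // Each thread should have summed 4 * 1.0 + 4 * 2.0 = 12.0.
    for (int i = 0; ok && i < n; ++i) {
        if (out[i] != 12.0f) ok = false;
    }

    cudaFree(lo);
    cudaFree(hi);
    cudaFree(out);
    return ok;
}

int main()
{
    printf("%s\n", runSoaExample(1 << 20) ? "ok" : "mismatch");
    return 0;
}
```

With this layout, the 32 threads of a warp read 32 x 16 = 512 contiguous bytes on each of the two loads. If the same data were stored as one 32-byte struct per thread, the two loads of adjacent threads would be 32 bytes apart instead of 16, so the accesses of a warp would be strided rather than contiguous.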
There are various questions on the web discussing this kind of optimality with respect to AoS vs. SoA data organization.
Mr. Robert, thanks for your reply.