I’m reading the CUDA C++ Best Practices Guide, and it says:
“For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.”
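To check that I’m reading that sentence correctly, here is a minimal host-side sketch of how I understand the counting. It is only my interpretation of the quoted rule, not NVIDIA’s actual hardware model. The names `count_32byte_transactions` and `warp_addresses` are hypothetical helpers I made up, not CUDA APIs, and I’m assuming a 32-thread warp and an access size of at least 1 byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper (not a CUDA API): counts the distinct 32-byte segments
// touched by one warp-wide access, given the byte address used by each
// participating thread. Assumes access_size >= 1.
std::size_t count_32byte_transactions(const std::vector<std::uint64_t>& addrs,
                                      std::size_t access_size) {
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addrs) {
        // An access can straddle a segment boundary, so cover every segment
        // from its first byte to its last byte.
        std::uint64_t first = a / 32;
        std::uint64_t last = (a + access_size - 1) / 32;
        for (std::uint64_t s = first; s <= last; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}

// Hypothetical helper: byte addresses for a 32-thread warp in which thread t
// accesses element (t * stride) of an array that starts at byte offset `base`.
std::vector<std::uint64_t> warp_addresses(std::uint64_t base,
                                          std::size_t elem_size,
                                          std::size_t stride) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t t = 0; t < 32; ++t) {
        addrs.push_back(base + t * stride * elem_size);
    }
    return addrs;
}
```

Under this reading, `count_32byte_transactions(warp_addresses(0, 4, 1), 4)` gives 4 for a warp writing 32 consecutive aligned floats (128 bytes), a stride of 2 floats gives 8, and shifting the base to byte 4 gives 5 because the warp’s 128 bytes then spill into a fifth segment. Consecutive aligned float4s (element size 16) give 16.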
I’m wondering what the criteria are for memory accesses to be grouped into the same transaction.
I get that writing float4s to consecutive locations is probably all grouped, but what about something like:
    if (threadIdx.x % 2 == 0) {
        arr[threadIdx.x] = threadIdx.x;
    } else {
        arr[threadIdx.x] = threadIdx.x;
    }
I get that the compiler is probably smart enough to merge the two branches in this example, but as this kind of divergence gets more complex, is there a general rule for whether writes are delayed to see if they can all coalesce? I could imagine a warp issuing every write that doesn’t depend on a pending read and then expecting to stall while they all go out to memory. Does anyone know if it works something like this?