If I have a kernel writing multiple values per thread to global memory in sequence, like

int location = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
global[location]     = value1;
global[location + 1] = value2;
global[location + 2] = value3;
global[location + 3] = value4;
global[location + 4] = value5;
global[location + 5] = value6;
global[location + 6] = value7;
global[location + 7] = value8;
where the global memory array and the values are of type float3. So each thread writes 8 float3 values one after another, and consecutive threads write to consecutive addresses.
My question: Can these memory accesses, within and between threads, be coalesced on a Fermi compute capability 2.0 (GeForce GTX 480) device?
In principle, the writes are all consecutive. Each thread writes 8 * 3 * 4 byte = 96 byte, which could be done in one 64-byte and one 32-byte memory transaction per thread. Or thread 0 could be combined with the first 32 bytes of thread 1 into one complete 128-byte transaction, then the remaining 64 bytes of thread 1 with the first 64 bytes of thread 2, and so on.
My nightmare would be that each value is written in its own separate transaction, one after the other.
The CUDA Programming Guide isn’t really clear about this.
Thanks in advance for your answers.