Uncoalesced access to one float element per thread

I’m experimenting with NCU on a transpose kernel in which reads are coalesced and writes are intentionally uncoalesced (no shared memory). NCU makes the following suggestion about the kernel, but I’m not sure I understand it.

The memory access pattern for global stores in L1TEX might not be optimal. On average, this kernel accesses 4.0 bytes per thread per memory request; but the address pattern, possibly caused by the stride between threads, results in 32.0 sectors per request, or 32.0*32 = 1024.0 bytes of cache data transfers per request. The optimal thread address pattern for 4.0 byte accesses would result in 4.0*32 = 128.0 bytes of cache data transfers per request, to maximize L1TEX cache performance.

How can I have 1024 bytes of cache data per request? The cache line for L1TEX is 128 bytes. So, in the worst case, I’ll load one cache line for each float access. That is 128 bytes, not 1024 bytes.

The L1 and L2 caches have 128-byte cache lines, each subdivided into 4 x 32 B sectors.

A 32-bit store (or an 8/16/64/128-bit store) in which each thread accesses a different 32 B sector results in 32 threads x 1 sector/thread x 32 bytes/sector = 1024 bytes of write-through traffic per request. Even worse, these are partial sector writes, which add overhead at both L1 and L2.

Thank you, @Greg. I was doing the math per thread, not per warp. It makes sense now.