L1 Cache Effective Bandwidth


It does? It might depend on the GPU. I suspect that on some GPUs the smallest access granularity here is not 128 bytes but 32 bytes.

From the Pascal tuning guide:

On Pascal the data access unit is 32B regardless of whether global loads are cached in L1. So it is no longer necessary to turn off L1 caching in order to reduce wasted global memory transactions associated with uncoalesced accesses.

I think Pascal and beyond have this characteristic, but I haven’t scrubbed all the tuning guides.

I’m not aware of any method to get below a 32B access granularity to global memory, so I think that is the bound.

L1 and L2 on CC 3.x - 9.x have 128-byte cache lines composed of 4 x 32-byte sectors. L1 accesses L2 in units of 32-byte sectors.
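To make the line/sector distinction concrete, here is a rough host-side sketch that counts how many 32-byte sectors and 128-byte lines a single warp load would touch. It is plain arithmetic, not a measurement: the function name `granules_touched`, the 32-thread warp, and the simple "thread t reads `elem_bytes` at `base + t * stride_bytes`" pattern are my own assumptions for illustration.

```cpp
#include <cstdint>

// Sizes taken from the post above; everything else here is illustrative.
constexpr std::uint64_t kSectorBytes = 32;   // one sector
constexpr std::uint64_t kLineBytes   = 128;  // one cache line = 4 sectors
constexpr int           kWarpSize    = 32;   // assumed warp width

// Counts the distinct granules (sectors or lines) touched when thread t of a
// warp reads elem_bytes starting at base + t * stride_bytes.
// Addresses never decrease with t, so it is enough to remember the first
// granule index that has not been counted yet.
constexpr std::uint64_t granules_touched(std::uint64_t base,
                                         std::uint64_t stride_bytes,
                                         std::uint64_t elem_bytes,
                                         std::uint64_t granule_bytes) {
    std::uint64_t count = 0;
    std::uint64_t next_unseen = 0;  // lowest granule index not yet counted
    for (int t = 0; t < kWarpSize; ++t) {
        const std::uint64_t addr = base + t * stride_bytes;
        const std::uint64_t lo = addr / granule_bytes;
        const std::uint64_t hi = (addr + elem_bytes - 1) / granule_bytes;
        const std::uint64_t start = lo > next_unseen ? lo : next_unseen;
        if (hi >= start) {
            count += hi - start + 1;  // only the granules not seen before
        }
        if (hi + 1 > next_unseen) {
            next_unseen = hi + 1;
        }
    }
    return count;
}
```

For a fully coalesced 4-byte load (`stride_bytes = 4`) this gives 4 sectors in 1 line, so both granularities move the same 128 bytes. For a 128-byte stride it gives 32 sectors in 32 lines: 1024 bytes at 32B granularity versus 4096 bytes if whole lines had to be fetched, which is the waste the tuning guide quote is talking about.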

Return throughput from L1/SHM can vary with the GPU. The standard rates on existing GPUs are:

  • shared/local/global returns 128 bytes/cycle
  • texture/surface returns 64 bytes/cycle
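As a back-of-the-envelope sketch, those per-cycle rates can be turned into an aggregate peak figure. This assumes the rates apply per SM (the post does not say so explicitly) and that every SM returns data on every cycle. The helper name `peak_l1_return_gbps` and the SM count and clock in the usage note are hypothetical placeholders, not the specs of any particular GPU.

```cpp
// Return rates from the list above, assumed here to be per SM per cycle.
constexpr double kGlobalBytesPerCycle  = 128.0;  // shared/local/global
constexpr double kTextureBytesPerCycle = 64.0;   // texture/surface

// Upper bound on L1 return bandwidth in GB/s:
//   bytes/cycle/SM * number of SMs * clock in GHz (cycles per nanosecond).
// sm_count and clock_ghz are inputs the caller has to supply for their GPU.
constexpr double peak_l1_return_gbps(double bytes_per_cycle,
                                     int sm_count,
                                     double clock_ghz) {
    return bytes_per_cycle * sm_count * clock_ghz;
}
```

For a made-up part with 80 SMs at 1.5 GHz, `peak_l1_return_gbps(kGlobalBytesPerCycle, 80, 1.5)` works out to 15360 GB/s and the texture path to 7680 GB/s. Real kernels will land below this bound, since it ignores bank conflicts, misses, and anything else that stalls the pipeline.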