Although I’ve googled for quite a while, I haven’t found how exactly NVIDIA GPUs behave when a global store misses the L2 cache.
As far as I know, the L2 cache of CUDA GPUs is a write-allocate/write-back cache with a cache line size of 128 bytes. Moreover, each cache line consists of four 32-byte segments, which matches the GDDR5 transaction size. These segments allow NVIDIA GPUs to perform fine-grained DRAM accesses.
Consequently, I suppose that if a global store misses the L2, the GPU processes the store as follows:
-If the store fills a 32-byte segment completely (e.g. because the store is aligned and perfectly coalesced), the GPU does not fetch any data from DRAM.
-However, if the store does not fill the segment completely (e.g. because the access is misaligned and not completely coalesced), the remaining part of the segment is filled up with data from DRAM, causing a 32-byte load transaction.
This assumption roughly matches the performance decrease seen in the benchmark from the CUDA performance guide, which copies global memory in a misaligned manner ( http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#effects-of-misaligned-accesses ). Unfortunately, NVIDIA’s discussion of this benchmark’s performance does not seem sound to me, since it only considers DRAM loads, which the L2 cache should smooth out perfectly. Interestingly, NVIDIA has performed another, similar benchmark, which increments a vector residing in global memory by a constant, again in a misaligned manner ( https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/ ). In contrast to the first benchmark, the second one does not show a performance decrease. Hence, I conclude that there is no penalty for misaligned stores if the data already resides in the L2 cache. Overall, both benchmarks match the behaviour of a write-allocate/write-back cache.
Are my assumptions correct, or am I getting something wrong?