Originally published at: Unlock GPU Performance: Global Memory Access in CUDA | NVIDIA Technical Blog
Managing memory is one of the most important performance characteristics to consider when writing a GPU kernel. This post walks you through the important aspects you should know about global memory and its performance.
Global Memory
There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. Global…
I think there might be a typo in this blog post, in the following passage:
When using 2 or 3-dimensional thread blocks in a CUDA kernel, the threads are laid out linearly with the X index, or threadIdx.x, moving the fastest, then Y (threadIdx.y) and then Z (threadIdx.z). For example, if we have a 2D thread block with size (4,2), the threads will be ordered as: (0,0)(1,0)(2,0)(3,0)(1,0)(1,1)(2,1)(3,1).
What the blog says:
(0,0)(1,0)(2,0)(3,0)(1,0)(1,1)(2,1)(3,1)
What I think is correct:
(0,0)(1,0)(2,0)(3,0)(0,1)(1,1)(2,1)(3,1).
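To double-check the ordering, here is a small host-side C++ sketch of the index arithmetic. It assumes the usual CUDA rule that, in a block of size (Dx, Dy, Dz), the thread with index (x, y, z) has linear ID x + y*Dx + z*Dx*Dy. The names `Idx3`, `linear_thread_id`, `thread_index_from_linear`, and `block_order_2d` are hypothetical helpers made up for this illustration; they are not part of the CUDA API, and nothing here runs on the GPU.

```cpp
#include <string>

// Hypothetical stand-in for CUDA's dim3 / threadIdx (x, y, z fields only).
struct Idx3 { unsigned x, y, z; };

// Linear thread ID within a block of size `dim`: x varies fastest, then y, then z.
inline unsigned linear_thread_id(Idx3 t, Idx3 dim) {
    return t.x + t.y * dim.x + t.z * dim.x * dim.y;
}

// Inverse mapping: recover (x, y, z) from a linear thread ID.
inline Idx3 thread_index_from_linear(unsigned id, Idx3 dim) {
    return { id % dim.x, (id / dim.x) % dim.y, id / (dim.x * dim.y) };
}

// Lists the (x,y) indices of a 2D block in linear-ID order, e.g. "(0,0)(1,0)...".
inline std::string block_order_2d(unsigned dx, unsigned dy) {
    std::string out;
    const Idx3 dim{dx, dy, 1};
    for (unsigned id = 0; id < dx * dy; ++id) {
        const Idx3 t = thread_index_from_linear(id, dim);
        out += "(" + std::to_string(t.x) + "," + std::to_string(t.y) + ")";
    }
    return out;
}
```

Calling `block_order_2d(4, 2)` walks linear IDs 0 through 7. The fifth entry (linear ID 4) maps back to (0,1) rather than (1,0), which is consistent with the ordering I suggested above.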