Unlock GPU Performance: Global Memory Access in CUDA

Originally published at: Unlock GPU Performance: Global Memory Access in CUDA | NVIDIA Technical Blog

Managing memory is one of the most important performance characteristics to consider when writing a GPU kernel. This post walks you through the important aspects you should know about global memory and its performance.

Global Memory

There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. Global…

Is there a typo in this blog post?

When using 2 or 3-dimensional thread blocks in a CUDA kernel, the threads are laid out linearly with the X index, or threadIdx.x, moving the fastest, then Y (threadIdx.y) and then Z (threadIdx.z). For example, if we have a 2D thread block with size (4,2), the threads will be ordered as: (0,0)(1,0)(2,0)(3,0)(1,0)(1,1)(2,1)(3,1).

What the blog says:

(0,0)(1,0)(2,0)(3,0)(1,0)(1,1)(2,1)(3,1)

What I think is correct:

(0,0)(1,0)(2,0)(3,0)(0,1)(1,1)(2,1)(3,1).