What decides the ordering of threadIdx, blockIdx, blockDim, gridDim statements?

I was expecting the output statements to be in the same order as the device code snippet i.e one thread after the other. What causes the below observed output statement order ?

device code snippet:

printf(“threadIdx: %d,%d,%d\n”,threadIdx.x,threadIdx.y,threadIdx.z);
printf(“blockIdx: %d,%d,%d\n”,blockIdx.x,blockIdx.y,blockIdx.z);
printf(“blockDim: %d,%d,%d\n”,blockDim.x,blockDim.y,blockDim.z);
printf(“gridDim: %d,%d,%d\n”,gridDim.x,gridDim.y,gridDim.z);

Output:

threadIdx: 0,0,0
threadIdx: 1,0,0
threadIdx: 2,0,0
threadIdx: 0,0,0
threadIdx: 1,0,0
threadIdx: 2,0,0
blockIdx: 1,0,0
blockIdx: 1,0,0
blockIdx: 1,0,0
blockIdx: 0,0,0
blockIdx: 0,0,0
blockIdx: 0,0,0
blockDim: 3,1,1
blockDim: 3,1,1
blockDim: 3,1,1
blockDim: 3,1,1
blockDim: 3,1,1
blockDim: 3,1,1
gridDim: 2,1,1
gridDim: 2,1,1
gridDim: 2,1,1
gridDim: 2,1,1
gridDim: 2,1,1
gridDim: 2,1,1

Your mentality as a CUDA programmer should be that threads can execute in any order. Therefore asking questions like this is risky if you subsequently go on to program expecting the answer to apply in other cases. However:

  • threads within a warp generally execute in lockstep. That is they all execute the first instruction, then they all execute the second instruction, and so on.
  • threads in different warps are generally not considered to execute in “lockstep”
  • device printf is probably a special case where there is some kind of serialization of behavior going on, meaning that since everything ends up in a single “buffer”, somehow there is some kind of ordering “imposed” by the printf routine (somehow).
  • if your gpu has enough “capacity” then it may be that threads from different warps or even threadblocks appear to be executing “concurrently”

Variables like blockDim and gridDim are indentical for all threads in the grid; therefore we expect these to be the same; no ordering is evident within those groups.

So the threadIdx printout appears first, because it appears first in your code. threadIdx is unique within a block but not unique across the grid. It appears you have a launch configuration of <<<2,3>>>. This consists of 2 warps, one warp in one block and one in another. There is no defined order for the threads in a warp, they are generally considered to be executing in lockstep, so there is some unpublished mechanics that are causing thread 0 (in the warp) to get access to the printf routine first, then 1 then 2.

The blockIdx printout is next because it is next in your code. Threads in a warp are executing in lockstep, and threads across blocks (in this case) appear to be executing roughly “concurrently” (in this case). In this case it appears that block 1 got to the “printf” routine before block 0 did. This sort of observation should be considered “incidental”, not something you should rely on.

The blockDim printout appears next, because the printf statement is next, and the gridDim printout appears last, because the printf statement is last. No ordering is “evident” among these for the reasons already stated.

Thanks for the great explanation ! It helps.