What decides the ordering of threadIdx, blockIdx, blockDim, gridDim statements?

SKA · September 18, 2021, 6:44am

I was expecting the output statements to be in the same order as the device code snippet i.e one thread after the other. What causes the below observed output statement order ?

device code snippet:

printf(“threadIdx: %d,%d,%d\n”,threadIdx.x,threadIdx.y,threadIdx.z);
printf(“blockIdx: %d,%d,%d\n”,blockIdx.x,blockIdx.y,blockIdx.z);
printf(“blockDim: %d,%d,%d\n”,blockDim.x,blockDim.y,blockDim.z);
printf(“gridDim: %d,%d,%d\n”,gridDim.x,gridDim.y,gridDim.z);

Output:

threadIdx: 0,0,0
threadIdx: 1,0,0
threadIdx: 2,0,0
threadIdx: 0,0,0
threadIdx: 1,0,0
threadIdx: 2,0,0
blockIdx: 1,0,0
blockIdx: 1,0,0
blockIdx: 1,0,0
blockIdx: 0,0,0
blockIdx: 0,0,0
blockIdx: 0,0,0
blockDim: 3,1,1
blockDim: 3,1,1
blockDim: 3,1,1
blockDim: 3,1,1
blockDim: 3,1,1
blockDim: 3,1,1
gridDim: 2,1,1
gridDim: 2,1,1
gridDim: 2,1,1
gridDim: 2,1,1
gridDim: 2,1,1
gridDim: 2,1,1

Robert_Crovella · September 18, 2021, 12:09pm

Your mentality as a CUDA programmer should be that threads can execute in any order. Therefore asking questions like this is risky if you subsequently go on to program expecting the answer to apply in other cases. However:

threads within a warp generally execute in lockstep. That is they all execute the first instruction, then they all execute the second instruction, and so on.
threads in different warps are generally not considered to execute in “lockstep”
device printf is probably a special case where there is some kind of serialization of behavior going on, meaning that since everything ends up in a single “buffer”, somehow there is some kind of ordering “imposed” by the printf routine (somehow).
if your gpu has enough “capacity” then it may be that threads from different warps or even threadblocks appear to be executing “concurrently”

Variables like blockDim and gridDim are indentical for all threads in the grid; therefore we expect these to be the same; no ordering is evident within those groups.

So the threadIdx printout appears first, because it appears first in your code. threadIdx is unique within a block but not unique across the grid. It appears you have a launch configuration of <<<2,3>>>. This consists of 2 warps, one warp in one block and one in another. There is no defined order for the threads in a warp, they are generally considered to be executing in lockstep, so there is some unpublished mechanics that are causing thread 0 (in the warp) to get access to the printf routine first, then 1 then 2.

The blockIdx printout is next because it is next in your code. Threads in a warp are executing in lockstep, and threads across blocks (in this case) appear to be executing roughly “concurrently” (in this case). In this case it appears that block 1 got to the “printf” routine before block 0 did. This sort of observation should be considered “incidental”, not something you should rely on.

The blockDim printout appears next, because the printf statement is next, and the gridDim printout appears last, because the printf statement is last. No ordering is “evident” among these for the reasons already stated.

SKA · September 26, 2021, 9:49am

Thanks for the great explanation ! It helps.

system · November 25, 2021, 9:49am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Why is this CUDA program ignoring if-else guard? CUDA Programming and Performance	6	422	August 12, 2023
Why cuda printf in threads could not print in complete randomness? CUDA Programming and Performance	2	1575	April 8, 2018
Record the execution order of the threads? CUDA Programming and Performance	7	7421	March 28, 2008
Introduction of a 'printf("");' leads to different results CUDA Programming and Performance	1	1991	October 13, 2011
global to shared mem loads and sync CUDA Programming and Performance	26	11441	February 21, 2008
[warp's asynchrous behavior] does warp 0 always gets executed first? CUDA Programming and Performance	2	274	August 25, 2023
Thread Divergence, branches, examples CUDA Programming and Performance	3	2681	May 25, 2011
syncthreads doesn't stop a half of threads. CUDA Programming and Performance	9	916	June 5, 2015
It seems that CUDA kernel are not running parallel CUDA Programming and Performance	5	25	August 29, 2024
Interactions among blocks CUDA Programming and Performance	11	11461	February 6, 2010

What decides the ordering of threadIdx, blockIdx, blockDim, gridDim statements?

Related topics