So you have 4 elements per block and 9 blocks per grid, making a total of 36 addressable elements.

0th element: tid.x = tid.y = bid.x = bid.y = 0

id = 0 + 0*2 + 0*2*2 + 0*2*2*3 = 0

1st element: tid.x = 1 all others = 0:

id = 1 + …(0) = 1

2nd element: tid.x is zero again but tid.y = 1:

id = 0 + 1*2 + 0*2*2 + 0*2*2*3 = 2

and so on…
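
To make the arithmetic above easy to check, here is a minimal host-side sketch in plain C++ (so it compiles without nvcc). The names `Dim2` and `flat_id` are made up for this illustration; inside a real kernel the four arguments would be the built-ins threadIdx, blockIdx, blockDim and gridDim. The sizes 2x2 threads per block and 3x3 blocks per grid are read off the numbers in the examples.

```cpp
#include <cassert>

// Hypothetical stand-in for the x/y parts of CUDA's built-in index types.
struct Dim2 { unsigned x, y; };

// Flat element id:
//   tid.x + tid.y*blockDim.x + bid.x*blockDim.x*blockDim.y
//         + bid.y*blockDim.x*blockDim.y*gridDim.x
constexpr unsigned flat_id(Dim2 tid, Dim2 bid, Dim2 block_dim, Dim2 grid_dim) {
    // Every index has to stay below its "border".
    assert(tid.x < block_dim.x && tid.y < block_dim.y);
    assert(bid.x < grid_dim.x && bid.y < grid_dim.y);
    return tid.x
         + tid.y * block_dim.x
         + bid.x * block_dim.x * block_dim.y
         + bid.y * block_dim.x * block_dim.y * grid_dim.x;
}

// The worked examples, with block_dim = (2,2) and grid_dim = (3,3):
static_assert(flat_id({0, 0}, {0, 0}, {2, 2}, {3, 3}) == 0,  "0th element");
static_assert(flat_id({1, 0}, {0, 0}, {2, 2}, {3, 3}) == 1,  "1st element");
static_assert(flat_id({0, 1}, {0, 0}, {2, 2}, {3, 3}) == 2,  "2nd element");
static_assert(flat_id({1, 1}, {2, 2}, {2, 2}, {3, 3}) == 35, "last element");
```

The last check is the other end of the range: with every index at its maximum you get 1 + 1*2 + 2*2*2 + 2*2*2*3 = 35, i.e. the 36th element.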

so you see you have 4 “directions” to count:

threadIdx.x (innermost): if it reaches the blockDim.x border, it is reset to zero and threadIdx.y is increased by one.

threadIdx.y (number of lines of blockDim.x elements): if this one reaches the blockDim.y border, it is reset to zero and the blockIdx.x is increased by one.

blockIdx.x (number of blocks or in the 1D case “lines” of blockDim.x*blockDim.y elements): if this one reaches the gridDim.x border, it is reset to zero and the blockIdx.y is increased by one.

blockIdx.y (number of blocks or in the 1D case “lines” of blockDim.x*blockDim.y*gridDim.x elements): if this one reaches the gridDim.y border, the kernel is finished.
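
The four counting directions behave like the digits of an odometer, which can be sketched on the host in plain C++. `Index4`, `advance` and `enumerate_ids` are hypothetical names used only for this illustration, and the loop only shows the counting order: on the GPU the threads run in parallel, not one after another like this.

```cpp
#include <cassert>
#include <vector>

// Hypothetical bundle of the four counters: threadIdx.x/y and blockIdx.x/y.
struct Index4 { unsigned tx, ty, bx, by; };

// Step to the next element. Whenever a counter reaches its border it is
// reset to zero and the next one is increased. Returns false once the
// outermost counter reaches the gridDim.y border.
bool advance(Index4& i, unsigned bdx, unsigned bdy, unsigned gdx, unsigned gdy) {
    if (++i.tx < bdx) return true;   // threadIdx.x, innermost
    i.tx = 0;
    if (++i.ty < bdy) return true;   // threadIdx.y
    i.ty = 0;
    if (++i.bx < gdx) return true;   // blockIdx.x
    i.bx = 0;
    return ++i.by < gdy;             // blockIdx.y, outermost
}

// Visit every element in counting order and record the id the formula gives.
std::vector<unsigned> enumerate_ids(unsigned bdx, unsigned bdy,
                                    unsigned gdx, unsigned gdy) {
    assert(bdx && bdy && gdx && gdy);  // all sizes must be at least 1
    std::vector<unsigned> ids;
    Index4 i{0, 0, 0, 0};
    do {
        ids.push_back(i.tx + i.ty * bdx + i.bx * bdx * bdy + i.by * bdx * bdy * gdx);
    } while (advance(i, bdx, bdy, gdx, gdy));
    return ids;
}
```

Calling `enumerate_ids(2, 2, 3, 3)` should return the 36 ids in order, each one equal to its position in the list, which is exactly the property the formula is built to have.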