Calculate GLOBAL thread Id

Hello,

I am new to CUDA and trying to wrap my head around calculating the ‘global thread id’.

What I mean by this is the following:

Say we have a grid of (2,2,1) and blocks of (16,16,1); this will generate 1024 threads with the kernel invocation.

By ‘global thread id’ I mean the unique index of each thread within the kernel launch; in this case, threads 0 - 1023.

I am having trouble figuring out how to calculate it though. Here is what I’ve tried:

  1. Per CUDA Programming Guide:
int global_index = threadIdx.x + blockDim.x * threadIdx.y

but this seems to be the thread Id for the block, not the kernel.

  2. Per other documentation I have read:
int xindex = threadIdx.x + blockIdx.x * blockDim.x;

int yindex = threadIdx.y + blockIdx.y * blockDim.y;

int global_index = xindex + (gridDim.x * gridDim.y * yindex);

This just doesn’t get close.

It is clear that I am missing something very fundamental here…I just can’t wrap my head around it.

Any help is much appreciated.


Try tackling it with a top-down approach. At the top level, we have

globalThreadNum = blockNumInGrid * threadsPerBlock + threadNumInBlock

For now, assume we have a 2D grid and 2D blocks. Then

threadsPerBlock  = blockDim.x * blockDim.y

threadNumInBlock = threadIdx.x + blockDim.x * threadIdx.y (alternatively: threadIdx.y + blockDim.y * threadIdx.x)

blockNumInGrid   = blockIdx.x  + gridDim.x  * blockIdx.y  (alternatively: blockIdx.y  + gridDim.y  * blockIdx.x)

Analogous for 3D grids and 3D blocks.

Ah! Yes, makes much more sense. Thanks for the help!

Here’s a universal global index calculation function I use.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

//////////////////////////////////////////////////////////////////////////////
//Universal Gid calculation on any dimensional grid and any dimensional block//
//////////////////////////////////////////////////////////////////////////////
//Kernel code
__global__ void universalGidCalculation(int* input)
{
//First section locates and calculates thread offset within a block
int column = threadIdx.x;
int row = threadIdx.y;
int aisle = threadIdx.z;
int threads_per_row = blockDim.x; //# threads in x direction aka row
int threads_per_aisle = (blockDim.x * blockDim.y); //# threads in x and y direction for total threads per aisle

int threads_per_block = (blockDim.x * blockDim.y * blockDim.z);
int rowOffset = (row * threads_per_row);//how many rows to push out offset by
int aisleOffset = (aisle * threads_per_aisle);// how many aisles to push out offset by

//Second section locates and calculates block offset within the grid
int blockColumn = blockIdx.x;
int blockRow = blockIdx.y;
int blockAisle = blockIdx.z;
int blocks_per_row = gridDim.x;//# blocks in x direction aka blocks per row
int blocks_per_aisle = (gridDim.x * gridDim.y); // # blocks in x and y direction for total blocks per aisle
int blockRowOffset = (blockRow * blocks_per_row);// how many rows to push out block offset by
int blockAisleOffset = (blockAisle * blocks_per_aisle);// how many aisles to push out block offset by
int blockId = blockColumn + blockRowOffset + blockAisleOffset;

int blockOffset = (blockId * threads_per_block);

int gid = (blockOffset + aisleOffset + rowOffset + column);

printf ("blockIdx : (%d,%d,%d) ThreadIdx :(%d,%d,%d), gid : (%2d), input[gid] :(%2d) \n",
         blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, gid, input[gid]);

}//end universalGidCalculation

int main()
{
int arraySize = 64;
int arrayByteSize = sizeof(int) * arraySize;
int hostData[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,
33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};

//print out array before running kernel code on GPU to compare against
for (int i = 0; i < arraySize; i++)
{
    if (i == 33)
    {
        printf("%d\n", hostData[i]);
    }
    else
    {
        printf("%d ", hostData[i]);
    }
}
printf("\n\n");

//Create device Data ptr, allocate the space on GPU and then copy it over
int * deviceData;
cudaMalloc((void**)&deviceData, arrayByteSize);
cudaMemcpy(deviceData, hostData, arrayByteSize, cudaMemcpyHostToDevice);

//Create blocks and grid
//Change these values to reflect different block and grid sizes to test and verify GID calculation above.
dim3 block(2,2,2);
dim3 grid(2,2,2);

universalGidCalculation <<< grid, block >>> (deviceData);

cudaDeviceSynchronize();

cudaDeviceReset();

return 0;

}//end main

Thanks for this explanation, @njuffa .

I am a bit puzzled by the ambiguity of the calculation for threadNumInBlock. You describe two alternatives


threadNumInBlock = threadIdx.x + blockDim.x * threadIdx.y (alternatively: threadIdx.y + blockDim.y * threadIdx.x)

If working with threadIdx (for example) (1,0) and a blockDim of (32,32), the two alternatives would return either 1 or 32. Is that correct? Shouldn’t there be just one possible result?

Not ambiguity but choice. This is a 2D-to-1D mapping. A programmer can freely choose whether they want to use column-major or row-major ordering for this mapping.

Thanks for the explanation!

Can you maybe point me to the section in the cuda programming guide (Contents — CUDA C Programming Guide) specifying how users can choose the mapping? I would expect this to be documented in section 2, but I cannot find anything related.

To my knowledge, there is no concept of a “global thread ID” in CUDA, that is why you cannot find it in the Programming Guide. GPU hardware may have such a concept (and it may have bearing on performance), but looking back to the start of the thread, it seems to me the OP simply wanted to enumerate all threads of a kernel launch when using both multi-dimensional grid and multi-dimensional block organization.

Why would one want to perform such a flattening? For example, when the available degree of parallelism exceeds any one dimension of a grid. Example: Recently someone inquired how to examine 2^64 cases of a particular computation in parallel to filter out cases matching some particular set of constraints. I pointed out that this would take about two years on the fastest GPUs currently available, but that something like 2^50 cases might be practical to tackle.

One can enumerate threads in any which way desired. However, it makes the most sense to consistently enumerate in either row-major or column-major fashion, analogous to determining an address offset in a multi-dimensional array in C/C++.


Thank you very much for your explanation, @njuffa .