Confusion regarding CUDA 2D indexing found in the official Programming Guide

From C/C++ we know that a two-dimensional array v[y_index][x_index] is stored contiguously in row-major order, where elements adjacent in memory differ by one in the x index.

The manual states on page 12:

The index of a thread and its thread ID relate to each other in a straightforward way: for a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).

So it seems that the thread index is organized in the same way.

BUT

As an example, the following code adds two matrices A and B of size NxN and stores the result into
matrix C.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

Here, i is associated with x as if the matrix were transposed: incrementing i in A[i][j] skips a whole “row”, while incrementing j steps by one element. Can someone explain this paradox?

It would suggest that x and y are swapped and the data is stored in column-major order, if C semantics still apply.

Now moving to the FFT guide:

Advanced layout can be perceived as an additional layer of abstraction above the access to input/
output data arrays. An element of coordinates [z][y][x] in signal number b in the batch will
be associated with the following addresses in the memory:
▶ 1D
input[ b * idist + x * istride ]
output[ b * odist + x * ostride ]
▶ 2D
input[ b * idist + (x * inembed[1] + y) * istride ]
output[ b * odist + (x * onembed[1] + y) * ostride ]
▶ 3D
input[ b * idist + ((x * inembed[1] + y) * inembed[2] + z) * istride ]
output[ b * odist + ((x * onembed[1] + y) * onembed[2] + z) * ostride ]
The istride and ostride parameters denote the distance between two successive input and output
elements in the least significant (that is, the innermost) dimension respectively. In a single 1D transform,
if every input element is to be used in the transform, istride should be set to 1; if every other
input element is to be used in the transform, then istride should be set to 2. Similarly, in a single
1D transform, if it is desired to output final elements one after another compactly, ostride should
be set to 1; if spacing is desired between the least significant dimension output data, ostride should
be set to the distance between the elements.
The inembed and onembed parameters define the number of elements in each dimension in the
input array and the output array respectively. The inembed[rank-1] contains the number of elements
in the least significant (innermost) dimension of the input data excluding the istride elements;
the number of total elements in the least significant dimension of the input array is then
istride * inembed[rank-1]. The inembed[0] or onembed[0] corresponds to the most significant
(that is, the outermost) dimension and is effectively ignored since the idist or odist parameter provides
this information instead. Note that the size of each dimension of the transform should be less
than or equal to the inembed and onembed values for the corresponding dimension, that is n[i] ≤
inembed[i], n[i] ≤ onembed[i], where i ∈ {0, . . . , rank − 1}.

What is the innermost dimension here? x? Once again, x seems to “jump”, unlike in C++.

AFAIK much of the FFT convention comes from Fortran originally.

As to the use of CUDA block and thread indices: do as you need and like. Functionally it does not matter which thread performs a given task. You can even change the assignment within one device function, as long as you keep track of which data you store in local variables.
For performance you have to make sure that memory accesses are coalesced (global memory) or free of bank conflicts (shared memory), respectively.

I have written kernels with quite complex combinations of single-bit shifts and recombinations of the thread numbers to compute an index.

Thanks.

I realized that this book is mandatory to understand the whole process:

Programming Massively Parallel Processors