Confusion regarding CUDA 2D indexing found in the official Programming Guide

From C/C++ we know that a two-dimensional array v[y_index][x_index] is stored contiguously in row-major order, where elements adjacent in memory differ by one in the x index.

The manual states on page 12:

The index of a thread and its thread ID relate to each other in a straightforward way: for a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).

So it seems that the thread index is organized in the same way.

BUT

As an example, the following code adds two matrices A and B of size NxN and stores the result into
matrix C.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

Here, i is associated with x as if the matrix were transposed: incrementing i in A[i][j] skips a whole “row”, while incrementing j steps by one element. Can someone explain this paradox?

It would suggest that x and y are swapped and the data is stored in column-major order, if C semantics still apply.

Now moving to the FFT guide:

Advanced layout can be perceived as an additional layer of abstraction above the access to input/
output data arrays. An element of coordinates [z][y][x] in signal number b in the batch will
be associated with the following addresses in the memory:
▶ 1D
input[ b * idist + x * istride ]
output[ b * odist + x * ostride ]
▶ 2D
input[ b * idist + (x * inembed[1] + y) * istride ]
output[ b * odist + (x * onembed[1] + y) * ostride ]
▶ 3D
input[ b * idist + ((x * inembed[1] + y) * inembed[2] + z) * istride ]
output[ b * odist + ((x * onembed[1] + y) * onembed[2] + z) * ostride ]
The istride and ostride parameters denote the distance between two successive input and output
elements in the least significant (that is, the innermost) dimension respectively. In a single 1D transform,
if every input element is to be used in the transform, istride should be set to 1; if every other
input element is to be used in the transform, then istride should be set to 2. Similarly, in a single
1D transform, if it is desired to output final elements one after another compactly, ostride should
be set to 1; if spacing is desired between the least significant dimension output data, ostride should
be set to the distance between the elements.
The inembed and onembed parameters define the number of elements in each dimension in the
input array and the output array respectively. The inembed[rank-1] contains the number of elements
in the least significant (innermost) dimension of the input data excluding the istride elements;
the number of total elements in the least significant dimension of the input array is then
istride * inembed[rank-1]. The inembed[0] or onembed[0] corresponds to the most significant
(that is, the outermost) dimension and is effectively ignored since the idist or odist parameter provides
this information instead. Note that the size of each dimension of the transform should be less
than or equal to the inembed and onembed values for the corresponding dimension, that is n[i] ≤
inembed[i], n[i] ≤ onembed[i], where i ∈ {0, . . . , rank − 1}.

What is the innermost dimension here? x? Once again, x seems to “jump”, unlike in C++.

AFAIK much of the FFT convention comes from Fortran originally.

As to the use of CUDA block and thread indices: do as you need and like. Functionally it does not matter which thread performs a given task. You can even change the assignment within one device function, as long as you keep track of which data you store in local variables.
For performance you have to make sure that memory accesses are coalesced (global memory) or free of bank conflicts (shared memory), respectively.

I have written kernels with quite complex combinations of single-bit shifts and recombinations of the thread numbers to compute an index.

Thanks.

I realized that this book is mandatory to understand the whole process:

Programming Massively Parallel Processors