From C/C++ we know that a two dimensional array v[y_index][x_index] is stored as one peace in row-major order where adjacent elements differ in one x position.
The manual states on page 12:
The index of a thread and its thread ID relate to each other in a straightforward way: For a onedimensional
block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of
a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a
thread of index (x, y, z) is (x + y Dx + z Dx Dy).
So it seems that the thread index is organized in the same way.
BUT
As an example, the following code adds two matrices A and B of size NxN and stores the result into
matrix C.
∕∕ Kernel definition
global void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
int i = threadIdx.x;
int j = threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
…
∕∕ Kernel invocation with one block of N * N * 1 threads
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
Here, i is associated with x as if the matrix was transposed, since A[i][j] will skip one “row” for i and increment for j. Can someone explain this paradoxon?
It would suggest that x and y is renamed and data is stored in column-major-order, if the language C still applies.
Now moving to the FFT guide:
Advanced layout can be perceived as an additional layer of abstraction above the access to input/
output data arrays. An element of coordinates [z][y][x] in signal number b in the batch will
be associated with the following addresses in the memory:
▶ 1D
input[ b * idist + x * istride ]
output[ b * odist + x * ostride ]
▶ 2D
input[ b * idist` + (x * inembed[1] + y) * istride ]
output[ b * odist + (x * onembed[1] + y) * ostride ]
▶ 3D
input[ b * idist + ((x * inembed[1] + y) * inembed[2] + z) * istride ]
output[ b * odist + ((x * onembed[1] + y) * onembed[2] + z) * ostride ]
The istride and ostride parameters denote the distance between two successive input and output
elements in the least significant (that is, the innermost) dimension respectively. In a single 1D transform,
if every input element is to be used in the transform, istride should be set to 1; if every other
input element is to be used in the transform, then istride should be set to 2. Similarly, in a single
1D transform, if it is desired to output final elements one after another compactly, ostride should
be set to 1; if spacing is desired between the least significant dimension output data, ostride should
be set to the distance between the elements.
The inembed and onembed parameters define the number of elements in each dimension in the
input array and the output array respectively. The inembed[rank-1] contains the number of elements
in the least significant (innermost) dimension of the input data excluding the istride elements;
the number of total elements in the least significant dimension of the input array is then
istrideinembed[rank-1]. The inembed[0] or onembed[0] corresponds to the most significant*
(that is, the outermost) dimension and is effectively ignored since the idist or odist parameter provides
this information instead. Note that the size of each dimension of the transform should be less
than or equal to the inembed and onembed values for the corresponding dimension, that is n[i]
inembed[i], n[i] onembed[i], where i ∈ {0, . . . , rank − 1}.
What is the innermost dimension here? x? Once again, x seems to “jump”, not like in C++.