Why is this CUDA kernel repeating indices with a 3D grid?

I’m trying to figure out this bug, and I’ve reduced it to the simplest implementation I could, but I can’t figure out exactly what’s happening.

#include <cstdio>

__device__ int get_index(int i, int j, int k, int nx, int ny) {
    return i + (nx * j) + (nx * ny * k);
}

__global__ void update_Ex(const int Nx, const int Ny, const int Nz) {
    int i = threadIdx.x + blockIdx.x * gridDim.x;
    int j = threadIdx.y + blockIdx.y * gridDim.y;
    int k = threadIdx.z + blockIdx.z * gridDim.z;

    int ijk = get_index(i, j, k, Nx, Ny);

    if ((i < Nx) && (j < Ny) && (k < Nz)){
         printf("%d %d %d %d\n", i, j, k, ijk);
    }

    /* this outputs the same thing */
    // if (ijk < Nx * Ny * Nz) {
    //     printf("%d %d %d %d\n", i, j, k, ijk);
    // }
}


int main() {
    int Nx = 16;
    int Ny = Nx;
    int Nz = Nx;


    int threads = 8;
    dim3 TPB(threads, threads, threads);

    int bx = (Nx + TPB.x - 1) / TPB.x;
    int by = (Ny + TPB.y - 1) / TPB.y;
    int bz = (Nz + TPB.z - 1) / TPB.z;

    dim3 BPG(bx, by, bz);

    // Launch kernel
    update_Ex<<< BPG, TPB >>> (Nx, Ny, Nz);
    cudaDeviceSynchronize();


    return 0;
}

If I capture the output to a file and sort by the IJK index, I get the following output:

0 [(0, 0, 0)]
1 [(1, 0, 0)]
2 [(2, 0, 0), (2, 0, 0)]
3 [(3, 0, 0), (3, 0, 0)]
4 [(4, 0, 0), (4, 0, 0)]
...
32 [(0, 2, 0), (0, 2, 0)]
33 [(1, 2, 0), (1, 2, 0)]
34 [(2, 2, 0), (2, 2, 0), (2, 2, 0), (2, 2, 0)]
35 [(3, 2, 0), (3, 2, 0), (3, 2, 0), (3, 2, 0)]
36 [(4, 2, 0), (4, 2, 0), (4, 2, 0), (4, 2, 0)]

So why is it repeatedly calculating the same indices? Some of the IJK indices have up to 8 print statements executed for them.

I know it has something to do with using a cubic grid (dim3 (bx, by, bz)), but there is so little good documentation on how to use CUDA in this manner that I have no idea what might be going wrong.

I could switch to a “linear” launch where instead of dim3(bx, by, bz) I use dim3(bx, 1, 1), which is equivalent to passing a single integer to the kernel launch, but I really need the separate i, j, k indices since I’m computing specific offsets from them, e.g.

ijk = i + (Nx * j) + (Nx * Ny * k);
jm1 = i + (Nx * (j - 1)) + (Nx * Ny * k);
km1 = i + (Nx * j) + (Nx * Ny * (k - 1));
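
(For completeness: if I did flatten to a 1D launch, I know the i, j, k components could be recovered from the single index with division and modulo, roughly as in the sketch below — the helper name is just something I made up — but it feels clunkier than having the 3D indices directly.)

__device__ void unflatten_index(int tid, int nx, int ny, int *i, int *j, int *k) {
    /* hypothetical helper: recover 3D coordinates from a flat index,
       the inverse of get_index() above */
    *i = tid % nx;
    *j = (tid / nx) % ny;
    *k = tid / (nx * ny);
}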

Follow up question, why is a 3D grid dim3(x, y, z) almost never used? I can’t find any good documentation, nor any good examples, stackoverflow posts, or random blog posts that utilize CUDA this way. Most people only use the x-dimension, and occasionally use x-y. I haven’t found a single use case that does it this way, yet it’s always described as possible before being ignored.

This is not how you compute a canonical index in a typical 3D grid:

int i = threadIdx.x + blockIdx.x * gridDim.x;
                                   ^^^^^^^^^

and likewise for j, k. That should be blockDim.x (and .y and .z):

int i = threadIdx.x + blockIdx.x * blockDim.x;

If you study nearly any CUDA code, for example the cuda vectorAdd sample code, you will see that kind of construct for .x (and it structurally does not change for 2D or 3D grids).
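
Applied to your kernel, a minimal corrected sketch looks like this (only the three index lines change; everything else is as you posted it):

__global__ void update_Ex(const int Nx, const int Ny, const int Nz) {
    /* global 3D coordinates: thread position within the block plus the
       block offset, which is blockIdx * blockDim in each dimension */
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    int k = threadIdx.z + blockIdx.z * blockDim.z;

    if ((i < Nx) && (j < Ny) && (k < Nz)) {
        int ijk = i + (Nx * j) + (Nx * Ny * k);
        printf("%d %d %d %d\n", i, j, k, ijk);
    }
}

With your launch configuration of dim3(2, 2, 2) blocks of dim3(8, 8, 8) threads, each (i, j, k) in the 16x16x16 volume is then visited exactly once.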

The most important documentation reference for CUDA is the programming guide. The thread/built-in-variable structure is documented starting here, and from that point on I found more than 10 examples of the usage I am suggesting above, in both 1D and 2D variants.

I’ve already pointed to the documentation; if you think it should be improved, feel free to file a bug. I think the 3D case doesn’t receive much attention, perhaps because once you can extend 1D to 2D, which is covered in some detail in the documentation, further extending to 3D doesn’t seem like it should be difficult. Again, if you disagree, you may wish to file a bug.

The obvious reason to use a 3D grid scheme is if you are processing a 3D data set, although it’s not mandatory or the only way to do that. There are certainly stackoverflow posts and also CUDA sample codes that use 3D grids. I will give an example of each:

SO

(here are 10 or more additional examples on SO, and I’m sure there are others as well.)

CUDA sample code: volumeFiltering

Follow up question, why is a 3D grid dim3(x, y, z) almost never used? I can’t find any good documentation, nor any good examples, stackoverflow posts, or random blog posts that utilize CUDA this way.

Just because a language feature exists doesn’t mean that programmers find it particularly useful or that it is the best fit for common use cases. The question is akin to “Why is the standard math function remquo() almost never used? It exists in C, C++, CUDA and other programming languages yet beyond basic documentation there don’t seem to be any good examples, stackoverflow posts, or random blog posts that utilize this math function.”

I don’t think it is a particular issue here, but another caveat to be aware of is that if you are using in-kernel printf for these kinds of bulk verification tests, there is a limit on the total amount of output you can get, and there is no warning or indicator that you have exceeded the limit. Therefore I generally don’t recommend in-kernel printf for large-scale output.

Yes, it’s documented, and yes you can find SO questions on usage of in-kernel printf.

If you decide to fix/repeat this test on a larger grid, you may eventually run into this.
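
For reference, the size of the device-side printf buffer can be queried and raised from the host before any kernel launch via cudaDeviceGetLimit/cudaDeviceSetLimit; a minimal sketch (the 16 MB value is just an arbitrary example):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t fifo_bytes = 0;
    cudaDeviceGetLimit(&fifo_bytes, cudaLimitPrintfFifoSize);
    printf("default printf FIFO size: %zu bytes\n", fifo_bytes);

    /* enlarge the buffer; output beyond the limit is silently dropped,
       so size it for the worst case you expect */
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);
    return 0;
}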

and likewise for j, k. That should be blockDim.x (and .y and .z):

yeah, that will do it. I just passed over that particular problem for a while, apparently.

I was just getting frustrated because, for example, the first link to the Programming Guide only covers X and Y but not Z, and since my Z version wasn’t working (I did get a 2D version working, though now I think that was just coincidence), I couldn’t find anything more concrete to help me through it.

Thank you though, I’ll give some of those other links a thorough reading. I just have to start asking the right questions.
