Why is this CUDA kernel repeating indices with a 3D grid?

This is not how you compute a canonical index in a typical 3D grid:

int i = threadIdx.x + blockIdx.x * gridDim.x;
                                   ^^^^^^^^^

and likewise for j, k. That should be blockDim.x (and .y and .z):

int i = threadIdx.x + blockIdx.x * blockDim.x;

If you study nearly any CUDA code, for example the CUDA vectorAdd sample code, you will see that kind of construct for .x (and it does not change structurally for 2D or 3D grids).
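
To make that concrete, here is a minimal sketch of the same pattern extended to all three axes. The kernel name, the volume dimensions (nx, ny, nz), and the placeholder write are assumptions for illustration, not taken from your code:

__global__ void kernel3D(float *data, int nx, int ny, int nz)
{
    // canonical per-axis indices: threadIdx + blockIdx * blockDim on each axis
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    int k = threadIdx.z + blockIdx.z * blockDim.z;

    if (i < nx && j < ny && k < nz) {                  // guard threads that fall outside the volume
        size_t idx = ((size_t)k * ny + j) * nx + i;    // flatten (i, j, k) into a 1D offset
        data[idx] = 1.0f;                              // placeholder work
    }
}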

The most important documentation reference for CUDA is the programming guide. This thread/built-in-variable structure is documented starting here, and from that point I found more than 10 examples of the usage I am suggesting above, in both 1D and 2D variants.

I’ve already pointed to the documentation; if you think it should be improved, feel free to file a bug. I suspect the 3D case doesn’t receive much attention because, once you can extend 1D to 2D (which the documentation covers in some detail), extending further to 3D shouldn’t be difficult. Again, if you disagree, you may wish to file a bug. The obvious reason to use a 3D grid scheme is that you are processing a 3D data set, although that is neither mandatory nor the only way to do it. There are certainly Stack Overflow posts and CUDA sample codes that use 3D grids. I will give an example of each:

SO

(here are 10 or more additional examples on SO, and I’m sure there are others as well.)

CUDA sample code: volumeFiltering
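
As a rough illustration of the 3D-data-set case mentioned above, a typical host-side launch sizes the grid with a ceiling division on each axis so the blocks cover the whole volume. The block shape (8x8x8), the volume dimensions, and the device pointer d_data are assumptions for this sketch, which reuses the kernel3D example above:

int nx = 128, ny = 128, nz = 64;
dim3 block(8, 8, 8);                            // 512 threads per block
dim3 grid((nx + block.x - 1) / block.x,         // ceiling division on each axis
          (ny + block.y - 1) / block.y,
          (nz + block.z - 1) / block.z);
kernel3D<<<grid, block>>>(d_data, nx, ny, nz);  // d_data: device allocation, assumed to exist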