I’m trying to figure out this bug, and I’ve reduced it to the simplest implementation I could, but I can’t figure out exactly what’s happening.
#include <cstdio>

__device__ int get_index(int i, int j, int k, int nx, int ny) {
    return i + (nx * j) + (nx * ny * k);
}

__global__ void update_Ex(const int Nx, const int Ny, const int Nz) {
    int i = threadIdx.x + blockIdx.x * gridDim.x;
    int j = threadIdx.y + blockIdx.y * gridDim.y;
    int k = threadIdx.z + blockIdx.z * gridDim.z;
    int ijk = get_index(i, j, k, Nx, Ny);
    if ((i < Nx) && (j < Ny) && (k < Nz)) {
        printf("%d %d %d %d\n", i, j, k, ijk);
    }
    /* this outputs the same thing */
    // if (ijk < Nx * Ny * Nz) {
    //     printf("%d %d %d %d\n", i, j, k, ijk);
    // }
}

int main() {
    int Nx = 16;
    int Ny = Nx;
    int Nz = Nx;
    int threads = 8;
    dim3 TPB(threads, threads, threads);
    int bx = (Nx + TPB.x - 1) / TPB.x;
    int by = (Ny + TPB.y - 1) / TPB.y;
    int bz = (Nz + TPB.z - 1) / TPB.z;
    dim3 BPG(bx, by, bz);
    // Launch kernel
    update_Ex<<<BPG, TPB>>>(Nx, Ny, Nz);
    cudaDeviceSynchronize();
    return 0;
}
If I capture the output to a file and sort by the ijk index, I get the following (each line is an ijk value, followed by every (i, j, k) triple that printed it):
0 [(0, 0, 0)]
1 [(1, 0, 0)]
2 [(2, 0, 0), (2, 0, 0)]
3 [(3, 0, 0), (3, 0, 0)]
4 [(4, 0, 0), (4, 0, 0)]
...
32 [(0, 2, 0), (0, 2, 0)]
33 [(1, 2, 0), (1, 2, 0)]
34 [(2, 2, 0), (2, 2, 0), (2, 2, 0), (2, 2, 0)]
35 [(3, 2, 0), (3, 2, 0), (3, 2, 0), (3, 2, 0)]
36 [(4, 2, 0), (4, 2, 0), (4, 2, 0), (4, 2, 0)]
So why is it repeatedly calculating the same indices? Some ijk values have up to 8 print statements executed for them.
I know it has something to do with using a cubic grid (dim3(bx, by, bz)), but there is so little good documentation on using CUDA this way that I have no idea what might be going wrong.
I could switch to a “linear” launch, where instead of dim3(bx, by, bz) I use dim3(bx, 1, 1) (equivalent to passing a single integer to the kernel launch), but I really need the separate i, j, k indices, since I’m computing specific offsets from them, e.g.:
ijk = i + (Nx * j) + (Nx * Ny * k);
jm1 = i + (Nx * (j - 1)) + (Nx * Ny * k);
km1 = i + (Nx * j) + (Nx * Ny * (k - 1));
Follow-up question: why is a 3D grid, dim3(x, y, z), almost never used? I can’t find any good documentation, examples, Stack Overflow posts, or blog posts that utilize CUDA this way. Most people use only the x-dimension, and occasionally x and y. I haven’t found a single use case that does it this way, yet it’s always described as possible before being ignored.