3D Grid and 3D Block Thread Indexing for 2 Nested For Loops

Hello,

I am new to CUDA. I am parallelizing a serial task where I am supposed to traverse over 10M bodies, comparing each body with the other BODIES-1 bodies. Hence the total is 10M * 10M = 1e+14 iterations in a traditional nested-for-loop fashion.

The problem is that the serial code treats this as a 2D problem, i.e. using 2 nested for loops. When I try to reach 1e+14 threads, I have to use a 3D grid and 3D blocks due to the limit on the max number of blocks, i.e. the max I can get from a 2D grid and block is 65535*65535*32*32 ≈ 4.3979e+12 threads.

So, I’m using a (10000*10000*1000) 3D grid with (10*10*10) threads per block for this 2D problem, which nicely amounts to 1e+14 threads. However, I have no clue how to do the thread indexing. I have a 3D grid and block, and I don’t know how to traverse over 2 nested for loops using this.

I know this much that I’ve got this:

int i=blockIdx.y*blockDim.y+threadIdx.y;
int j=blockIdx.x*blockDim.x+threadIdx.x;

And I need to somehow incorporate the z index, so that each thread also takes its z location into account.

int i=blockIdx.y*blockDim.y+threadIdx.y + blockIdx.z*blockDim.z; (most likely terribly wrong, but just want to show what I want to do here, if possible)
int j=blockIdx.x*blockDim.x+threadIdx.x + blockIdx.z*blockDim.z;

Any ideas?

Regards,
Akmal

On any recent CUDA version with a cc3.0 or higher GPU, the maximum in the first grid dimension is not 65535 but 2^31-1 (run deviceQuery on your GPU).

Therefore for any recent GPU, the 2D grid maximum number of threads is:

2147483647*65535*1024 = 144112988985492480 threads, i.e. more than 1e+17 threads.

  1. One possible approach is based on your apparent choice of threadblocks of dimension 32x32. This proposal depends on the “extended” x grid dimension beyond 65535, available with cc3.0:

int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockIdx.y*blockDim.y+blockIdx.z*gridDim.y*blockDim.y;

Use the above indexing and launch 10M/32 (312500) blocks in x, 32 threads in x. Launch 100000/32 (3125) blocks in y, 32 threads in y. Launch 100 blocks in z. Use idx for the loop that moves linearly (row-wise) through memory. Use idy for the loop that moves vertically (column-wise) through memory.

One way to visualize this is to consider a 2D thread array, with the z-dimension representing “stacks” of the 2D entity. Standard 2D indexing is:

int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;

The above gives us proper indexing into the first slice of the stack. Thereafter we only need to adjust for each slice in Z. My scheme above proposes to make the z slices effectively be extensions of y, therefore each slice in z adds 1 whole slice dimension in y, and that is what blockIdx.z*gridDim.y*blockDim.y is doing.

  2. Another possible approach would use a similar idea but allow you to stay within 65535 blocks in x. For this we need to launch 1000 (or 1024) threads in x (and only 1 in y and z) in our threadblock dimensions:

int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = blockIdx.y+blockIdx.z*gridDim.y;

Launch 10000 blocks in x, with 1000 threads in x. Launch 10000 blocks in y, and 1000 blocks in z.

Rather than 10000 blocks in x, with 1000 threads per block, it will probably be a bit more efficient to launch 10M/1024 blocks in x, with 1024 threads (in x) per block. This will require the usual rounding-up for launch, and range checking in your kernel code.