# 3D Grid and 3D Block Thread Indexing for 2 Nested For Loops

Hello,

I am new to CUDA. I am parallelizing a serial task that traverses 10M bodies and compares each body with the other BODIES-1 bodies. The total is therefore about 10M * 10M = 1e+14 iterations in a traditional nested for loop.

The problem is that the serial code treats this as a 2D problem, i.e. 2 nested for loops. When I try to reach 1e+14 threads, I have to use a 3D grid and 3D blocks because of the limit on the number of blocks: the most I can get from a 2D grid and 2D block is `65535*65535*32*32` = 4.3979e+12 threads.

So, I’m using a `10000*10000*1000` 3D grid with `10*10*10` threads per block for this 2D problem, which nicely amounts to 1e+14 threads. However, I have no clue how to do the thread indexing. I have a 3D grid and 3D blocks, and I don’t know how to map them onto 2 nested for loops.

I know this much that I’ve got this:

int j=blockIdx.x

And I need to somehow fold the z index in, so that each thread also takes its z location into account:

`int i = blockIdx.y*blockDim.y + threadIdx.y + blockIdx.z*blockDim.z;` (most likely terribly wrong, but it shows what I want to do, if possible)

Any ideas?

Regards,
Akmal

On any recent CUDA version with a cc3.0 or higher GPU, the maximum in the first grid dimension is not 65535 but 2^31-1 (run deviceQuery on your GPU).
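The limits that deviceQuery prints can also be read from a program. The following is a minimal sketch, assuming device 0 is the GPU of interest; it uses `cudaGetDeviceProperties` from the CUDA runtime and prints only the fields relevant to this question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Assumption: device 0 is the GPU you intend to run on.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Compute capability, e.g. 3.0 or higher for the extended x grid limit.
    printf("compute capability: %d.%d\n", prop.major, prop.minor);

    // Maximum number of blocks in each grid dimension (x, y, z).
    printf("max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // Maximum number of threads in a single block.
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

Compile with `nvcc` and run on the target machine; the first value on the "max grid size" line is the x limit discussed above.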

Therefore for any recent GPU, the 2D grid maximum number of threads is `(2^31-1)*65535*1024`, or roughly 1.4e+17, which is well above the 1e+14 you need.

1. One possible approach, based on your apparent choice of 32x32 threadblocks, is the following. It depends on the “extended” x grid dimension beyond 65535, available with cc3.0 and higher:

int idx = threadIdx.x+blockIdx.x*blockDim.x;

int idy = threadIdx.y+blockIdx.y*blockDim.y+blockIdx.z*gridDim.y*blockDim.y;

Use the above indexing and launch 10M/32 (312500) blocks in x, with 32 threads in x. Launch 100000/32 (3125) blocks in y, with 32 threads in y, and 100 blocks in z, so that y and z together cover 3125*32*100 = 10M rows. Use idx for the loop that moves linearly (row-wise) through memory. Use idy for the loop that moves vertically (column-wise) through memory.
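As a rough illustration of the launch described above, here is a scaled-down sketch. The kernel name `countPairs`, the body count `n = 1000`, and the use of 4 z stacks (instead of 10M bodies and 100 stacks) are assumptions chosen so it runs quickly; the per-pair work is a placeholder that just counts the pairs with idx != idy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: counts the (idx, idy) pairs with idx != idy as a
// stand-in for the real body-to-body comparison.
__global__ void countPairs(unsigned long long *count, int n)
{
    // Column index: plain 1D indexing along x.
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Row index: 2D indexing along y, plus an offset of one full
    // "stack" (gridDim.y * blockDim.y rows) for each step in z.
    int idy = threadIdx.y + blockIdx.y * blockDim.y
            + blockIdx.z * gridDim.y * blockDim.y;

    // Guard against padding threads when n is not a multiple of the tile.
    if (idx < n && idy < n && idx != idy) {
        atomicAdd(count, 1ULL);
    }
}

int main()
{
    const int n = 1000;    // assumed, scaled-down body count (10M in the post)
    const int stacks = 4;  // assumed number of z stacks (100 in the post)
    const int tile = 32;   // 32x32 threads per block

    const int blocksX = (n + tile - 1) / tile;                 // covers idx
    const int blocksYTotal = (n + tile - 1) / tile;            // rows, in blocks
    const int blocksY = (blocksYTotal + stacks - 1) / stacks;  // per z stack

    dim3 block(tile, tile);
    dim3 grid(blocksX, blocksY, stacks);

    unsigned long long *dCount = nullptr;
    unsigned long long hCount = 0;
    cudaMalloc(&dCount, sizeof(hCount));
    cudaMemset(dCount, 0, sizeof(hCount));

    countPairs<<<grid, block>>>(dCount, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(&hCount, dCount, sizeof(hCount), cudaMemcpyDeviceToHost);
    cudaFree(dCount);

    const unsigned long long expected = (unsigned long long)n * (n - 1);
    printf("pairs visited: %llu, expected: %llu\n", hCount, expected);
    return hCount == expected ? 0 : 1;
}
```

Every (idx, idy) pair is handled by exactly one thread, so the count should match n*(n-1). For the full problem the same kernel indexing applies with the grid dimensions from the post; the single atomic counter is only there to check coverage and would be replaced by the real comparison.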

One way to visualize this is to consider a 2D thread array, with the z-dimension representing “stacks” of the 2D entity. Standard 2D indexing is: