Hello,

I am new to CUDA. I am parallelizing a serial task that traverses 10M bodies and compares each body with every other body, i.e. BODIES-1 comparisons per body. The total is therefore roughly 10M * 10M = 1e+14 iterations in a traditional nested for loop.
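For reference, the serial traversal described above has roughly the following shape. This is only a simplified sketch, not the real code: `serialAllPairs` is a made-up name, and the actual comparison is reduced to a comment and a counter.

```cuda
#include <cstddef>

// Hypothetical stand-in for the serial code: every body i is compared with
// every other body j. Returns the number of comparisons, which is n * (n - 1).
unsigned long long serialAllPairs(std::size_t n)
{
    unsigned long long comparisons = 0;
    for (std::size_t i = 0; i < n; ++i) {        // outer loop over bodies
        for (std::size_t j = 0; j < n; ++j) {    // inner loop over bodies
            if (i == j) continue;                // a body is not compared with itself
            // ... the real comparison of body i with body j would go here ...
            ++comparisons;
        }
    }
    return comparisons;
}
```

For n = 10M this gives about 1e+14 comparisons, which is the iteration count quoted above.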

The problem is that the serial code treats this as a 2D problem, i.e. two nested for loops. When I try to launch 1e+14 threads, one per iteration, I have to use a 3D grid and 3D blocks because of the limit on the number of blocks: the most I can get from a 2D grid with 2D blocks is 65535*65535*32*32 = 4.3979e+12 threads.
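The exact limits depend on the device, so it may be worth querying them at run time rather than hard-coding them. A minimal sketch, assuming the CUDA runtime API is available; `maxLaunchThreads` and `printLaunchLimits` are made-up helper names:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on the number of threads in one launch.
// double is used because the product can exceed 64 bits on devices with a
// large grid limit in the x dimension.
inline double maxLaunchThreads(double gridX, double gridY, double gridZ,
                               double threadsPerBlock)
{
    return gridX * gridY * gridZ * threadsPerBlock;
}

// Hypothetical helper: print the launch limits of one device.
// Returns false if the device properties cannot be queried.
bool printLaunchLimits(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    std::printf("max grid size:         %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    std::printf("max block dimensions:  %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max threads, 2D grid:  %.4e\n",
                maxLaunchThreads(prop.maxGridSize[0], prop.maxGridSize[1], 1,
                                 prop.maxThreadsPerBlock));
    std::printf("max threads, 3D grid:  %.4e\n",
                maxLaunchThreads(prop.maxGridSize[0], prop.maxGridSize[1],
                                 prop.maxGridSize[2], prop.maxThreadsPerBlock));
    return true;
}
```

With a 65535 x 65535 grid and 1024 threads per block, `maxLaunchThreads(65535, 65535, 1, 1024)` evaluates to 4397912294400, i.e. about 4.4e+12.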

So I am using a 10000*10000*1000 3D grid with 10*10*10 threads per block for this 2D problem, which amounts to exactly 1e+14 threads. However, I have no clue how to do the thread indexing: I have a 3D grid and 3D blocks, and I don't know how to map them onto the two nested for loops.

This is what I have so far for the 2D case:

int i=blockIdx.y*blockDim.y+threadIdx.y;
int j=blockIdx.x*blockDim.x+threadIdx.x;

What I need is for each thread to also take its z index into account, something like the following (most likely terribly wrong, but it shows what I am trying to do):

int i=blockIdx.y*blockDim.y+threadIdx.y + blockIdx.z*blockDim.z;
int j=blockIdx.x*blockDim.x+threadIdx.x + blockIdx.z*blockDim.z;
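One possible way to do this mapping, as far as I can tell, is to stop treating the three dimensions separately: flatten the 3D block index and the 3D thread index into a single 64-bit global index, then split that index into the (i, j) pair of the nested loop with a division and a modulo. The sketch below is only an illustration under that assumption. `flatToPair`, `visitPairs` and `countPairs` are made-up names, the kernel just counts the pairs with i != j instead of comparing bodies, and the grid shape chosen in `countPairs` is only meant for small demo sizes.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: split a flat index into the (i, j) pair of the
// equivalent nested loop "for i in [0, n) { for j in [0, n) { ... } }".
__host__ __device__ inline void flatToPair(unsigned long long flat,
                                           unsigned long long n,
                                           unsigned long long *i,
                                           unsigned long long *j)
{
    *i = flat / n;  // outer-loop index
    *j = flat % n;  // inner-loop index
}

// Hypothetical kernel: one thread per (i, j) pair, launched on a 3D grid
// of 3D blocks. It only counts the pairs with i != j.
__global__ void visitPairs(unsigned long long n, unsigned long long *visited)
{
    // Flatten the 3D block index (64-bit to avoid overflow on large grids).
    unsigned long long blockId = blockIdx.x
        + (unsigned long long)gridDim.x
          * (blockIdx.y + (unsigned long long)gridDim.y * blockIdx.z);
    // Flatten the 3D thread index within the block (at most 1024 values).
    unsigned long long threadInBlock =
        threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    unsigned long long threadsPerBlock =
        (unsigned long long)blockDim.x * blockDim.y * blockDim.z;
    unsigned long long flat = blockId * threadsPerBlock + threadInBlock;

    if (flat >= n * n) return;  // the grid may hold more threads than pairs

    unsigned long long i, j;
    flatToPair(flat, n, &i, &j);
    if (i != j) {
        atomicAdd(visited, 1ULL);  // stand-in for comparing body i with body j
    }
}

// Hypothetical host wrapper for small demo sizes: launches visitPairs on a
// 4 x 4 x gz grid of 10 x 10 x 10 blocks and returns the number of pairs seen.
unsigned long long countPairs(unsigned long long n)
{
    if (n == 0) return 0;
    unsigned long long hCount = 0;
    unsigned long long *dCount = nullptr;
    cudaMalloc(&dCount, sizeof(hCount));
    cudaMemcpy(dCount, &hCount, sizeof(hCount), cudaMemcpyHostToDevice);

    dim3 block(10, 10, 10);                                   // 1000 threads per block
    unsigned long long blocksNeeded = (n * n + 999) / 1000;   // round up
    dim3 grid(4, 4, (unsigned int)((blocksNeeded + 15) / 16)); // small n only

    visitPairs<<<grid, block>>>(n, dCount);
    cudaMemcpy(&hCount, dCount, sizeof(hCount), cudaMemcpyDeviceToHost);
    cudaFree(dCount);
    return hCount;
}
```

On a working device, `countPairs(100)` should return 9900, i.e. 100 * 99 pairs. The bounds check matters because the total thread count of the grid is rarely an exact multiple of n * n.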

Any ideas?

Regards,

Akmal