I am new to CUDA. I am parallelizing a serial task that traverses 10M bodies, comparing each body with the other BODIES-1 bodies. Hence the total is 10M * 10M = 1e+14 iterations in the traditional nested for loop fashion.
The problem is that the serial code treats this as a 2D problem, i.e. it uses 2 nested for loops, so when I try to reach 1e+14 threads I have to use a 3D grid and 3D blocks because of the limit on the max number of blocks. The max I can get from a 2D grid and block is (65355*65355*32*32) = 4.3737866e+12 threads.
So, I’m using a (10000*10000*1000) 3D grid with 10*10*10 threads per block for this 2D problem, which nicely amounts to 1e+14 threads. However, I have no clue how to do the thread indexing. I have a 3D grid and 3D blocks, and I don’t know how to map them onto the 2 nested for loops.
I know this much that I’ve got this:
And I need to somehow bring in the z index, so that each thread also takes its z location into account, something like this (most likely terribly wrong, but it shows what I want to do, if possible):
int i = blockIdx.y*blockDim.y + threadIdx.y + blockIdx.z*blockDim.z;
int j = blockIdx.x*blockDim.x + threadIdx.x + blockIdx.z*blockDim.z;
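To make the goal concrete, here is a rough sketch of one possible mapping, not a confirmed solution: flatten the 3D block index and 3D thread index into a single linear id, then split that id into (i, j) with a division and a remainder. The names Idx3, linearThreadId, pairFromLinear and comparePairs are made up for illustration. The index math is written as plain functions so it can be checked on the host; the kernel part is only compiled under nvcc.

```cpp
// Sketch only: Idx3 and all function names are hypothetical helpers.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Stand-in for the x/y/z members of blockIdx, gridDim, threadIdx, blockDim.
struct Idx3 { unsigned x, y, z; };

// Flatten (block index, thread index) of a 3D launch into one linear id.
// 64-bit arithmetic is needed: 1e+14 does not fit in 32 bits.
HD inline unsigned long long linearThreadId(Idx3 bIdx, Idx3 gDim,
                                            Idx3 tIdx, Idx3 bDim) {
    unsigned long long block =
        bIdx.x + (unsigned long long)gDim.x *
                 (bIdx.y + (unsigned long long)gDim.y * bIdx.z);
    unsigned long long threadsPerBlock =
        (unsigned long long)bDim.x * bDim.y * bDim.z;
    unsigned long long thread =
        tIdx.x + (unsigned long long)bDim.x *
                 (tIdx.y + (unsigned long long)bDim.y * tIdx.z);
    return block * threadsPerBlock + thread;
}

// Split a linear id into the (i, j) pair of the two nested loops over n bodies.
HD inline void pairFromLinear(unsigned long long tid, unsigned long long n,
                              unsigned long long* i, unsigned long long* j) {
    *i = tid / n;  // outer loop index
    *j = tid % n;  // inner loop index
}

#ifdef __CUDACC__
__global__ void comparePairs(unsigned long long n) {
    Idx3 b{blockIdx.x, blockIdx.y, blockIdx.z};
    Idx3 g{gridDim.x, gridDim.y, gridDim.z};
    Idx3 t{threadIdx.x, threadIdx.y, threadIdx.z};
    Idx3 d{blockDim.x, blockDim.y, blockDim.z};
    unsigned long long tid = linearThreadId(b, g, t, d);
    if (tid >= n * n) return;   // guard threads past the last pair
    unsigned long long i, j;
    pairFromLinear(tid, n, &i, &j);
    if (i == j) return;         // a body is not compared with itself
    // compare body i with body j here
}
#endif
```

With the (10000*10000*1000) grid and 10*10*10 block from above, the last thread gets linear id 1e+14 - 1, which maps to i = j = 9999999. Whether launching that many threads is practical at all is a separate question that this sketch does not address.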