Sum of N numbers in parallel in pairs without repetition.

int sh = threadIdx.x + 32 * threadIdx.y;
int tid = 512 * blockIdx.x + 1048576 * blockIdx.y + sh;

square_array <<< dim3(2048,32,1),dim3(32,16,1) >>> (memoiregraphique1, memoiregraphique2,N,N2);

I launch 2048*32 blocks, each with 32*16 = 512 threads.

block 0,0: I have 512 threads, indexed by int sh = threadIdx.x + 32 * threadIdx.y;
block 0,1: I need indices 512 to 1023, so + 512 * blockIdx.x
block 0,2
block 0,3
...
block 0,2047
block 1,0: the first row of 2048 blocks has already covered
512*2048 = 1048576 indices, so + 1048576 * blockIdx.y
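The index arithmetic above can be checked with a small sketch. This is only an illustration, not the original kernel: `linear_index` and `index_demo` are hypothetical names, and the block and grid sizes are passed in (or read from `blockDim` and `gridDim`) instead of being hard-coded as 32, 512 and 1048576.

```cuda
// Hypothetical helper: same arithmetic as the post, with the launch
// dimensions as parameters instead of the constants 32, 512 and 1048576.
__host__ __device__ inline long long linear_index(
    int tx, int ty,      // threadIdx.x, threadIdx.y
    int bx, int by,      // blockIdx.x,  blockIdx.y
    int bdx, int bdy,    // blockDim.x,  blockDim.y
    int gdx)             // gridDim.x
{
    long long threadsPerBlock = (long long)bdx * bdy;   // 32*16 = 512 in the post
    long long sh = tx + (long long)bdx * ty;            // index inside the block
    return sh + threadsPerBlock * bx                    // + 512 * blockIdx.x
              + threadsPerBlock * gdx * by;             // + 1048576 * blockIdx.y
}

// Hypothetical demo kernel: every thread writes its own global index.
__global__ void index_demo(long long *out, long long n)
{
    long long tid = linear_index(threadIdx.x, threadIdx.y,
                                 blockIdx.x, blockIdx.y,
                                 blockDim.x, blockDim.y, gridDim.x);
    if (tid < n)          // guard: the grid may be larger than the data
        out[tid] = tid;
}
```

With the launch from the post, dim3(2048,32,1) and dim3(32,16,1), the helper gives 0 for the first thread, 512 for the first thread of block 0,1 and 1048576 for the first thread of block 1,0, which matches the derivation above.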

But then I accidentally called the kernel this way:

square_array <<< 1,190 >>>(…) where 190 is N!/(2!*(N-2)!) with N=20, and it still seems to work. Is it wrong to call the function like this, or is that okay? I think it works anyway. However, if it is wrong, how do I choose the correct number of blocks and threads per block according to N?
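One common way to pick the launch size from N is to fix the threads per block, round the block count up, and guard the kernel against the extra threads. The sketch below assumes one thread per pair. `pair_count`, `blocks_for` and `pair_kernel` are hypothetical names, and the kernel body is a placeholder, not the real square_array.

```cuda
// Number of unordered pairs without repetition: N! / (2! * (N-2)!) = N*(N-1)/2.
__host__ __device__ inline unsigned int pair_count(unsigned int n)
{
    return n * (n - 1) / 2;
}

// Ceiling division: enough blocks to cover `work` items.
__host__ __device__ inline unsigned int blocks_for(unsigned int work,
                                                   unsigned int threadsPerBlock)
{
    return (work + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical kernel: one thread per pair; the body is only a placeholder.
__global__ void pair_kernel(float *out, unsigned int work)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < work)        // threads past the last pair do nothing
        out[tid] = 0.0f;   // placeholder for the real per-pair computation
}
```

For N=20 this gives 190 pairs. A launch such as `pair_kernel<<<blocks_for(190, 128), 128>>>(d_out, 190)` uses 2 blocks, where `d_out` is an assumed device buffer of at least 190 floats. A single block of 190 threads is also a legal configuration as long as 190 does not exceed the per-block thread limit of the device; it just keeps all the work on one multiprocessor.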

I am reusing this post for a follow-up to the same problem. The __global__ function written by “cricri” works great, but now I need to sum more than 33553920 numbers (this is the maximum number of threads for my Tesla S2050). I was told to pass the data to the kernel one piece at a time in a “for” loop, but I don’t know how to manage the data :(. Help me please :( Thanks a lot!
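A minimal sketch of the "one piece at a time" idea, assuming the whole array already sits in device memory and only the launch size is the limit. `process_chunk`, `process_all` and `chunk_len` are hypothetical names, and squaring each element is a stand-in for the real work. If the data did not fit on the device, each iteration would also need its own cudaMemcpy, which is not shown here.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Length of the chunk that starts at `offset`: a full chunk, or the remainder.
__host__ __device__ inline size_t chunk_len(size_t n, size_t offset, size_t chunk)
{
    return (n - offset < chunk) ? (n - offset) : chunk;
}

// Hypothetical kernel: handles `count` elements starting at `offset`.
__global__ void process_chunk(float *data, size_t offset, size_t count)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[offset + i] *= data[offset + i];   // placeholder operation
}

// Host loop: one launch per chunk of at most 65535 * 512 = 33553920 elements.
void process_all(float *d_data, size_t n)
{
    const unsigned int threadsPerBlock = 512;
    const size_t chunk = (size_t)65535 * threadsPerBlock;
    for (size_t offset = 0; offset < n; offset += chunk) {
        size_t count = chunk_len(n, offset, chunk);
        unsigned int blocks =
            (unsigned int)((count + threadsPerBlock - 1) / threadsPerBlock);
        process_chunk<<<blocks, threadsPerBlock>>>(d_data, offset, count);
    }
    cudaDeviceSynchronize();   // wait for all chunks to finish
}
```

Each launch stays within a one-dimensional grid of 65535 blocks, and the offset tells the kernel where its piece begins.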

cricri, why does the code not work when I launch a grid with more than 65535 blocks? Example:

kernel<<<dim3(65535,1000,1),…>>>(…)

???
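Two things are worth checking here, assuming the kernel still uses the index from the top of the post. First, the constant 1048576 is 512*2048, so it only fits a grid that is 2048 blocks wide; with 65535 blocks per row the rows would overlap. Second, 65535*1000*512 threads is more than a 32-bit int can hold. The sketch below is one possible way around both, not the original code: it reads the sizes from `gridDim` and `blockDim`, uses 64-bit indices, and lets each thread loop over the data with a grid-sized stride. `total_threads` and `square_stride` are hypothetical names.

```cuda
#include <climits>

// Hypothetical helper: total number of threads in a launch, in 64 bits.
__host__ __device__ inline unsigned long long total_threads(
    unsigned int gx, unsigned int gy, unsigned int threadsPerBlock)
{
    return (unsigned long long)gx * gy * threadsPerBlock;
}

// Hypothetical kernel: grid-stride loop over n elements, valid for any
// 2D grid of 2D blocks because nothing is hard-coded.
__global__ void square_stride(float *data, unsigned long long n)
{
    unsigned long long tpb = (unsigned long long)blockDim.x * blockDim.y;
    unsigned long long block =
        blockIdx.x + (unsigned long long)gridDim.x * blockIdx.y;
    unsigned long long first =
        block * tpb + threadIdx.x + (unsigned long long)blockDim.x * threadIdx.y;
    unsigned long long stride = tpb * gridDim.x * gridDim.y;

    for (unsigned long long i = first; i < n; i += stride)
        data[i] *= data[i];   // placeholder operation
}
```

Because of the stride loop, the grid no longer has to match the data size: a modest grid such as dim3(2048,32,1) can cover any n that fits in device memory.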