block 0,0: I have 512 threads, so int sh = threadIdx.x + 32*threadIdx.y;
block 0,1: I need 512 to 1024, so + 512 * blockIdx.x
block 0,2
block 0,3
…
block 0,2047
block 1,0: I have done 512*2048 = 1048576, so + 1048576 * blockIdx.y
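The indexing scheme in the list above can be sketched as one helper. This is only a sketch: it assumes a 32x16 thread block (512 threads) and a grid that is 2048 blocks wide, which is what the numbers above suggest. The names `flat_index` and `flat_index_kernel` are hypothetical and not part of the original code.

```cuda
#include <cstddef>

// Hypothetical helper: flattens a 2D thread index inside a 2D grid into one
// global element index. With bdx = 32, bdy = 16 and gdx = 2048 this gives
// sh + 512 * blockIdx.x + 1048576 * blockIdx.y.
__host__ __device__ inline size_t flat_index(unsigned tx, unsigned ty,
                                             unsigned bx, unsigned by,
                                             unsigned bdx, unsigned bdy,
                                             unsigned gdx)
{
    size_t threadsPerBlock = (size_t)bdx * bdy;   // 512 for a 32x16 block
    size_t sh      = tx + (size_t)bdx * ty;       // index inside the block
    size_t blockId = bx + (size_t)gdx * by;       // index of the block in the grid
    return sh + threadsPerBlock * blockId;
}

// Hypothetical kernel showing how the helper would be used on the device.
__global__ void flat_index_kernel(float *data, size_t n)
{
    size_t i = flat_index(threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y,
                          blockDim.x, blockDim.y, gridDim.x);
    if (i < n)               // guard: the grid may cover more than n elements
        data[i] = data[i] * data[i];
}
```

Taking the sizes from blockDim and gridDim instead of hard-coding 512 and 1048576 means the same formula should still hold if the launch configuration changes.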
But then I accidentally called the kernel this way:
square_array <<< 1,190 >>>(…), where 190 is N!/(2!*(N-2)!) with N=20, and yet it seems to work. Is it wrong to call the function like this, or is that okay? I think it works anyway. However, if it is wrong, how do I choose the correct number of threads per block and the number of blocks according to N?
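For reference, one common way to size a launch from the element count is ceiling division plus a bounds check inside the kernel. The sketch below assumes a 1D launch and a kernel that takes a pointer and a count; since the real signature of square_array is not shown here, `square_array_sketch` and `num_blocks` are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed so that
// blocks * threads >= n (ceiling division).
__host__ __device__ inline int num_blocks(int n, int threads)
{
    return (n + threads - 1) / threads;
}

// Hypothetical stand-in for the kernel in the post: squares n elements.
__global__ void square_array_sketch(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)             // the last block may have threads past the end
        a[idx] = a[idx] * a[idx];
}
```

With n = 190 and 128 threads per block, a launch would look like `square_array_sketch<<<num_blocks(190, 128), 128>>>(d_a, 190);`, where `d_a` is assumed to be a device pointer. A single block of 190 threads is within the per-block thread limit, so `<<<1,190>>>` is a valid launch too, but it keeps all the work on one multiprocessor.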
I'm reusing this post for a follow-up to the problem. The global function written by “cricri” works great, but now I have the problem of summing more than 33553920 numbers (this is the maximum number of threads for my Tesla S2050). I was told to pass the data to the kernel function one piece at a time in a “for” loop, but I don't know how to manage the data :(. Help me please :( Thanks a lot!
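A minimal sketch of the "one piece at a time" idea, assuming the limit comes from 65535 blocks of 512 threads in a 1D grid (65535 * 512 = 33553920). The kernel body is only a placeholder, since cricri's function is not shown here, and `process_chunk`, `chunk_count` and `process_all` are hypothetical names.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: how many elements the launch starting at `offset`
// should handle, given at most `maxPerLaunch` elements per launch.
__host__ __device__ inline size_t chunk_count(size_t n, size_t offset,
                                              size_t maxPerLaunch)
{
    size_t remaining = n - offset;
    return remaining < maxPerLaunch ? remaining : maxPerLaunch;
}

// Hypothetical kernel: works on `count` elements starting at `offset`.
__global__ void process_chunk(float *data, size_t offset, size_t count)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count)
        data[offset + i] *= 2.0f;   // placeholder for the real per-element work
}

// Host side: the "for" loop that feeds the data to the kernel in pieces.
void process_all(float *d_data, size_t n)
{
    const unsigned threads = 512;
    const size_t maxPerLaunch = (size_t)65535 * threads;   // 33553920
    for (size_t offset = 0; offset < n; offset += maxPerLaunch) {
        size_t count = chunk_count(n, offset, maxPerLaunch);
        unsigned blocks = (unsigned)((count + threads - 1) / threads);
        process_chunk<<<blocks, threads>>>(d_data, offset, count);
    }
    cudaDeviceSynchronize();   // wait for all launches to finish
}
```

Here the whole array is assumed to fit in device memory already, so only the launches are split. If the data did not fit, each iteration would also need its own cudaMemcpy of the current piece.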