Automating the number of blocks and threads per block

Hi everyone, my kernel has to sum N elements of a vector, one sum per thread, so with N values I need N threads. How can I automatically choose the number of blocks and threads per block for the kernel launch? This is what I wrote:

        if (coefBin < 512) {
                numBlocks.x = 1;
                numThreads.x = coefBin;
        } else if ((coefBin / 512) <= 65535) {
                numBlocks.x = (coefBin / 512) + 1;
                numThreads.x = 32;
                numThreads.y = 16;
        } else {
                numBlocks.x = 65535;
                numBlocks.y = ((coefBin / 512) / 65535) + 1;
                numThreads.x = 32;
                numThreads.y = 16;
        }

and I call it with kernel<<<numBlocks,numThreads>>>(…); but it doesn’t work. Can you help me? Thanks!
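A launch configuration like the one above can also be computed with a ceiling division, which avoids the extra block that `(coefBin/512)+1` adds when the count is an exact multiple of 512. The following is only a minimal sketch: `ceilDiv` and `makeGrid` are hypothetical helper names, the 65535 per-dimension limit is the one assumed in the question, and the sketch does not handle sizes so large that grid.y would also exceed that limit.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: integer ceiling division, ceil(a / b) for b > 0.
static unsigned int ceilDiv(unsigned long long a, unsigned long long b)
{
    return static_cast<unsigned int>((a + b - 1) / b);
}

// Hypothetical helper: pick a grid that provides at least n threads
// for a given block shape. maxGridX = 65535 is the per-dimension
// limit assumed in this thread, not a value queried from the device.
static dim3 makeGrid(unsigned long long n, dim3 block,
                     unsigned int maxGridX = 65535)
{
    unsigned long long threadsPerBlock =
        (unsigned long long)block.x * block.y * block.z;
    unsigned long long blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (blocks == 0) blocks = 1;          // a grid dimension of 0 is invalid

    dim3 grid(1, 1, 1);
    if (blocks <= maxGridX) {
        grid.x = (unsigned int)blocks;    // everything fits in one grid row
    } else {
        grid.x = maxGridX;                // fill x, spill the rest into y
        grid.y = ceilDiv(blocks, maxGridX);
    }
    return grid;
}
```

A launch would then look like `dim3 block(32, 16, 1); kernel<<<makeGrid(coefBin, block), block>>>(…);`. Because the grid is rounded up, it usually contains more threads than elements, so the kernel itself has to skip indices that are not smaller than the element count.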

I have observed that the kernel doesn’t work when gridDim.x = 65535 and gridDim.y is a small number such as 2, 3 or 4. Example:

valuesToBeAdded=49995000
DimBlocks=131070, DimThreadsForBlock=512

gridDim.x=65535, gridDim.y=2
blockDim.x=32, blockDim.y=16

(65535*2)*(32*16)=67107840 threads > 49995000

Print results:

QVect_Dev_Ris[0].x=1.000000

QVect_Dev_Ris[24997500].x=11413.000000

QVect_Dev_Ris[49994999].x=-1998397155538108416.000000 ERROR!!!

Why?

If you define a vector vector[N], you are probably trying to access vector[i] with i < 0 or i > N-1.
You should use cuda-memcheck to see whether your program is trying to access memory outside the bounds of the arrays inside your kernel. If you are on Linux, add -g -G to the compile command and then run “cuda-memcheck ./your_program”.

Then:

========= CUDA-MEMCHECK

dimN=49995000

NumBlocks=131070, NumThreads=512

NumBlocks.x=65535, NumBlocks.y=2

NumThreads.x=32, NumThreads.y=16

QVect_Dev_Ris[0].x=1.000000

QVect_Dev_Ris[24997500].x=11413.000000

QVect_Dev_Ris[49994999].x=-1998397155538108416.000000

========= ERROR SUMMARY: 0 errors

This is the output of cuda-memcheck. Does it mean that I am not going out of bounds?

Yes. It appears that the accesses are fine, so you are missing something else. It is possible that you never compute the value of QVect_Dev_Ris[49994999].x, in which case you read back whatever happened to be in memory. You can check this by first initializing the whole QVect_Dev_Ris array with a known number and then seeing whether that number is still there at the end.
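The initialization check described above could be sketched roughly as follows. This is an illustration under assumptions, not the poster's code: the result array is treated as a plain `float` array (the real QVect_Dev_Ris has an `.x` member, so its element type differs), and `kSentinel`, `fillSentinel` and `countUnwritten` are hypothetical names.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Assumed sentinel: a value the real kernel is expected never to produce.
const float kSentinel = -12345.0f;

// Writes the sentinel into every element. The grid-stride loop lets any
// 1D launch size cover all n elements.
__global__ void fillSentinel(float *data, size_t n, float value)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        data[i] = value;
}

// Counts the elements of a host copy of the results that still hold the
// sentinel, i.e. elements that no thread of the real kernel wrote.
size_t countUnwritten(const std::vector<float> &host, float value)
{
    size_t count = 0;
    for (float v : host)
        if (v == value) ++count;
    return count;
}
```

The intended order is: launch `fillSentinel<<<256, 256>>>(devPtr, n, kSentinel)`, launch the real kernel, copy the results back with `cudaMemcpy`, and call `countUnwritten`. A nonzero count would mean that some elements were never computed, which points at the index calculation or the launch configuration rather than at an out-of-bounds access.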

But the code worked before. It stopped working when I introduced the optimization of the number of blocks and threads. If, for example, I launch a much larger number of threads, the code works. Example:

kernel<<<dim3(65535,2000,1),dim3(32,16,1)>>>(…); works!

kernel<<<dim3(65535,4,1),dim3(32,16,1)>>>(…); doesn’t work! However, it should work, because the number of threads is greater than the number of sums I have to compute.
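The kernel is not shown in this thread, so the cause cannot be pinned down from the posts alone. For comparison, here is a sketch of one common way to turn a 2D grid of 2D blocks into a single element index that does not depend on how large gridDim.y happens to be. `flatIndex` and `sumKernel` are hypothetical names, and the kernel body is a placeholder for the real per-element sum.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: linear index of a thread in a 2D grid of 2D blocks.
// 64-bit arithmetic is used so that large grids do not overflow.
__host__ __device__ inline unsigned long long flatIndex(
    unsigned int bx, unsigned int by, unsigned int gridX,
    unsigned int tx, unsigned int ty,
    unsigned int blockX, unsigned int blockY)
{
    // Blocks are numbered row by row: x varies fastest.
    unsigned long long blockId = (unsigned long long)by * gridX + bx;
    // Threads inside a block are numbered the same way.
    unsigned long long threadId = (unsigned long long)ty * blockX + tx;
    return blockId * ((unsigned long long)blockX * blockY) + threadId;
}

// Hypothetical kernel: one thread per output element.
__global__ void sumKernel(const float *in, float *out, unsigned long long n)
{
    unsigned long long i = flatIndex(blockIdx.x, blockIdx.y, gridDim.x,
                                     threadIdx.x, threadIdx.y,
                                     blockDim.x, blockDim.y);
    if (i >= n) return;   // surplus threads of the rounded-up grid do nothing
    out[i] = in[i];       // placeholder for the real per-element work
}
```

With this numbering, a 65535 x 2 grid of 32 x 16 blocks covers indices 0 to 67107839 exactly once each, so every element below 49995000 is assigned to one thread regardless of whether gridDim.y is 2, 4 or 2000.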

I don’t know what to do :/

Any ideas, pasoleatis?