Hi people, my kernel function must sum N elements of a vector, one sum per thread. So if I have N values, I need N threads. How can I automatically compute the number of blocks and threads per block for the kernel launch? This is what I did:
If you define an array vector[N], you are probably trying to access vector[i] with i < 0 or i > N-1.
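For what it's worth, here is a minimal sketch of the usual way to size a 1D launch so that indices never run past N-1. It is plain C++ that emulates the launch on the host, not the original poster's code; the names `blocks_for` and `emulate_launch` are made up for illustration. The idea is to round the block count up and let the surplus threads of the last block skip their work.

```cpp
#include <cassert>
#include <vector>

// Ceiling division: the smallest block count with blocks * threads_per_block >= n.
// Written as quotient plus remainder check so it cannot overflow for large n.
unsigned int blocks_for(unsigned int n, unsigned int threads_per_block) {
    assert(threads_per_block > 0);
    return n / threads_per_block + (n % threads_per_block != 0 ? 1u : 0u);
}

// Host-side emulation of a 1D launch <<<blocks, threads_per_block>>>.
// Each (block, thread) pair computes the index a kernel would get from
// blockIdx.x * blockDim.x + threadIdx.x. Returns how often each element was visited.
std::vector<int> emulate_launch(unsigned int n, unsigned int blocks,
                                unsigned int threads_per_block) {
    std::vector<int> visits(n, 0);
    for (unsigned int b = 0; b < blocks; ++b) {
        for (unsigned int t = 0; t < threads_per_block; ++t) {
            unsigned int i = b * threads_per_block + t;
            if (i >= n) continue;  // in a real kernel: if (i >= n) return;
            ++visits[i];
        }
    }
    return visits;
}
```

With n = 1000 and 256 threads per block, `blocks_for` gives 4 blocks (1024 threads), and the guard makes the last 24 threads do nothing. In an actual kernel the same guard would sit at the top of the function, and the launch would use `blocks_for(n, threads_per_block)` as the grid size.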
You should use cuda-memcheck to see whether your program is trying to access memory outside the bounds of the arrays inside your kernel. If you are on Linux, add -g -G to the compile command and then run “cuda-memcheck ./your_program”.
Yes. It appears that the accesses are OK, so you are just missing something. It is possible that you never compute the value QVect_Dev_Ris[49994999].x, in which case you get whatever was already in memory there. You can check this by first initializing the whole QVect_Dev_Ris array with a known number and then checking whether that number is still there at the end.
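A rough sketch of that check, again emulated on the host in plain C++ rather than taken from the real program: fill the output with a marker value, run the "launch", then look for elements that still hold the marker. The marker `kSentinel`, the `float` element type, and the helper names are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical marker; choose a value the real computation can never produce.
const float kSentinel = -12345.0f;

// Emulates a 1D launch: each "thread" writes exactly one output element.
void emulate_kernel(std::vector<float>& out, unsigned int blocks,
                    unsigned int threads_per_block) {
    assert(threads_per_block > 0);
    for (unsigned int b = 0; b < blocks; ++b) {
        for (unsigned int t = 0; t < threads_per_block; ++t) {
            std::size_t i = static_cast<std::size_t>(b) * threads_per_block + t;
            if (i >= out.size()) continue;   // bounds guard
            out[i] = static_cast<float>(i);  // stand-in for the real result
        }
    }
}

// Index of the first element no thread wrote to, or -1 if all were written.
long first_unwritten(const std::vector<float>& out) {
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (out[i] == kSentinel) return static_cast<long>(i);
    }
    return -1;
}
```

If the grid is too small, say 3 blocks of 256 threads for 1000 elements, `first_unwritten` reports index 768, which is exactly where coverage stops. On the device, the equivalent would be to copy a host array filled with the marker into the output buffer before the launch and inspect it after copying the results back.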
But the code worked before. It stopped working when I introduced the optimization of the number of blocks and threads. If, for example, I use a large number of threads, the code works. Example: