How can I synchronize all the threads of a grid (or of a thread block) that I launch? That is to say, I want to stop all the threads at one point until the last thread reaches that point, and then have all the threads resume execution. Can I do this with “__syncthreads()”? I tried it, but it doesn’t work perfectly.
I have a second question: how can I tell, at a given point in my kernel code, which thread is the last one to pass that point? I tried it with a device variable that counts the number of threads that have passed this point, but it doesn’t work.
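__syncthreads() will only sync threads within a single block, not across multiple blocks. To synchronize all blocks, let your kernel return and call cudaThreadSynchronize() (deprecated in favor of cudaDeviceSynchronize() in newer CUDA releases), which makes the host wait until every block has finished. Then launch a second kernel for the work that comes after the barrier.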
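Here is a minimal sketch of that two-launch pattern. The kernel names phase_one and phase_two, the helper run_two_phases, and the block size of 128 are all made up for illustration. Kernels launched into the same stream already run in order, so the explicit host-side synchronize between them is there mainly to mirror the advice above and to give you a place to check for errors.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Phase 1 (hypothetical name): every thread writes its own slot.
__global__ void phase_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Phase 2 (hypothetical name): every thread reads a slot that was most
// likely written by a different block, so it needs all of phase 1 done.
__global__ void phase_two(const int *data, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[n - 1 - i];
}

// Runs both phases and returns the number of wrong output elements.
int run_two_phases(int n) {
    int *d_data = 0, *d_out = 0;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));

    int threads = 128;                          // assumed block size
    int blocks = (n + threads - 1) / threads;

    phase_one<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();                    // grid-wide barrier: all blocks finished
    phase_two<<<blocks, threads>>>(d_data, d_out, n);
    cudaDeviceSynchronize();

    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != n - 1 - i) ++mismatches;
    return mismatches;
}
```

Calling run_two_phases(1000) from main() should return 0 if every thread of the second kernel saw the completed results of the first.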
To your second question:
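If you have a device variable, and every thread does passed = passed + 1, what value did each thread increment? It’s nondeterministic: the read, the add, and the write are separate steps, so two threads can read the same old value and one of the increments is lost.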
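sm_11 and later support atomic functions such as atomicAdd(). Try one of those: atomicAdd() returns the value the variable held before the increment, so every thread gets a unique count.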
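As a rough sketch of the atomic approach: each thread takes a "ticket" with atomicAdd(), and the thread whose ticket equals the total thread count minus one was the last to reach that point. The names count_arrivals and run_count are invented for this example, and it assumes every thread in the grid executes the atomicAdd() exactly once.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: counts arrivals and records which thread came last.
__global__ void count_arrivals(unsigned int *counter, unsigned int *last_thread) {
    unsigned int gid   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int total = gridDim.x * blockDim.x;

    // atomicAdd returns the OLD value, so tickets run 0 .. total-1 with no duplicates.
    unsigned int ticket = atomicAdd(counter, 1u);

    // Exactly one thread draws the final ticket: it is the last to pass this point.
    if (ticket == total - 1) *last_thread = gid;
}

// Launches the kernel, returns the final counter value, and stores the
// global index of the last-arriving thread in *last_out.
unsigned int run_count(int blocks, int threads, unsigned int *last_out) {
    unsigned int *d_counter = 0, *d_last = 0;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMalloc(&d_last, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));   // counter must start at zero
    cudaMemset(d_last, 0xFF, sizeof(unsigned int));   // sentinel: no thread recorded yet

    count_arrivals<<<blocks, threads>>>(d_counter, d_last);
    cudaDeviceSynchronize();

    unsigned int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaMemcpy(last_out, d_last, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    cudaFree(d_last);
    return h_counter;
}
```

With 8 blocks of 64 threads the returned count should be 512, and the recorded index will be somewhere in 0..511 (which one varies from run to run). Note that this only tells you who arrived last. It is not a barrier, and having the other threads spin on the counter until it reaches the total can hang if not all blocks are resident on the GPU at once.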