How can I synchronize all the threads of the whole grid (or of a thread block) that I launch? That is to say, I want all the threads to stop at one point until the last thread reaches that point, and then have all the threads resume execution. Can I do this with `__syncthreads()`??? I tried it, but it doesn’t work perfectly.

I have a second question: at a given point in my kernel code, how can I know which thread is the last one to pass that point? I tried it with a device variable that counts the number of threads passing that point, but it doesn’t work.

Everything in CUDA is a problem! It’s driving me crazy!

To your first question:

`__syncthreads()` will only sync the threads within a single block, not across multiple blocks. To synchronize the whole grid, let the kernel finish: after the kernel launch, call `cudaThreadSynchronize()` on the host to wait until every block has completed, then launch a second kernel for the next phase.
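A minimal sketch of that two-kernel pattern (the kernel names, sizes, and the doubling/increment work are made up for illustration):

```cuda
// Phase 1: each block works independently; __syncthreads() is a
// barrier for the threads of ONE block only.
__global__ void phase1(float *data) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();            // all threads of THIS block reach here
    data[i] = tile[threadIdx.x] * 2.0f;
}

// Phase 2: may safely read anything phase1 wrote, in any block.
__global__ void phase2(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main() {
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    phase1<<<4, 256>>>(d_data);
    cudaThreadSynchronize();    // host blocks until every block of phase1 is done
                                // (cudaDeviceSynchronize() in newer CUDA releases)
    phase2<<<4, 256>>>(d_data);
    cudaThreadSynchronize();

    cudaFree(d_data);
    return 0;
}
```

Note that kernel launches issued to the same stream already execute in order, so the explicit synchronize between the two launches is mainly needed when the host itself must wait (e.g. before copying results back).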

To your second:

If you have a device variable and every thread does `passed = passed + 1`, the read-modify-write sequences from different threads interleave, so increments get lost and the final value is nondeterministic.

sm_11 supports atomic functions (e.g. `atomicAdd()`)–try one of those out.
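A sketch of detecting the last thread past a checkpoint with an atomic counter (requires compiling for sm_11, e.g. `nvcc -arch=sm_11`; the variable and kernel names are illustrative):

```cuda
__device__ unsigned int passed = 0;   // counter in device global memory

__global__ void checkpointKernel(unsigned int totalThreads) {
    // ... work before the checkpoint ...

    // atomicAdd returns the OLD value, so exactly one thread sees
    // totalThreads - 1: that thread is the last one past this point.
    unsigned int old = atomicAdd(&passed, 1u);
    if (old == totalThreads - 1) {
        // only the last thread executes this branch
        passed = 0;   // e.g. reset the counter for reuse
    }
}
```

Note that this only identifies the last thread; it does not make the other threads wait for it, so it is not a grid-wide barrier.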

Why don’t they call that function cudaBlockSynchronize?

It would make more sense.

Yes, you must allow the kernel to complete execution. Start the next phase of your algorithm in a separate kernel.