Thread sync

Hello,

Is there a way to make all threads wait at a barrier instruction?

What I need to do is perform some calculations, wait for all threads to finish (and be sure they have finished), and then start another set of calculations in the same kernel. Something like this:

for (int i = 1; i <= arbitrary_number; i++) {
    // do stuff
    // wait for threads to finish
}

Basically, I want all threads to synchronize after every iteration of the loop. Is that possible?

Am I correct that __syncthreads() only works for threads within a single block?

Crap, wrong forum, this should be in "CUDA Programming and Development", sorry. If a moderator sees this, could you move the topic?

It is true that __syncthreads() only synchronizes threads within the same block. CUDA currently has no device-wide synchronization inside a kernel, so to get one you have to rely on another kernel launch: when the second kernel starts executing in the same stream, all work in the first kernel is guaranteed to have finished.

And of course, kernel launches have their own overhead, and there are ways to reduce it, but that’ll be another topic…

It may be that what you are looking for is just a memory fence function. Take a look at __threadfence() and __threadfence_system().
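To illustrate what a fence can and cannot do, here is a minimal sketch. A fence does not make threads wait. It only orders the calling thread's memory writes as seen by other threads. The pattern below (each block publishes a partial sum, then the last block to finish adds them up) is one common use. The names sum_kernel, device_sum, blocks_done, and partial are made up for this example, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical counter of blocks that have published their partial sum.
__device__ unsigned int blocks_done = 0;

// Each block sums its slice into partial[blockIdx.x]; the last block to
// finish adds up all partial sums and writes the result to *total.
__global__ void sum_kernel(const int* data, int n, volatile int* partial, int* total)
{
    __shared__ int block_sum;
    __shared__ bool is_last_block;

    if (threadIdx.x == 0) block_sum = 0;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) atomicAdd(&block_sum, data[idx]);
    __syncthreads();  // block-level barrier: block_sum is now complete

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = block_sum;
        // Make the partial sum visible device-wide before the counter changes.
        __threadfence();
        unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
        is_last_block = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // Only the last block to arrive can be sure every partial sum is visible.
    if (is_last_block && threadIdx.x == 0) {
        int sum = 0;
        for (unsigned int b = 0; b < gridDim.x; b++) sum += partial[b];
        *total = sum;
        blocks_done = 0;  // reset so the kernel can be launched again
    }
}

// Host wrapper (assumed helper): returns the sum of `input` computed on the GPU.
int device_sum(const std::vector<int>& input)
{
    int n = static_cast<int>(input.size());
    if (n == 0) return 0;
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    int *d_data = nullptr, *d_partial = nullptr, *d_total = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_data, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    sum_kernel<<<blocks, threads>>>(d_data, n, d_partial, d_total);

    int total = 0;
    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_partial);
    cudaFree(d_total);
    return total;
}
```

Note that no block ever waits for another here. The other blocks simply exit, and whichever block happens to arrive last does the final step. This works for a "finish up at the end" stage, but it is not a substitute for a barrier inside a loop.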

Thanks. I settled on the most obvious solution: running the kernel multiple times.
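For anyone finding this later, a rough sketch of the "one launch per iteration" approach might look like the following. The kernel body is only a placeholder for "do stuff" (each element reads a neighbor that may belong to another block, which is why a device-wide sync is needed between iterations). The names step_kernel and run_steps are made up for the example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Placeholder for "do stuff": each output element is the sum of an input
// element and its right neighbor (wrapping around at the end).
__global__ void step_kernel(const int* in, int* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = in[idx] + in[(idx + 1) % n];
    }
}

// Runs `iterations` steps with one kernel launch per step and returns the
// final array. Launches in the same stream execute in order, so each launch
// sees the complete output of the previous one.
std::vector<int> run_steps(const std::vector<int>& input, int iterations)
{
    int n = static_cast<int>(input.size());
    if (n == 0) return {};

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    for (int i = 1; i <= iterations; i++) {
        step_kernel<<<blocks, threads>>>(d_in, d_out, n);
        std::swap(d_in, d_out);  // this step's output is the next step's input
    }

    std::vector<int> result(n);
    // This blocking copy on the default stream waits for all launches above.
    cudaMemcpy(result.data(), d_in, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The two buffers are swapped on the host between launches so that no thread reads a value another thread is overwriting in the same step. No explicit cudaDeviceSynchronize() is needed inside the loop, since the launches are already serialized within the stream.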