global barrier synchronization

Hi All,

Is it wise to implement global barrier synchronization on the GPU? I think the requirement is hardware that supports atomicAdd.
Can this be efficient at all?

Also, does anyone know of an implemented global barrier synchronization method?

I would appreciate some insights. Thanks in advance.

Atomic operations are not necessary to implement a global barrier. However, you do need some sort of memory consistency. CUDA 2.2 has a memory fence (__threadfence) for this purpose, but it seems to be expensive.

In my experience, a global barrier without __threadfence runs in 1-2 us. A global barrier with __threadfence is a lot slower, maybe 10+ us, and it takes more time the more thread blocks you synchronize. In summary: no, it may not be wise on current GPUs.
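To make the idea concrete, here is a minimal sketch of a lock-free global barrier of the kind described above, without atomics: each block raises an arrival flag, block 0 waits for all flags and then clears them to release everyone. The names (MAX_BLOCKS, g_arrive, global_barrier) are my own, and this is untested illustration, not a production implementation. It also assumes the grid is small enough that all blocks are resident on the GPU simultaneously (and that gridDim.x <= blockDim.x), otherwise the spinning blocks deadlock:

```cuda
#define MAX_BLOCKS 512  // assumed upper bound on gridDim.x

// Arrival flags in global memory, one per block, zero-initialized.
// volatile forces the spin loops to re-read memory each iteration.
__device__ volatile int g_arrive[MAX_BLOCKS];

__device__ void global_barrier()
{
    int bid = blockIdx.x;

    __syncthreads();  // all threads in this block reach the barrier

    if (bid == 0) {
        // Block 0 waits for every other block's flag...
        if (threadIdx.x < gridDim.x)
            while (threadIdx.x != 0 && g_arrive[threadIdx.x] == 0)
                ;
        __syncthreads();
        // ...then clears the flags, releasing the other blocks.
        if (threadIdx.x < gridDim.x)
            g_arrive[threadIdx.x] = 0;
    } else {
        if (threadIdx.x == 0) {
            g_arrive[bid] = 1;          // signal arrival
            while (g_arrive[bid] != 0)  // spin until block 0 clears it
                ;
        }
    }

    __syncthreads();  // no thread proceeds until the barrier is over
}
```

Note that this version omits __threadfence, which matches the fast (1-2 us) variant above: the blocks are synchronized in time, but data written to global memory before the barrier is not guaranteed visible to other blocks after it. For that guarantee you would add a __threadfence() before setting the arrival flag, at the cost described above.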

This has already been discussed on this forum. Also, see Section 3.8 in Volkov, V., and Demmel, J., "Benchmarking GPUs to Tune Dense Linear Algebra", SC08.