Atomic operations are not necessary to implement a global barrier. However, you do need some sort of memory consistency. CUDA 2.2 has a memory fence (__threadfence) for this purpose, but it seems to be expensive.
In my experience, a global barrier without __threadfence runs in 1-2 us. A global barrier with __threadfence is a lot slower, maybe 10+ us, and takes longer as more thread blocks are synchronized. In summary, no, it may not be wise on current GPUs.
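To illustrate the atomics-free variant, here is a minimal sketch of a lock-free global barrier in the style discussed above. It assumes the kernel is launched with at most one thread block per SM (so all blocks are resident and can make progress), that gridDim.x <= blockDim.x, and that the arrival/release arrays `Ain` and `Aout` are zero-initialized global memory with one slot per block; all names are illustrative. Note it uses only volatile accesses, not __threadfence, so its correctness is not guaranteed by the CUDA memory model, it merely happens to work on the hardware of that era.

```cuda
// Lock-free global barrier sketch (no atomics, no __threadfence).
// Ain/Aout: zero-initialized global arrays, one int per thread block.
// goal: a value not used in any previous call (e.g. the iteration count).
__device__ void global_barrier(volatile int *Ain, volatile int *Aout, int goal)
{
    int bid = blockIdx.x;
    __syncthreads();                    // all threads of this block arrived
    if (threadIdx.x == 0)
        Ain[bid] = goal;                // announce this block's arrival
    if (bid == 0) {
        // block 0 gathers arrivals, then releases everyone
        if (threadIdx.x < gridDim.x)
            while (Ain[threadIdx.x] != goal) ;   // spin on arrival flag
        __syncthreads();
        if (threadIdx.x < gridDim.x)
            Aout[threadIdx.x] = goal;   // release flag for each block
    }
    if (threadIdx.x == 0)
        while (Aout[bid] != goal) ;     // spin until block 0 releases us
    __syncthreads();                    // hold the block until released
}
```

Inserting a __threadfence() after the write to Ain (and after the writes to Aout) gives the slower but properly fenced variant measured above.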
This has already been discussed on this forum, see http://forums.nvidia.com/index.php?showtopic=92819. Also, see Section 3.8 in Volkov, V., and Demmel, J., "Benchmarking GPUs to Tune Dense Linear Algebra", SC08.