Atomic operations are not necessary to implement a global barrier. However, you do need some sort of memory consistency. CUDA 2.2 has a memory fence (__threadfence) for this purpose, but it seems to be expensive.
In my experience, a global barrier without __threadfence runs in 1-2 us. A global barrier with __threadfence is a lot slower, maybe 10+ us, and takes longer as more thread blocks are synchronized. In summary, no, it may not be wise on current GPUs.
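To illustrate the atomics-free variant, here is a minimal sketch of a lock-free global barrier in the style discussed above. It assumes the kernel is launched with at most one thread block per SM (so all blocks are resident and can make progress), that gridDim.x <= blockDim.x, and that the arrival/release arrays `Ain` and `Aout` are zero-initialized global memory with one slot per block; all names are illustrative. Note it uses only volatile accesses, not __threadfence, so its correctness is not guaranteed by the CUDA memory model, it merely happens to work on the hardware of that era.

```cuda
// Lock-free global barrier sketch (no atomics, no __threadfence).
// Ain/Aout: zero-initialized global arrays, one int per thread block.
// goal: a value not used in any previous call (e.g. the iteration count).
__device__ void global_barrier(volatile int *Ain, volatile int *Aout, int goal)
{
    int bid = blockIdx.x;
    __syncthreads();                    // all threads of this block arrived
    if (threadIdx.x == 0)
        Ain[bid] = goal;                // announce this block's arrival
    if (bid == 0) {
        // block 0 gathers arrivals, then releases everyone
        if (threadIdx.x < gridDim.x)
            while (Ain[threadIdx.x] != goal) ;   // spin on arrival flag
        __syncthreads();
        if (threadIdx.x < gridDim.x)
            Aout[threadIdx.x] = goal;   // release flag for each block
    }
    if (threadIdx.x == 0)
        while (Aout[bid] != goal) ;     // spin until block 0 releases us
    __syncthreads();                    // hold the block until released
}
```

Inserting a __threadfence() after the write to Ain (and after the writes to Aout) gives the slower but properly fenced variant measured above.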
This has already been discussed on this forum, see http://forums.nvidia.com/index.php?showtopic=92819. Also, see Section 3.8 in Volkov, V., and Demmel, J., "Benchmarking GPUs to Tune Dense Linear Algebra", SC08.