CLK_LOCAL_MEM_FENCE vs CLK_GLOBAL_MEM_FENCE

My initial understanding of these arguments to the barrier() function was that I should use CLK_LOCAL_MEM_FENCE when I want writes to local (= shared) memory to be visible to all threads in a work group, and CLK_GLOBAL_MEM_FENCE when I want the same for writes to global memory. With this in mind, while porting some kernels from CUDA to OpenCL, I replaced each __syncthreads() with whichever variant of barrier() seemed appropriate.

The port was severely slower in a few places, so I started playing around and replaced all global barriers with local barriers. To my surprise, the kernels still produce correct results and performance is significantly better in some places. The most extreme case is one short kernel that runs 20x faster with local barriers than with global barriers.
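To make the substitution concrete, here is a simplified sketch of the pattern I mean. It is not one of my actual kernels: the kernel name reverse_in_group and its arguments are made up for illustration. The block under #ifndef __OPENCL_VERSION__ is only a host-side stand-in so the snippet also compiles as plain C, where it behaves like a work group of a single work-item. An OpenCL compiler defines __OPENCL_VERSION__ and skips it.

```c
#ifndef __OPENCL_VERSION__
/* Host-side stand-ins (NOT part of OpenCL, illustration only): they let the
   snippet compile as plain C and act like a work group of one work-item. */
#include <stddef.h>
#define __kernel
#define __global
#define __local
#define CLK_LOCAL_MEM_FENCE  1
#define CLK_GLOBAL_MEM_FENCE 2
static size_t get_local_id(unsigned dim)   { (void)dim; return 0; }
static size_t get_local_size(unsigned dim) { (void)dim; return 1; }
static size_t get_global_id(unsigned dim)  { (void)dim; return 0; }
static void   barrier(int flags)           { (void)flags; }
#endif

/* Hypothetical kernel: reverses the elements handled by each work group. */
__kernel void reverse_in_group(__global const float *in,
                               __global float *out,
                               __local float *tile)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t n   = get_local_size(0);

    /* Each work-item writes one slot of local (shared) memory. */
    tile[lid] = in[gid];

    /* The CUDA original had __syncthreads() here. Only local-memory writes
       are read by other work-items of the group, so the local flag is the
       one that matches; I had used CLK_GLOBAL_MEM_FENCE in similar spots. */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Read a slot written by a different work-item of the same group. */
    out[gid] = tile[n - 1 - lid];
}
```

On the host, the __local argument would be sized with something like clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL), where kernel and local_size are placeholders for the real kernel object and work-group size.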

My assumption was that on NVIDIA hardware the two variants of barrier() would simply compile to the same instructions, so it shouldn't really matter which one I use. Am I misunderstanding this? Does the behaviour I observe make sense?

EDIT: OK, I took a closer look at my kernels and I am now pretty sure that I didn't actually need a global barrier anywhere. The performance difference still puzzles me, though, given that __syncthreads() in CUDA is not nearly as expensive as a global barrier in OpenCL.