OpenCL semaphores again and again

I couldn’t find any way to implement something like semaphores in OpenCL, because GPU resources are limited and the GPU has nothing like OS thread scheduling.
So what is the alternative to semaphores on the GPU? There must be a way to perform multiple operations on the same piece of data without interference from other threads; many applications need this. If semaphores are not possible, what are the alternatives? And how are atomics implemented, then? As I understand it, they rely on the same idea as semaphores, correct?
If not, does anyone have resources on how atomics work?

Perhaps something in this paper can be of use.
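
On the "multiple operations on the same piece of data" part: a common alternative to a semaphore is a compare-and-swap retry loop, which never blocks and so does not depend on thread scheduling. Here is a minimal sketch of the idea in host-side C11, not something taken from the paper. The helper name atomic_mul_add is made up for illustration. In an OpenCL 1.x kernel the corresponding built-in would be atomic_cmpxchg on a __global int, and I have not tested a kernel version of this.

```c
#include <stdatomic.h>

/* Hypothetical helper: atomically replace *p with (*p * mul + add).
 * Returns the value that was in *p just before the update. */
int atomic_mul_add(atomic_int *p, int mul, int add)
{
    int old = atomic_load(p);   /* snapshot of the current value */
    int desired;
    do {
        /* Do the multi-step computation on the private snapshot. */
        desired = old * mul + add;
        /* Publish only if *p still equals old. On failure the call
         * writes the current value of *p into old, and we retry. */
    } while (!atomic_compare_exchange_weak(p, &old, desired));
    return old;
}
```

The whole read-modify-write either lands as one step or is retried, so no thread ever waits while holding a lock. This only works when the shared state fits in a single atomic word.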

Thanks a lot, Martin, it really helps :)