Convergence flags + reduction, or concurrent writes to global memory?

I have coded an iterative algorithm using two techniques:

Kernelinvocation(…, converg); // updates a single global convergence flag: concurrent writes to global device memory

Kernelinvocation(…, Convergflags); // updates one convergence flag per thread: no concurrent writes
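
To make the two variants concrete, here is a minimal sketch of what I mean (not my actual code). The kernel names, the `tol` threshold, the per-element test `|new - old| > tol`, and the host helpers are all made up for illustration. The reduction in the second variant is done on the host only to keep the sketch short; a real version would probably reduce on the device.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Variant 1 (hypothetical kernel): one global flag. Every thread that has not
// converged stores 0 to the same address, non-atomically.
__global__ void checkSharedFlag(const float* oldv, const float* newv, int n,
                                float tol, int* converg)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && fabsf(newv[i] - oldv[i]) > tol)
        *converg = 0;  // concurrent write; all writers store the same value
}

// Variant 2 (hypothetical kernel): one flag per thread, so no two threads
// write the same address. The flags must be reduced afterwards.
__global__ void checkPerThreadFlags(const float* oldv, const float* newv, int n,
                                    float tol, int* convergFlags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        convergFlags[i] = (fabsf(newv[i] - oldv[i]) <= tol) ? 1 : 0;
}

// Illustrative helper: copy a host vector to a new device buffer.
static float* upload(const std::vector<float>& v)
{
    float* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(float));
    cudaMemcpy(d, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Variant 1 on the host: set the flag to 1, launch, read back a single int.
bool convergedSharedFlag(const std::vector<float>& a,
                         const std::vector<float>& b, float tol)
{
    int n = (int)a.size();
    float *dA = upload(a), *dB = upload(b);
    int hFlag = 1;
    int* dFlag = nullptr;
    cudaMalloc(&dFlag, sizeof(int));
    cudaMemcpy(dFlag, &hFlag, sizeof(int), cudaMemcpyHostToDevice);
    checkSharedFlag<<<(n + 255) / 256, 256>>>(dA, dB, n, tol, dFlag);
    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(&hFlag, dFlag, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dFlag);
    return hFlag == 1;
}

// Variant 2 on the host: launch, read back n flags, AND them together
// (host-side reduction, chosen only for brevity).
bool convergedPerThreadFlags(const std::vector<float>& a,
                             const std::vector<float>& b, float tol)
{
    int n = (int)a.size();
    float *dA = upload(a), *dB = upload(b);
    int* dFlags = nullptr;
    cudaMalloc(&dFlags, n * sizeof(int));
    checkPerThreadFlags<<<(n + 255) / 256, 256>>>(dA, dB, n, tol, dFlags);
    std::vector<int> flags(n, 0);
    cudaMemcpy(flags.data(), dFlags, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dFlags);
    bool all = true;
    for (int f : flags) all = all && (f == 1);
    return all;
}

int main()
{
    std::vector<float> oldv = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> newv = {1.0f, 2.0f, 3.0f, 4.5f};  // last element moved by 0.5
    printf("shared flag:      %d\n", (int)convergedSharedFlag(oldv, newv, 0.1f));
    printf("per-thread flags: %d\n", (int)convergedPerThreadFlags(oldv, newv, 0.1f));
    return 0;
}
```

In the first variant every writer stores the same value, so the final content of the flag does not depend on which write lands last; the host just has to reset it to 1 before each iteration.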

In practice, the first method gives the best performance for resolutions < 512²… Sob :/

So I would like to know whether anyone knows how CUDA handles concurrent writes to global memory…
Have you read anything about it?