Convergence flags + reduction OR concurrent global memory writes

I have coded an iterative algorithm using two techniques:

1) while (!converg) {
       Kernelinvocation(…, converg); // updates the global convergence flag: concurrent writes to global device memory
   }
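To make method 1 concrete, here is a minimal sketch of how I picture it. It is not my real kernel: `step`, `iterate_with_global_flag`, the halving update and the `tol` test are all hypothetical placeholders. The assumption is that the host sets the flag to 1 before each launch and any thread that has not converged overwrites it with 0. Every writer stores the same value, so it does not matter which write lands last.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// Hypothetical toy update: halve each value. A thread counts as
// "not converged" while its change exceeds tol.
__global__ void step(float* x, int n, float tol, int* converged) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float old_value = x[i];
    float new_value = 0.5f * old_value;
    x[i] = new_value;
    // Concurrent, non-atomic write to one global word. All writers
    // store the same value (0), so the order of the writes is irrelevant.
    if (fabsf(new_value - old_value) > tol) *converged = 0;
}

// Host loop for method 1. Returns the number of kernel launches.
int iterate_with_global_flag(float* d_x, int n, float tol, int max_iters) {
    int* d_converged = nullptr;
    cudaMalloc(&d_converged, sizeof(int));
    int h_converged = 0;
    int iters = 0;
    while (!h_converged && iters < max_iters) {
        h_converged = 1;  // assume converged until some thread says otherwise
        cudaMemcpy(d_converged, &h_converged, sizeof(int), cudaMemcpyHostToDevice);
        step<<<(n + 255) / 256, 256>>>(d_x, n, tol, d_converged);
        // Blocking copy on the default stream: waits for the kernel to finish.
        cudaMemcpy(&h_converged, d_converged, sizeof(int), cudaMemcpyDeviceToHost);
        ++iters;
    }
    cudaFree(d_converged);
    return iters;
}
```

With this layout the per-iteration overhead is two 4-byte copies, whatever the problem size.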

2) while (GPUReduce(Convergflags[nbthread])) {
       Kernelinvocation(…, Convergflags); // updates a per-thread convergence flag: no concurrent writes
   }
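For comparison, a minimal sketch of method 2 under the same toy assumptions. `step_flags`, `reduce_flags` and `iterate_with_flag_reduction` are hypothetical names, and the simple shared-memory sum below only stands in for `GPUReduce`, it is not my actual reduction. Note that this sketch reduces after each kernel launch, while my loop above tests the reduction first.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int kBlock = 256;  // threads per block, must be a power of two

// Hypothetical toy update: each thread writes only its own flag,
// so there are no concurrent writes to the same address.
__global__ void step_flags(float* x, int n, float tol, int* not_converged) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float old_value = x[i];
    float new_value = 0.5f * old_value;
    x[i] = new_value;
    not_converged[i] = (fabsf(new_value - old_value) > tol) ? 1 : 0;
}

// Block-level sum: block_sums[b] = number of unconverged threads in block b.
__global__ void reduce_flags(const int* flags, int n, int* block_sums) {
    __shared__ int s[kBlock];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? flags[i] : 0;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) block_sums[blockIdx.x] = s[0];
}

// Host loop for method 2. Returns the number of update launches.
int iterate_with_flag_reduction(float* d_x, int n, float tol, int max_iters) {
    int blocks = (n + kBlock - 1) / kBlock;
    int* d_flags = nullptr;
    int* d_block_sums = nullptr;
    cudaMalloc(&d_flags, n * sizeof(int));
    cudaMalloc(&d_block_sums, blocks * sizeof(int));
    std::vector<int> h_block_sums(blocks);
    bool converged = false;
    int iters = 0;
    while (!converged && iters < max_iters) {
        step_flags<<<blocks, kBlock>>>(d_x, n, tol, d_flags);
        reduce_flags<<<blocks, kBlock>>>(d_flags, n, d_block_sums);
        // Blocking copy on the default stream: waits for both kernels.
        cudaMemcpy(h_block_sums.data(), d_block_sums, blocks * sizeof(int),
                   cudaMemcpyDeviceToHost);
        converged = true;
        for (int s : h_block_sums) {
            if (s != 0) converged = false;  // some thread is still iterating
        }
        ++iters;
    }
    cudaFree(d_flags);
    cudaFree(d_block_sums);
    return iters;
}
```

Compared with the single flag, each iteration here pays for an extra kernel launch, a full write and read of the flag array, and a copy of one int per block.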

In practice, the first method gives the best performance for resolutions < 512²… Sob :/

So I would like to know whether anyone knows how CUDA handles concurrent writes to global memory…
Have you read anything about it?