I have coded an iterative algorithm using two techniques:
Kernelinvocation(…, converg);      // updates a single global convergence flag: concurrent writes to global device memory
Kernelinvocation(…, Convergflags); // updates one convergence flag per thread: no concurrent writes
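To make the two variants concrete, here is a minimal sketch, not my actual code. The kernel names, the `relax` update rule, the tolerance and the host loops are all made up for illustration. In the first variant every thread that has not converged stores the same value (0) into one `int`. As far as I understand the CUDA programming guide, such non-atomic stores to the same address are serialized and at least one of them lands, so the flag ends up 0 whichever thread "wins".

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical update rule, for illustration only: halve each value.
__device__ float relax(float x) { return 0.5f * x; }

// Variant 1: a single global flag. The host sets it to 1 before the launch;
// every thread that has not converged stores 0 (all writers store the same value).
__global__ void stepGlobalFlag(float* data, int n, float tol, int* converged)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float old = data[i];
    data[i] = relax(old);
    if (fabsf(data[i] - old) > tol) *converged = 0;
}

// Variant 2: one flag per thread, so no two threads write the same address.
__global__ void stepPerThreadFlags(float* data, int n, float tol, int* flags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float old = data[i];
    data[i] = relax(old);
    flags[i] = (fabsf(data[i] - old) <= tol) ? 1 : 0;
}

// Host loop for variant 1: only one int crosses the bus per iteration.
int runGlobalFlag(int n, float tol)
{
    std::vector<float> host(n, 1.0f);
    float* d_data; int* d_converged;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_converged, sizeof(int));
    cudaMemcpy(d_data, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 256, blocks = (n + threads - 1) / threads;
    int converged = 0, iters = 0;
    while (!converged) {
        converged = 1;
        cudaMemcpy(d_converged, &converged, sizeof(int), cudaMemcpyHostToDevice);
        stepGlobalFlag<<<blocks, threads>>>(d_data, n, tol, d_converged);
        // This blocking copy also waits for the kernel to finish.
        cudaMemcpy(&converged, d_converged, sizeof(int), cudaMemcpyDeviceToHost);
        ++iters;
    }
    cudaFree(d_data); cudaFree(d_converged);
    return iters;
}

// Host loop for variant 2: all n flags are copied back and scanned on the CPU.
int runPerThreadFlags(int n, float tol)
{
    std::vector<float> host(n, 1.0f);
    std::vector<int> flags(n);
    float* d_data; int* d_flags;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_flags, n * sizeof(int));
    cudaMemcpy(d_data, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 256, blocks = (n + threads - 1) / threads;
    bool all = false; int iters = 0;
    while (!all) {
        stepPerThreadFlags<<<blocks, threads>>>(d_data, n, tol, d_flags);
        cudaMemcpy(flags.data(), d_flags, n * sizeof(int), cudaMemcpyDeviceToHost);
        all = true;
        for (int f : flags) if (!f) { all = false; break; }
        ++iters;
    }
    cudaFree(d_data); cudaFree(d_flags);
    return iters;
}

int main()
{
    const int n = 256 * 256;   // e.g. a 256 x 256 grid
    const float tol = 1e-4f;
    printf("global flag:      %d iterations\n", runGlobalFlag(n, tol));
    printf("per-thread flags: %d iterations\n", runPerThreadFlags(n, tol));
    return 0;
}
```

Both loops should stop after the same number of iterations, since they apply the same test. In this sketch the second variant copies n flags back to the host every iteration while the first copies a single `int`; whether that matches the cost in my real code I cannot say for sure.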
In practice, the first method gives the best performance for resolutions < 512²… Sob :/
So I would like to know if anyone knows how CUDA manages concurrent writes to global memory…
Have you read anything about it?