I have coded an iterative algorithm using two techniques:
1) while (!converg) {
       Kernelinvocation(…, converg); // updates a single global convergence flag: concurrent writes to global device memory
   }
2) while (GPUReduce(Convergflags[nbthread])) {
       Kernelinvocation(…, Convergflags); // updates one convergence flag per thread: no concurrent writes
   }
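To make the two variants concrete, here is a minimal CUDA sketch of what I mean. It is not my real code: the toy update (decrement every positive element), the data set, and all names (`stepSharedFlag`, `stepPerThreadFlags`, `uploadToyData`, `runSharedFlag`, `runPerThreadFlags`) are hypothetical, and `thrust::reduce` simply stands in for `GPUReduce`. In variant 1 every racing thread stores the same value, which, as far as I understand, is why the result is still usable. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <vector>

// Variant 1: any thread that still changes its element clears ONE shared flag.
__global__ void stepSharedFlag(int *data, int n, int *converg) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0) {
        data[i] -= 1;   // toy update step (assumption, not the real algorithm)
        *converg = 0;   // racing stores, but every writer stores the same value
    }
}

// Variant 2: each thread writes only its own flag (1 = still changing).
__global__ void stepPerThreadFlags(int *data, int n, int *flags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int active = data[i] > 0;
        if (active) data[i] -= 1;
        flags[i] = active;
    }
}

// Hypothetical helper: fills the device array with the values 0..7 repeated.
static int *uploadToyData(int n) {
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i % 8;
    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    return d;
}

// Returns the number of kernel launches needed with the single shared flag.
int runSharedFlag(int n) {
    int *d_data = uploadToyData(n), *d_converg = nullptr;
    cudaMalloc(&d_converg, sizeof(int));
    int threads = 256, blocks = (n + threads - 1) / threads;
    int converg = 0, launches = 0;
    while (!converg) {
        converg = 1;  // assume converged until some thread says otherwise
        cudaMemcpy(d_converg, &converg, sizeof(int), cudaMemcpyHostToDevice);
        stepSharedFlag<<<blocks, threads>>>(d_data, n, d_converg);
        cudaMemcpy(&converg, d_converg, sizeof(int), cudaMemcpyDeviceToHost);
        ++launches;
    }
    cudaFree(d_data);
    cudaFree(d_converg);
    return launches;
}

// Returns the number of kernel launches needed with per-thread flags + reduce.
int runPerThreadFlags(int n) {
    int *d_data = uploadToyData(n), *d_flags = nullptr;
    cudaMalloc(&d_flags, n * sizeof(int));
    int threads = 256, blocks = (n + threads - 1) / threads;
    int launches = 0, stillChanging = 0;
    do {
        stepPerThreadFlags<<<blocks, threads>>>(d_data, n, d_flags);
        // Sum of the flags: non-zero while at least one thread is still changing.
        stillChanging = thrust::reduce(thrust::device, d_flags, d_flags + n);
        ++launches;
    } while (stillChanging > 0);
    cudaFree(d_data);
    cudaFree(d_flags);
    return launches;
}
```

Both variants pay one device-to-host transfer per iteration; the second additionally pays for a full reduction over all flags, while the first copies back a single int.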
In practice, the first method gives the best performance for resolutions < 512²… Sob :/
So I would like to know whether anyone knows how CUDA handles concurrent writes to global memory…
Have you read anything about it?