For example, if all 1024 threads each perform an atomicAdd() on a different address, will it stall immediately, or when another group performs another atomic on the same address, or at the end of the group? Or somewhere else in the life of the kernel?
The concept of a stall should be thought of as applying to a warp. Instructions are issued warp-wide (i.e. for 32 threads at a time), not at a width of 1024 threads or any other width. A block of 1024 threads is 32 warps, and each warp issues its instructions separately.
An instruction typically stalls when it depends on a previous instruction whose results are not ready yet. This is a typical reason for a stall; there are others. (Unfortunately that is a rather dense treatment; scroll down to “Warp Stalls”.)
An atomicAdd, by itself, with no other information, may not necessarily result in a stall. And when it is issued for a particular warp, that has no bearing on other warps. An overview of the instruction execution model is given in unit 3 of this online training series.
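To make the dependency point concrete, here is a minimal CUDA sketch. The kernel names and the run_demo() host helper are made up for illustration and are not from any library. In the first kernel the return value of atomicAdd() is discarded, so no later instruction in the warp waits on it. In the second, the very next instruction consumes the returned value, which is the kind of dependency that can lead to a stall. Whether a stall actually appears depends on the compiler and the GPU, so treat that as something to check in a profiler rather than a guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds to its own element (all addresses are different).
// The return value is discarded, so nothing later depends on the atomic.
__global__ void add_no_dependency(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&data[i], 1);
}

// Same atomic, but the returned old value feeds the next instruction.
// That instruction cannot issue until the atomic's result is available.
__global__ void add_with_dependency(int *data, int *old_vals)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int old = atomicAdd(&data[i], 1);
    old_vals[i] = old + 100;  // depends on the atomic's result
}

// Hypothetical host helper: runs both kernels with one block of 1024
// threads (32 warps) and returns true if the results are as expected.
bool run_demo()
{
    const int n = 1024;
    int *data = nullptr;
    int *old_vals = nullptr;

    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) {
        return false;
    }
    if (cudaMallocManaged(&old_vals, n * sizeof(int)) != cudaSuccess) {
        cudaFree(data);
        return false;
    }
    for (int i = 0; i < n; ++i) {
        data[i] = 0;
        old_vals[i] = 0;
    }

    // Kernels launched into the same stream run one after the other.
    add_no_dependency<<<1, n>>>(data);
    add_with_dependency<<<1, n>>>(data, old_vals);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    if (!ok) {
        printf("kernel launch or execution failed\n");
    }

    // After both kernels each element was incremented twice, and the
    // second kernel saw the old value 1 before its own increment.
    for (int i = 0; ok && i < n; ++i) {
        ok = (data[i] == 2 && old_vals[i] == 101);
    }

    cudaFree(data);
    cudaFree(old_vals);
    return ok;
}
```

To try it, call run_demo() from a main() and compile with nvcc. Profiling the two kernels separately and comparing their warp stall breakdown is one way to see whether the dependent version behaves differently on your hardware.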