In my experience, atomics are still much slower than a proper tree reduction. Why? Because of collisions and instruction overhead. I looked at the Kepler assembly code for a shared memory atomic add and found that it's implemented as a do-while loop of 4 instructions:
/*0188*/ /*0x00b31c85c4000000*/ LDSLK P0, R12, [R11];
/*0190*/ /*0x04c300034800c000*/ @P0 IADD R12, R12, 0x1;
/*0198*/ /*0x00b30085cc000000*/ @P0 STSUL [R11], R12;
/*01a0*/ /*0x800021e74003ffff*/ @!P0 BRA 0x188;
If another thread modifies the same memory location between this thread's linked load and conditional store, then this thread has to execute the 4 instructions all over again. So if every thread in a warp tries to update the same location, one unlucky thread would need to execute the loop 32 times, possibly holding back the threads that have already finished (an interesting scheduling problem: do you prioritize the finished threads, or the unlucky thread(s)?).
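For anyone who hasn't stared at LL/SC loops before, here is roughly what that SASS is doing, expressed at the C level with atomicCAS. It's the same retry pattern the CUDA C Programming Guide uses to emulate missing atomics; the function name myAtomicAdd is just mine for illustration:

__device__ int myAtomicAdd(int* address, int val)
{
    int old = *address;
    int assumed;
    do {
        assumed = old;
        // atomicCAS only installs assumed + val if *address still
        // equals assumed; otherwise it hands back the newer value
        // and we retry, just like @!P0 BRA branching back to LDSLK.
        old = atomicCAS(address, assumed, assumed + val);
    } while (old != assumed);
    return old;
}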
According to the Kepler data sheet, the throughput of atomic operations is 64 per cycle for the entire chip when there are no conflicts, which is much lower than the shared memory bandwidth. That figure is consistent with the 4-instruction loop above: a load/store instruction has a throughput of 32 per cycle per SM, so 32 multiplied by 8 SMs and divided by the 4 instructions in the loop gives 32 × 8 / 4 = 64.
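This is also why reducing within the warp first pays off: a tree reduction over the 32 lanes turns up to 32 colliding atomics into a single one. A minimal sketch, assuming all 32 lanes of the warp are active (I'm using the modern __shfl_down_sync intrinsic; on Kepler itself the older __shfl_down plays the same role):

__device__ void warpReducedAdd(int* counter, int val)
{
    // log2(32) = 5 shuffle steps: a tree reduction across the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    // Lane 0 now holds the warp total; issue one atomic for all 32 threads.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(counter, val);
}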
Does anyone know whether it's practical to implement a tree reduction in hardware, so that multiple atomic updates to the same location can be handled in a single clock cycle?
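For reference, the software tree reduction I'm comparing against is the classic shared memory pattern below, which takes log2(blockDim.x) steps with no retries and no serialization on conflicts. This is a minimal sketch assuming blockDim.x is a power of two; the kernel and its names are mine, not from any vendor code:

__global__ void blockReduce(const int* in, int* out)
{
    extern __shared__ int s[];
    unsigned int t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();
    // Halve the number of active threads each step of the tree.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    // One partial sum per block; a second pass (or one atomic per
    // block) combines the partials.
    if (t == 0)
        out[blockIdx.x] = s[0];
}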