Atomics on Kepler

I have an application where I think I need to use atomicAdd(). In each kernel execution I need to accumulate something like 8 contributions into each address (the total number of addresses is basically the size of the video memory). I can see three ways to do this:

1. Avoid atomicAdd() altogether by splitting the operation into 8 non-conflicting kernel executions, but this would require 8 global reads and 8 global writes per address.

2. Scatter the contributions into global memory using atomicAdd(), but if I try to maximize my cache hit rate then I will also cause conflicts/serialization of the atomic operations.

3. Read blocks of addresses into shared memory, scatter all the contributions into them using atomicAdd(), and then write them back to global memory (see the sketch at the end of this post).

With either of the atomic approaches (2 and 3), I think I can eliminate intra-warp conflicts and inter-block conflicts, and I can also arrange it so that each thread will typically (though somewhat irregularly) scatter several contributions into the same address. The only remaining conflicts would come from threads in other warps of the same block.

Which approach is likely to work best? Will a user-managed cache (in shared memory) outperform the hardware-managed cache?
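
To make option 3 concrete, here is roughly the kind of kernel I have in mind. This is only a sketch: I'm assuming float accumulation, that each block owns one contiguous tile of the output array (so a plain write-back is safe), and computeContribution() is a made-up placeholder for however the 8 contributions per thread actually get produced.

```
#define TILE 4096               // output addresses staged per block (16 KB of shared memory as floats)
#define CONTRIBS_PER_THREAD 8

// Placeholder for the real per-thread work: returns a contribution value
// and writes the index (within this block's tile) it should be added to.
__device__ float computeContribution(int tileBase, int c, int tid, int *localIdx)
{
    *localIdx = (tid * CONTRIBS_PER_THREAD + c) % TILE;  // dummy index
    return 1.0f;                                         // dummy value
}

__global__ void accumulateTiled(float *gOut)
{
    __shared__ float tile[TILE];
    int tileBase = blockIdx.x * TILE;

    // Load this block's tile of the output array into shared memory.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        tile[i] = gOut[tileBase + i];
    __syncthreads();

    // Scatter this thread's contributions into the shared tile.
    // The only possible conflicts here are with other warps in the same block.
    for (int c = 0; c < CONTRIBS_PER_THREAD; ++c) {
        int localIdx;
        float v = computeContribution(tileBase, c, threadIdx.x, &localIdx);
        atomicAdd(&tile[localIdx], v);
    }
    __syncthreads();

    // Write the tile back. A plain store is only correct because no other
    // block touches these addresses; otherwise this step would itself need atomics.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        gOut[tileBase + i] = tile[i];
}
```

This would be launched with one block per tile, e.g. accumulateTiled<<<numAddresses / TILE, 256>>>(dOut). Option 2 would look the same except the atomicAdd() goes straight to &gOut[...] and the staging loops disappear.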