Atomic performance under Fermi: need a solution to the scattered write problem

Hi, I have an application that does scattered writes: each thread writes to a random memory location. We ran some tests using atomic operations on a GTX 295 card a while ago, and the performance was very bad. I have now heard that Fermi cards have a cache that speeds up atomic operations tremendously. I don't have a card on hand, but I am wondering how fast it can be. Is atomicAdd 10 times slower than a normal write, or better?
Also, does the performance depend on the degree of collision? Suppose I have 1000 threads writing at the same time. With 10 collisions versus 100 collisions, where each collision is 2 threads writing to the same location, will the performance be the same, or will the case with fewer collisions run faster? And if in one case 10 threads write to the same location, will that run slower?

AtomicAdd is much faster with Fermi thanks to the L2 cache. In cases with minimal collisions, the speed of atomicAdd is about 20x better on Fermi than the GTX 200 series. Because atomics (by definition) require serializing accesses to the same memory location, your performance does depend on the collision rate.
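To get a feel for how the collision rate affects timing on a given card, a small benchmark along these lines could be used. This is only a sketch under stated assumptions: the kernel name `scatteredAdd`, the helper `runScatteredAdd`, the multiplicative hash used to scatter the writes, and the slot counts are all made up for illustration and are not from the application discussed above. Shrinking the number of slots forces more threads onto the same address.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread adds 1 to one slot. Threads that land on the same slot are
// serialized by the hardware; threads on different slots are not.
__global__ void scatteredAdd(unsigned int *out, unsigned int numSlots,
                             unsigned int numWrites)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numWrites) return;
    // Hypothetical hash to scatter the writes (assumption for illustration);
    // a real application would compute its own target index here.
    unsigned int slot = (tid * 2654435761u) % numSlots;
    atomicAdd(&out[slot], 1u);
}

// Runs the kernel once, stores the kernel time in *ms, and returns the sum
// of all slots so the caller can check that no increment was lost.
unsigned long long runScatteredAdd(unsigned int numSlots,
                                   unsigned int numWrites, float *ms)
{
    unsigned int *dOut = nullptr;
    cudaMalloc(&dOut, numSlots * sizeof(unsigned int));
    cudaMemset(dOut, 0, numSlots * sizeof(unsigned int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scatteredAdd<<<(numWrites + 255) / 256, 256>>>(dOut, numSlots, numWrites);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    std::vector<unsigned int> h(numSlots);
    cudaMemcpy(h.data(), dOut, numSlots * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    unsigned long long total = 0;
    for (unsigned int v : h) total += v;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dOut);
    return total;
}

int main()
{
    const unsigned int numWrites = 1u << 22;
    // Many slots: few collisions. One slot: every thread collides.
    const unsigned int slotCounts[] = {1u << 22, 1u << 12, 1u};
    for (unsigned int numSlots : slotCounts) {
        float ms = 0.0f;
        unsigned long long total = runScatteredAdd(numSlots, numWrites, &ms);
        assert(total == numWrites);  // atomics must not lose any increment
        printf("%u slots: %.3f ms\n", numSlots, ms);
    }
    return 0;
}
```

The first measurement may include one-time startup overhead, so it is probably worth running each configuration more than once before comparing numbers.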

The performance is dependent on the number of collisions as it takes time to resolve the conflicts.

Generally it is a good idea to avoid atomics if you are doing a lot of them or expecting a lot of collisions.

I find the speed of atomics on Fermi now to be sufficient for general purpose histogramming, at least in cases where I have > 50 bins. There are more sophisticated histogram techniques to reduce the usage of atomics, but when the histogram filling operation itself is not the primary bottleneck, a simple atomicAdd is pretty good.
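For reference, the simple approach could look roughly like the first kernel below: one global atomicAdd per input element. The second kernel shows one common way to cut down on global atomics, by accumulating per-block counts in shared memory first. That variant is an assumption about what "more sophisticated" might mean here, not necessarily the technique the poster had in mind. The names `histGlobal`, `histShared`, `histogram`, and the bin count of 64 are hypothetical.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 64  // assumed bin count, picked to be above 50

// Simple version: one global atomicAdd per input element.
__global__ void histGlobal(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[data[i] % NUM_BINS], 1u);
}

// Privatized version: each block counts into shared memory, then merges its
// partial histogram into global memory with at most NUM_BINS global atomics.
__global__ void histShared(const unsigned char *data, int n, unsigned int *bins)
{
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&local[data[i] % NUM_BINS], 1u);
    __syncthreads();

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        if (local[b] != 0) atomicAdd(&bins[b], local[b]);
}

// Host helper: copies the input to the device, runs one of the two kernels,
// and returns the NUM_BINS counts.
std::vector<unsigned int> histogram(const std::vector<unsigned char> &input,
                                    bool useShared)
{
    int n = static_cast<int>(input.size());
    unsigned char *dData = nullptr;
    unsigned int *dBins = nullptr;
    cudaMalloc(&dData, n);
    cudaMalloc(&dBins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(dData, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, NUM_BINS * sizeof(unsigned int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (useShared) histShared<<<blocks, threads>>>(dData, n, dBins);
    else           histGlobal<<<blocks, threads>>>(dData, n, dBins);

    std::vector<unsigned int> bins(NUM_BINS);
    cudaMemcpy(bins.data(), dBins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);  // blocks until the kernel has finished
    cudaFree(dData);
    cudaFree(dBins);
    return bins;
}

int main()
{
    // Synthetic input (assumption for illustration only).
    std::vector<unsigned char> input(1 << 20);
    for (size_t i = 0; i < input.size(); ++i)
        input[i] = static_cast<unsigned char>((i * 31) % 256);

    std::vector<unsigned int> a = histogram(input, false);
    std::vector<unsigned int> b = histogram(input, true);
    assert(a == b);  // both kernels must produce the same counts
    printf("bin 0 holds %u entries\n", a[0]);
    return 0;
}
```

Whether the shared-memory version is worth the extra code depends on how much of the total run time the histogram fill actually takes, which is the point made above.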