Fermi atomic op 10 times slower than ATI GPU?

The test Beyond3D shows in its article "NVIDIA Fermi GPU and Architecture Analysis" suggests that atomic operations on Fermi’s GTX470 are 12 times slower than on ATI’s HD5870, for instance on shared memory, which is very surprising to me!

Does anyone have any insight or comment on this? Thanks!

I wouldn’t put too much stock in synthetic tests like these. In my experience, Fermi performs pretty well with atomic operations in real applications. I’m not sure why they’re testing shared memory atomics with no contention; why even use atomics in that case?

But what about global atomics, the increment ones? Any comment on that?
(I hope that they are not doing something that the compiler is optimizing away…)
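
One way to rule out the "optimized away" worry is to check the final counter value after the kernel finishes. Below is a minimal sketch of that check, not the Beyond3D test itself; the names `inc_kernel` and `run_global_inc` and the launch sizes are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Every thread increments the same global counter `iters` times.
// atomicInc(addr, limit) wraps to 0 only when the old value is >= limit,
// so with limit 0xFFFFFFFF it acts as a plain +1 for the counts used here.
__global__ void inc_kernel(unsigned int *counter, int iters) {
    for (int i = 0; i < iters; ++i) {
        atomicInc(counter, 0xFFFFFFFFu);
    }
}

// Hypothetical helper: launches the kernel and returns the final counter.
// If every atomic really executed, the result is blocks * threads * iters.
unsigned int run_global_inc(int blocks, int threads, int iters) {
    unsigned int *d_counter = nullptr;
    unsigned int h_counter = 0;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));
    inc_kernel<<<blocks, threads>>>(d_counter, iters);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}
```

If `run_global_inc(64, 256, 100)` comes back as 64 * 256 * 100, none of the increments were dropped, and any timing wrapped around the launch is measuring real atomic traffic.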

It would be interesting to see whether atomicAdd on shared memory without contention is as fast as a normal add…
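
A rough sketch of how one could measure that: each thread hammers its own shared-memory slot, once with atomicAdd and once with a plain read-modify-write. The kernel name, helper name, and sizes (`add_kernel`, `time_add_ms`, `kThreads`, `kIters`) are my own assumptions, and the plain path goes through a `volatile` pointer so the compiler cannot collapse the loop into a single add.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 256;      // threads per block (arbitrary choice)
constexpr int kIters   = 1 << 16;  // adds per thread (arbitrary choice)

// Each thread owns one shared-memory slot, so no two threads ever touch
// the same address: this is the "no contention" case.
// Must be launched with exactly kThreads threads per block.
template <bool UseAtomic>
__global__ void add_kernel(int *out, int iters) {
    __shared__ int slots[kThreads];
    slots[threadIdx.x] = 0;
    for (int i = 0; i < iters; ++i) {
        if (UseAtomic) {
            atomicAdd(&slots[threadIdx.x], 1);
        } else {
            // volatile forces a real load and store on every iteration
            volatile int *p = &slots[threadIdx.x];
            *p = *p + 1;
        }
    }
    // Write the result out so the work is observable from the host.
    out[blockIdx.x * blockDim.x + threadIdx.x] = slots[threadIdx.x];
}

// Times one launch with CUDA events, after a warm-up launch.
// d_out must hold at least blocks * kThreads ints.
template <bool UseAtomic>
float time_add_ms(int blocks, int *d_out) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    add_kernel<UseAtomic><<<blocks, kThreads>>>(d_out, kIters);  // warm-up
    cudaEventRecord(start);
    add_kernel<UseAtomic><<<blocks, kThreads>>>(d_out, kIters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Comparing `time_add_ms<true>(blocks, d_out)` against `time_add_ms<false>(blocks, d_out)` on the same card gives the ratio in question. The numbers will depend on the GPU generation and compiler version, so treat it as a starting point rather than a definitive benchmark.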

Here is a write-up by someone who thoroughly analyzed atomics performance:

http://strobe.cc/cuda_atomics/cuda_atomics.pdf