SDK histogram256 performance on nVIDIA GeForce GTX280

I’ve recently bought a new nVIDIA GeForce GTX280, and I’m really interested in the performance of shared memory and global memory atomic instructions. So I have tested the histogram256 example from the CUDA SDK. This code provides three computation alternatives depending on the GPU compute capability:

  1. SM10: This implements atomic operations in software. It’s intended for devices with compute capability below 1.1.
  2. SM11: This uses atomic instructions on global memory. It requires a device with compute capability 1.1 or higher.
  3. SM12: This uses atomic instructions on shared memory. It requires a device with compute capability above 1.1 (i.e., 1.2 or higher).
When I test histogram256 on the new GTX280, the output is:
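For readers not familiar with the sample, the SM12 approach works roughly like the following sketch. This is my own hedged illustration, not the actual SDK source: the kernel name, launch shape, and grid-stride loop are assumptions, but the core idea (a per-block sub-histogram in shared memory built with atomicAdd, merged into the global histogram at the end) matches what the sample describes:

```cuda
#include <cuda_runtime.h>

// Illustrative sketch of an SM12-style 256-bin histogram kernel
// (hypothetical, not the SDK code). Requires compute capability >= 1.2
// for shared-memory atomicAdd.
__global__ void histogram256_shared(const unsigned char *data,
                                    unsigned int n,
                                    unsigned int *histo)
{
    // One sub-histogram per thread block, kept in shared memory.
    __shared__ unsigned int s_hist[256];

    // Zero the per-block histogram cooperatively.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_hist[i] = 0;
    __syncthreads();

    // Grid-stride loop: each thread counts its elements using
    // shared-memory atomics (cheap compared to global atomics
    // when many threads hit the same bins).
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&s_hist[data[i]], 1u);
    __syncthreads();

    // Merge the block's sub-histogram into the global result
    // with one global atomicAdd per bin per block.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&histo[i], s_hist[i]);
}
```

The intended win is that the millions of per-element increments hit shared memory, and only 256 global atomicAdd calls per block touch device memory.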

Using device 0: GeForce GTX 280
Initializing data…
…allocating CPU memory.
…generating input data
…allocating GPU memory and copying input data
Running CPU histogram…
histogramCPU() time : 83.110001 msec // 1147.484430 MB/sec
Running GPU histogram using SM10…
histogram256_SM10() time : 15.149406 msec // 6295.126615 MB/sec
TEST PASSED
Running GPU histogram using SM11…
histogram256_SM11() time : 15.133781 msec // 6301.626072 MB/sec
TEST PASSED
Running GPU histogram using SM12…
histogram256_SM12() time : 15.879219 msec // 6005.801123 MB/sec
TEST PASSED
Shutting down…

Press ENTER to exit…

As can be seen, the execution time for the third option (which uses shared memory atomic instructions) is slightly worse than the others. Why? I expected a performance improvement from shared memory atomic instructions, but the code actually runs slower with the new capability. What is happening?
Thanks in advance.
Juan