A good puzzle. Lots of solutions, and it may be hard to choose without testing. You’re right that compute capability 1.2+ devices would solve your problem, since they DO have shared-memory atomics.
If you’re processing 256 bins per block, you could give each of 32 threads its own private 256-bin copy in shared memory, accumulate with no collisions at all, then finally use a parallel reduction to combine the copies into a single 256-bin histogram. This would pretty much max out your shared memory, and may not use your threads effectively.
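A minimal sketch of that per-thread-copy variant, using byte counters so the 32 copies fit in 8 KB of shared memory. Note the assumptions: `computeBin()` is a placeholder for your real binning function, byte counters cap at 255 increments per thread per bin (a real kernel would flush to wider counters before overflow, omitted here), and the reduction gives each thread a strided subset of bins rather than a tree reduction:

```cuda
#define NUM_BINS    256
#define NUM_THREADS 32

// Hypothetical binning function -- substitute your own mapping.
__device__ unsigned int computeBin(unsigned char v) { return v; }

// Launch with NUM_THREADS threads per block.
__global__ void histogram256_private(const unsigned char *data, int n,
                                     unsigned int *blockHist)
{
    // One private row of 256 byte counters per thread: 32*256 = 8 KB.
    __shared__ unsigned char priv[NUM_THREADS][NUM_BINS];

    for (int b = 0; b < NUM_BINS; ++b)
        priv[threadIdx.x][b] = 0;
    __syncthreads();

    // Each thread accumulates into its own row: no write collisions.
    for (int i = threadIdx.x; i < n; i += NUM_THREADS)
        priv[threadIdx.x][computeBin(data[i])]++;
    __syncthreads();

    // Reduce: each thread folds all 32 private rows for its subset of bins.
    for (int bin = threadIdx.x; bin < NUM_BINS; bin += NUM_THREADS) {
        unsigned int sum = 0;
        for (int t = 0; t < NUM_THREADS; ++t)
            sum += priv[t][bin];
        blockHist[blockIdx.x * NUM_BINS + bin] = sum;
    }
}
```

The per-block results would still need a final pass (or global atomics) to merge across blocks.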
How about this? Create two 256 bin arrays in shared memory. The first is your accumulator. The second is a notification array.
Clear the notification array to 0xFFFFFFFF. Syncthreads.
Each thread computes the bin it wants to increment.
A: Each thread writes its unique THREAD ID into the notification array, at the index of the bin it wants to increment in the accumulator.
Each thread checks the notification array. If its own thread ID is there, it increments the accumulator and resets the notifier to 0xFFFFFFFF.
Any thread that did NOT see its own ID goes back to the notify/check loop (labeled A), until all threads have done their increment.
Go back to the start, ready for the next increment.
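The steps above can be sketched as a kernel. This is an illustrative assumption of how the scheme might look, not tested code: `computeBin()` is a placeholder, `blockDim.x == 256` is assumed so each thread owns one slot of the clear/merge steps, and the block-wide "anyone still pending?" vote is one of several ways to keep the `__syncthreads()` calls uniform across the block:

```cuda
#define NUM_BINS  256
#define TAG_EMPTY 0xFFFFFFFFu

// Hypothetical binning function -- substitute your own mapping.
__device__ unsigned int computeBin(unsigned char v) { return v; }

// Launch with blockDim.x == NUM_BINS (256 threads per block).
__global__ void histogram256(const unsigned char *data, int n,
                             unsigned int *globalHist)
{
    __shared__ unsigned int accum[NUM_BINS];   // the accumulator
    __shared__ unsigned int notify[NUM_BINS];  // the notification array
    __shared__ int anyPending;                 // block-wide "not done" flag

    accum[threadIdx.x]  = 0;
    notify[threadIdx.x] = TAG_EMPTY;           // clear to 0xFFFFFFFF
    __syncthreads();

    for (int base = blockIdx.x * blockDim.x; base < n;
         base += gridDim.x * blockDim.x)
    {
        int i = base + threadIdx.x;
        bool pending = (i < n);                // this thread owes one increment
        unsigned int bin = pending ? computeBin(data[i]) : 0;

        do {
            // A: every pending thread writes its thread ID; colliding
            // writes to the same bin leave exactly one surviving ID.
            if (pending)
                notify[bin] = threadIdx.x;
            __syncthreads();

            // Whoever sees its own ID won that bin this round: it
            // increments the accumulator and resets the notifier.
            if (pending && notify[bin] == threadIdx.x) {
                accum[bin] += 1;
                notify[bin] = TAG_EMPTY;
                pending = false;
            }

            // Block-wide vote: loop again while anyone is still pending.
            __syncthreads();
            if (threadIdx.x == 0) anyPending = 0;
            __syncthreads();
            if (pending) anyPending = 1;       // benign race: all write 1
            __syncthreads();
        } while (anyPending);
    }

    // Fold this block's histogram into the global one. Global-memory
    // atomicAdd on 32-bit ints is available on compute capability 1.1.
    atomicAdd(&globalHist[threadIdx.x], accum[threadIdx.x]);
}
```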
This all relies on the fact that when multiple threads write to the same shared memory location, exactly one write is guaranteed to succeed. It doesn’t matter which thread wins, just that everyone agrees on who’s going to increment.
This wouldn’t even be too slow, since collisions will be rare with the notification array spread across 256 slots.
Again, compute capability 1.2 devices can do atomic updates in shared memory… that would work very well for you, though you say you need 1.1.