I am working with compute capability 1.3 and unfortunately don’t have a Fermi card available.
The main problem is that updating the histogram in DRAM from multiple threads is impossible to do in an efficient and error-proof way.
Pre-Fermi hardware doesn’t have an atomicAdd() that works with doubles, and even if it did, I expect it would be very slow.
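For what it’s worth, a double atomicAdd() can be emulated with the 64-bit atomicCAS() that compute 1.2+ does provide in global memory, roughly like the sketch below, but I’d expect the retry loop to be painfully slow on a heavily contended histogram:

```
// Sketch: emulate atomicAdd() on a double with a 64-bit atomicCAS() retry loop.
// Works in global memory on compute 1.2+, but every collision forces a retry,
// so contention on popular bins makes it crawl.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
```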
One thing I’ve tried is, instead of updating just one histogram, to compute a hash value in the range [0…999], update one of 1000 available histograms, and then sum them all up at the end (see the sketch below). With this solution I get better results (fewer collisions, faster run time), but it still isn’t perfect because there are still collisions.
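Roughly, what I’m doing looks like this (heavily simplified sketch; the bin mapping and the hash are just placeholders for my real ones):

```
#define NUM_BINS   256     // placeholder bin count
#define NUM_COPIES 1000    // number of histogram copies

// Placeholder binning: my real code maps a sample to a bin differently.
__device__ int bin_of(float x)
{
    return ((int)x) & (NUM_BINS - 1);
}

// Each thread hashes itself onto one of NUM_COPIES sub-histograms and updates
// it with a plain (non-atomic) read-modify-write, so occasional collisions
// (lost updates) are still possible, just much rarer.
__global__ void scatter_hist(const float *data, int n,
                             double *hists /* NUM_COPIES * NUM_BINS doubles */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int copy = (i * 2654435761u) % NUM_COPIES;        // cheap hash to spread threads
    hists[copy * NUM_BINS + bin_of(data[i])] += 1.0;  // not atomic
}

// Second pass: sum the NUM_COPIES copies into the final histogram.
__global__ void reduce_hists(const double *hists, double *out)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= NUM_BINS) return;

    double s = 0.0;
    for (int c = 0; c < NUM_COPIES; ++c)
        s += hists[c * NUM_BINS + b];
    out[b] = s;
}
```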
I’d partition into the smallest chunks possible (e.g., 8), since re-reading the data from gmem will likely cost more than what improved occupancy / thread scheduling can make up for.
Do the bins really need to be double precision? That would probably kill performance, as there are no double precision atomics, and no 64-bit atomics in shared memory at all. If float isn’t sufficient (and range isn’t the problem), I could imagine that Kahan summation might do.
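For reference, per-thread compensated (Kahan) summation of float values would look something like this (just the accumulation idiom, independent of how the bins are organised):

```
// Kahan (compensated) summation: accumulate floats while carrying the
// low-order bits that a plain float sum would drop.
__device__ float kahan_sum(const float *vals, int n)
{
    float sum = 0.0f;
    float c   = 0.0f;               // running compensation for lost bits
    for (int i = 0; i < n; ++i) {
        float y = vals[i] - c;      // apply the compensation to the next term
        float t = sum + y;          // low-order bits of y may be lost here
        c = (t - sum) - y;          // recover exactly what was lost
        sum = t;
    }
    return sum;
}
```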