Giant histogram computation

I’m looking for a way to accumulate values into a large histogram, on the order of 15,000 bins, which is much larger than the ones used in the demo applications.

A further complication is that the histogram needs to accumulate double-precision floating-point values, not just integers.

The serial version of the algorithm looks like this:

std::vector<double> hist(15000);

for (...)
{
    int bin_index = ...; // some calculation
    double value = ...;  // some calculation

    hist[bin_index] += value;
}

What is the best way to do this using CUDA?

I’m no histogram expert, but why would 15K doubles pose a problem?

15K * sizeof(double) is certainly something that you can manage.

Also, as suggested elsewhere on this forum, Fermi’s new features, especially the L1 and L2 caches, might make histogram work faster…

eyal

I am working with compute capability 1.3 and unfortunately don’t have a Fermi card available.

The main problem is that updating the histogram in DRAM from multiple threads is impossible to do in an efficient and error-proof way.

Pre-Fermi hardware doesn’t have an atomicAdd() that works with doubles, and even if it did, I’d expect it to be very slow.
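
I know a double-precision add can be emulated on compute 1.2+ hardware with the 64-bit atomicCAS() that does exist for global memory. Something like the well-known CAS-loop workaround below (the function name is mine), but it serializes badly when many threads hit the same bin, which is exactly what I’m worried about:

__device__ double atomicAddDouble(double* address, double val)
{
    // Emulate a double atomicAdd with 64-bit compare-and-swap in global
    // memory (compute capability 1.2 and up). Loops until the swap sticks.
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}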

One thing I’ve tried: instead of updating a single histogram, each thread computes a random hash value in the range [0…999] and updates one of 1000 histogram copies, which are all summed up at the end. With this solution I get better results (fewer collisions, faster run time), but it’s still not perfect because collisions remain.
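
Roughly like this (just a sketch of the idea; the bins[]/values[] input arrays and the thread-index “hash” are placeholders for my real calculation, and it uses the atomicAddDouble() above so that colliding updates are at least correct, if slow):

#define NUM_BINS   15000
#define NUM_COPIES 1000

// partial holds NUM_COPIES copies of the histogram (about 115 MB of doubles)
// and must be zeroed (e.g. with cudaMemset) before the scatter kernel runs.
__global__ void scatterKernel(const int* bins, const double* values, int n,
                              double* partial)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Cheap spread so that neighbouring threads tend to hit different copies.
    int copy = (blockIdx.x + threadIdx.x) % NUM_COPIES;
    atomicAddDouble(&partial[copy * NUM_BINS + bins[i]], values[i]);
}

// Second kernel: fold the copies back into the final 15000-bin histogram.
__global__ void reduceKernel(const double* partial, double* hist)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= NUM_BINS) return;

    double sum = 0.0;
    for (int c = 0; c < NUM_COPIES; ++c)
        sum += partial[c * NUM_BINS + bin];
    hist[bin] = sum;
}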

What’s the main performance killer? gmem access?

Maybe you can partition the 15K bins into, say, 15 chunks and work with smem, i.e. do the calculation and, based on the bin index (0…1000, 1001…2000, …), decide whether to save the result or disregard it.

At the end of the kernel you’ll have 1K elements in smem (with smem bank conflicts, but much better than the gmem issues), dump them to gmem, and work on the next chunk (1001…2000), and so on (see the sketch below).

does that make sense?
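
Something along these lines; just a sketch. To stay race-free without floating-point atomics in smem on compute 1.3, it uses a smaller chunk than 1K and gives every warp its own private copy (the chunk size, block shape and the bins[]/values[] input arrays are my assumptions), and it reuses the atomicAddDouble() from earlier in the thread for the final dump to gmem:

#define CHUNK_BINS      256  // bins handled per pass (smaller than 1000 so that
                             // several warp-private copies fit in 16 KB of smem)
#define WARPS_PER_BLOCK 4    // launch with blockDim.x = 128

__global__ void chunkHistKernel(const int* bins, const double* values, int n,
                                int chunkStart,  // first bin of this pass
                                double* hist)    // full 15000-bin histogram
{
    // One private sub-histogram per warp: 4 * 256 * 8 bytes = 8 KB.
    __shared__ double warpHist[WARPS_PER_BLOCK][CHUNK_BINS];

    int warp = threadIdx.x >> 5;
    int lane = threadIdx.x & 31;

    // Clear this block's sub-histograms.
    for (int b = threadIdx.x; b < WARPS_PER_BLOCK * CHUNK_BINS; b += blockDim.x)
        ((double*)warpHist)[b] = 0.0;
    __syncthreads();

    // Grid-stride loop over the data; keep only bins that fall in this chunk.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        int bin = bins[i] - chunkStart;
        bool mine = (bin >= 0 && bin < CHUNK_BINS);

        // Serialize updates within the warp: only one lane writes at a time,
        // so no two threads read-modify-write the same smem slot together.
        for (int l = 0; l < 32; ++l)
            if (lane == l && mine)
                warpHist[warp][bin] += values[i];
    }
    __syncthreads();

    // Fold the per-warp copies and add them to the global histogram with the
    // CAS-based atomicAddDouble() from earlier in the thread.
    for (int b = threadIdx.x; b < CHUNK_BINS; b += blockDim.x)
    {
        double sum = 0.0;
        for (int w = 0; w < WARPS_PER_BLOCK; ++w)
            sum += warpHist[w][b];
        atomicAddDouble(&hist[chunkStart + b], sum);
    }
}

The host launches it once per chunk, bumping chunkStart by CHUNK_BINS each time, so at this chunk size the input gets re-read about 59 times; bigger chunks mean fewer passes over the data but fewer private copies per block.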

I’d partition into the minimum number of chunks possible (e.g., 8 chunks of ~1875 bins, which at 8 bytes per double is roughly 15 KB of the 16 KB of smem), since re-reading the data from gmem once per chunk will likely cost more than what the improved occupancy / thread scheduling of smaller chunks can make up for.

Do the bins really need to be double precision? That would probably kill performance, as there are no double-precision atomics, and no 64-bit atomics in shared memory at all. If float on its own isn’t sufficient (and range isn’t a problem), I could imagine that Kahan summation might do.
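
Roughly like this, applied to the serial loop from the first post (just a sketch; note that the hist/comp pair is a two-word update, so on the GPU it would have to live in per-thread or per-warp private storage rather than be updated through atomics):

// Compensated (Kahan) accumulation into float bins: comp[bin] carries the
// low-order bits lost by each addition, so the float sums stay close to
// double-precision accuracy over long accumulations.
__host__ __device__ inline void kahanAdd(float* hist, float* comp,
                                         int bin, float value)
{
    float y = value - comp[bin];          // re-inject previously lost bits
    float t = hist[bin] + y;              // big + small: low bits of y are lost
    comp[bin] = (t - hist[bin]) - y;      // recover exactly what was lost
    hist[bin] = t;                        // new running sum
}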