Atomic addition for floats in shared memory (application: weighted histogram)

Is there a way to implement atomic addition for floats in shared memory?

I’ve looked at the histogram sample, and it looks like the most significant 5 bits of each bin are used as a thread tag. That works for unsigned ints, but what about floats?
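For anyone who hasn’t seen it, the trick in the SDK sample works roughly like this (paraphrased from Histogram256; the function and variable names here are mine):

```
// Paraphrase of the Histogram256 SDK trick (names are mine, not the SDK's).
// Each thread stamps a 5-bit tag into the top bits of the 32-bit bin, and
// the loop repeats until this thread's write survives, i.e. no other thread
// in the warp clobbered it in the same cycle.
__device__ void addByte(volatile unsigned int *s_Hist, int bin, unsigned int threadTag)
{
    unsigned int count;
    do {
        count = s_Hist[bin] & 0x07FFFFFFU;   // strip the old tag, keep the 27-bit count
        count = threadTag | (count + 1);     // increment and stamp our tag
        s_Hist[bin] = count;
    } while (s_Hist[bin] != count);          // retry if another thread's write won
}
```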

This is for an implementation of a weighted histogram, where each entry is multiplied by a float weighting factor.
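For concreteness, here is the serial computation I’m trying to parallelize (binOf and the arrays are placeholders for my actual data):

```
#define NUM_BINS 128

/* Placeholder binning: map a pixel value in [0,1) to a bin index. */
static int binOf(float v) { return (int)(v * NUM_BINS); }

/* Serial reference: every pixel adds its float weight to one bin,
   instead of the count of 1 a plain histogram would add. */
void weightedHistCPU(const float *pixels, const float *weights, int n, float *hist)
{
    for (int i = 0; i < n; ++i)
        hist[binOf(pixels[i])] += weights[i];   /* float accumulation is the crux */
}
```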

Ok, so no responses yet. Here are the approaches I’ve tried so far:

  1. Local histogram per thread: my initial attempt. Obviously slow, because each histogram lives in local memory.

  2. Local histogram per thread in shared memory (sketched below, after the note): about 2x faster, but since my histogram has 128 bins, I can only launch 8 threads per block. (16 threads x 128 bins x 4 bytes per float = 8 KB, half of shared memory, would also fit, but in my experiments it was slightly slower than 8 threads per block.)

Note that in both these cases, each thread was looking up up to 80x80 pixels in a texture. I’m fairly new to CUDA, but from the other forum posts I’ve read, it seems pretty obvious that I should be running many more threads.
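Here is a simplified sketch of approach 2 (the texture fetch is replaced with a plain global read to keep it self-contained, and the names are mine; each block writes a partial histogram that still has to be merged afterwards):

```
#define NUM_BINS 128

// Sketch of approach 2: each thread keeps a private NUM_BINS-float histogram
// in shared memory, so shared usage = blockDim.x * NUM_BINS * 4 bytes
// (8 threads * 128 bins * 4 B = 4 KB). Launch with that much dynamic shared mem.
__global__ void perThreadHistKernel(const float *pixels, const float *weights,
                                    int pixelsPerThread, float *d_PartialHist)
{
    extern __shared__ float s_Hist[];                 // blockDim.x * NUM_BINS floats
    float *myHist = s_Hist + threadIdx.x * NUM_BINS;  // this thread's private copy

    for (int b = 0; b < NUM_BINS; ++b)
        myHist[b] = 0.0f;

    // Placeholder binning; assumes the input arrays hold
    // gridDim.x * blockDim.x * pixelsPerThread entries in [0,1).
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * pixelsPerThread;
    for (int i = 0; i < pixelsPerThread; ++i) {
        int bin = (int)(pixels[base + i] * NUM_BINS);
        myHist[bin] += weights[base + i];             // no conflict: private histogram
    }
    __syncthreads();

    // Reduce the per-thread histograms; with 8 threads each handles 16 bins.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        float sum = 0.0f;
        for (int t = 0; t < blockDim.x; ++t)
            sum += s_Hist[t * NUM_BINS + b];
        d_PartialHist[blockIdx.x * NUM_BINS + b] = sum;   // per-block partial result
    }
}
```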

But I don’t see how I can have multiple threads safely incrementing the same histogram in shared memory. Do you guys have any suggestions? I’ve kind of hit a wall on this :(

I was thinking about using the “software” atomic operation from the Histogram256 example in the SDK, but the thread tag costs 5 bits. A float spends 1 bit on the sign and 8 on the exponent, so the tag would have to come out of the 23-bit fraction, leaving 18 bits of precision. Would this be on the right track / the only way to go? Any suggestions would be much appreciated!
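To make the idea concrete, here is the kind of thing I have in mind: keep the 5-bit tag in the low mantissa bits (the top bits hold the sign and exponent, so they can’t be sacrificed the way the SDK sample does), at the cost of rounding each bin to 18 fraction bits. Completely untested:

```
// Idea sketch (untested): tagged float add in shared memory, Histogram256-style,
// with the 5-bit thread tag in the LOW 5 mantissa bits. Sign (1) and exponent (8)
// stay intact; the fraction loses 5 of its 23 bits, leaving 18.
#define TAG_MASK 0x0000001FU   // low 5 bits hold the tag

__device__ void taggedFloatAdd(volatile unsigned int *entry, float w,
                               unsigned int threadTag /* 0..31 */)
{
    unsigned int packed;
    do {
        // strip the tag and reinterpret the remaining bits as a float
        float val = __uint_as_float(*entry & ~TAG_MASK);
        // add the weight, truncate back to 18 fraction bits, stamp our tag
        packed = (__float_as_uint(val + w) & ~TAG_MASK) | threadTag;
        *entry = packed;
    } while (*entry != packed);   // retry if another thread's write won
}
```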