Ok, so no responses yet. In the meantime, here are the approaches I've tried:
Local histogram per thread: my initial attempt, obviously slow because each histogram is stored in (off-chip) local memory.
Local histogram per thread in shared memory: about 2x faster, but since my histogram has 128 bins, I can only launch 8 threads per block (16 threads x 128 bins x 4 bytes per float would still fit in half of shared memory, but in my experiments that was slightly slower than 8 threads per block).
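For context, the shared-memory version looks roughly like this (a simplified sketch; the kernel name, the binning step, and the use of float bins to match my 4-bytes-per-float figure are just illustrative):

```cuda
#define NUM_BINS 128
#define THREADS_PER_BLOCK 8  // 8 threads x 128 bins x 4 bytes = 4 KB shared memory

__global__ void perThreadHistKernel(const float *pixels, int pixelsPerThread,
                                    float *blockHists)
{
    // One private 128-bin histogram per thread, laid out side by side.
    __shared__ float s_hist[THREADS_PER_BLOCK * NUM_BINS];
    float *myHist = &s_hist[threadIdx.x * NUM_BINS];

    for (int i = 0; i < NUM_BINS; i++)
        myHist[i] = 0.0f;

    // Each thread walks its own region (up to 80x80 pixels in my case).
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * pixelsPerThread;
    for (int i = 0; i < pixelsPerThread; i++) {
        int bin = ((int)pixels[base + i]) % NUM_BINS;  // stand-in for the real binning
        myHist[bin] += 1.0f;  // no conflicts: this histogram is private to the thread
    }

    __syncthreads();
    // ... reduce the per-thread histograms and write out the block result ...
}
```

The private-per-thread layout is what caps me at 8 threads: every extra thread costs another 128 bins of shared memory.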
Note that in both these cases, each thread was looking up up to 80x80 pixels in a texture. I'm fairly new to CUDA, but from the other forum posts I've read, it seems pretty obvious that I should be launching many more threads.
But I don't think multiple threads can safely increment the same histogram in shared memory. Do you guys have any suggestions? I've kind of hit a wall on this :(
I was thinking about using the "software" atomic operation from the Histogram256 example in the SDK, but it costs 5 bits for the thread tag. Since floats use 1 sign bit and 8 bits for the exponent, the remaining 27 bits would leave only 18 bits for the fraction. Would this be on the right track / the only way to go? Any suggestions would be much appreciated!
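As I understand it from the SDK sample, the software atomic works on integer counts by reserving the top 5 bits for a per-thread tag and retrying until the tagged write survives. Roughly (my reconstruction, not verbatim SDK code):

```cuda
#define TAG_MASK 0x07FFFFFFU  // low 27 bits hold the count; top 5 bits hold the tag

// Software atomic increment in the style of the Histogram256 SDK sample:
// keep re-writing until our tagged value is what actually landed in the bin,
// i.e. no other thread in the warp clobbered it in the same step.
__device__ void addByte(volatile unsigned int *s_warpHist,
                        unsigned int bin, unsigned int threadTag)
{
    unsigned int count;
    do {
        count = s_warpHist[bin] & TAG_MASK;  // strip the previous writer's tag
        count = threadTag | (count + 1);     // increment and re-tag with ours
        s_warpHist[bin] = count;
    } while (s_warpHist[bin] != count);      // retry if another thread won the race
}
```

My worry above is what happens when the 32-bit word is a float accumulator instead of a count: stealing the top 5 bits for the tag eats into the float's sign/exponent/fraction layout.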