Radiosity, multiple parallel reduction?

Hi everyone!
I’ve got a CUDA-based radiosity application (the old-fashioned hemicube way), and I’m looking for ways to improve its performance. The most expensive part of the code is computing (or rather summing) delta form factors. Basically my problem is this: I’ve got a rendered image that I need to ‘process’. It is read by the kernel as a texture. The ‘pixels’ of this image are triangle indices. Example:

Image (8x8):
0 0 0 4 4 9 9 9
0 0 4 4 4 4 9 9
0 4 4 4 4 4 4 9
4 4 4 4 4 4 4 4
5 5 5 5 4 4 4 4
5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 5
5 5 3 3 3 3 3 3

So this image contains rendered triangles number 0, 4, 9, 5 and 3. Of course, this is a simplified example. In reality the images can be as large as 2048x2048.

Every ‘pixel’ has a corresponding delta-form-factor value. I need to sum all the delta-form-factor values that share the same ‘pixel’ value (triangle index). I do this with atomic operations on global memory (compute capability < 1.2) or with atomics on shared memory (CC >= 1.2). Every thread of the kernel reads one ‘pixel’ and its corresponding delta form factor, and then uses atomicAdd to add this value to the appropriate triangle’s sum (or to a temporary per-block sum in the shared-memory version).
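
A minimal sketch of the scheme described above, to make the question concrete. This is not the actual application code: all names (sumDeltaFFGlobal, sumDeltaFFShared, sumFormFactors) are hypothetical, the inputs are plain global-memory arrays rather than a texture, the sums are assumed to be floats (float atomicAdd needs compute capability 2.0 or higher), and the shared-memory variant assumes that one float per triangle fits into a block's shared memory.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Variant 1: one thread per pixel, atomics directly on global memory.
__global__ void sumDeltaFFGlobal(const int* triIndex, const float* deltaFF,
                                 float* triSum, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels)
        atomicAdd(&triSum[triIndex[i]], deltaFF[i]);
}

// Variant 2: per-block temporary sums in shared memory, flushed once per block.
// Assumption: numTriangles * sizeof(float) fits into shared memory.
__global__ void sumDeltaFFShared(const int* triIndex, const float* deltaFF,
                                 float* triSum, int numPixels, int numTriangles)
{
    extern __shared__ float blockSum[];
    for (int t = threadIdx.x; t < numTriangles; t += blockDim.x)
        blockSum[t] = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels)
        atomicAdd(&blockSum[triIndex[i]], deltaFF[i]);
    __syncthreads();

    // Only triangles this block actually touched are written back.
    for (int t = threadIdx.x; t < numTriangles; t += blockDim.x)
        if (blockSum[t] != 0.0f)
            atomicAdd(&triSum[t], blockSum[t]);
}

// Hypothetical host wrapper: copies data over, runs one variant, returns the sums.
// Error checking is omitted to keep the sketch short.
std::vector<float> sumFormFactors(const std::vector<int>& triIndex,
                                  const std::vector<float>& deltaFF,
                                  int numTriangles, bool useShared)
{
    assert(triIndex.size() == deltaFF.size());
    int n = static_cast<int>(triIndex.size());
    std::vector<float> result(numTriangles, 0.0f);
    if (n == 0 || numTriangles == 0) return result;

    int* dIdx = nullptr; float* dFF = nullptr; float* dSum = nullptr;
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMalloc(&dFF, n * sizeof(float));
    cudaMalloc(&dSum, numTriangles * sizeof(float));
    cudaMemcpy(dIdx, triIndex.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dFF, deltaFF.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dSum, 0, numTriangles * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (useShared)
        sumDeltaFFShared<<<blocks, threads, numTriangles * sizeof(float)>>>(
            dIdx, dFF, dSum, n, numTriangles);
    else
        sumDeltaFFGlobal<<<blocks, threads>>>(dIdx, dFF, dSum, n);

    cudaMemcpy(result.data(), dSum, numTriangles * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIdx); cudaFree(dFF); cudaFree(dSum);
    return result;
}
```

For the 8x8 example image, with every delta form factor set to 1.0, both variants should return the per-triangle pixel counts: 6 for triangle 0, 6 for triangle 3, 24 for triangle 4, 22 for triangle 5 and 6 for triangle 9.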

I’ve got 2 questions:

  1. When I atomicAdd values in shared memory, do I really need atomics? There would be bank conflicts in any case, so wouldn’t those operations be serialized anyway?

  2. Is there some way to avoid atomic operations in this case? They’re obviously a big performance killer here. This looks something like a parallel reduction, but for many sums at once instead of one. Or maybe I should take a completely different approach to this problem?

Any help and ideas are appreciated, thanks :)