Help, Efficient Local Reductions Of Images

Hi All,

we have a problem that we can’t figure out how to map efficiently to a CUDA device, we’ve spent some time on it and it is painful, and were wondering if anyone knew a way to solve this efficiently.

We want to find the sum of each N by N block in a 2D image, where the block size is variable, typically from 6x6 to 200x200. We do this by reducing a large image into a smaller image, where each output pixel holds the sum of a block in the input image.

We’ve tried a few approaches to this and don’t get anywhere near the performance of a straightforward array reduction. So for a 1083x1083x3 float image (about 4 million value) with a block size of 100x100, our current approach on a Geforce 330M computes in 8 ms, for an image block size of 25x25 we compute in 22ms, for 6x6 it takes 220ms!

Now the total compute and memory traffic should roughly be the same regardless of image block size, which it obviously isn’t. It should also be roughly comparable to the reduction array example shipped with the SDK, which can reduce a 4 million element array in around 0.8ms on the same hardware. The problem obviously different to a simple array reduction, as you have to compute image addresses and reduce each image block into a separate location.

The problem is how to map the reduction to the hardware. It probably needs several approaches, for small block sizes, a straight forward mapping of each thread to an output pixel and iterating over the input image block seems to work best.

Once the image block size gets bigger however, we allocate a CUDA thread block to each image block and reduce within that thread block. So for a 100x100 image block (10K pixels), we have a 512 thread block do a fairly standard reduction into shared memory then into the output pixel.

As the block size gets smaller this starts falling apart, as we end up having to schedule more thread blocks to cover the increased image blocks and there is not enough work to keep the threads busy and we hit coalescing issues as well. So with 25x25 image blocks the same algorithm takes about 22ms, and with 6x6 image blocks takes 220ms.

I’ve been trying to think of better ways to allocate threads blocks to image blocks without causing problems for coalescing or inter thread block communication, and I can’t. Someone else must have solved this problem, are there any suggestions?



What you write seems pretty sensible to me. I guess it’s all down to micro-optimizing the resulting kernels for different blocksizes then. One of the more important issues probably is how the image is laid out in memory - it would be advantageous if the three color components were in three different arrays as opposed to in float3 values.

One thing that might be worth looking into is using a texture. This would not only provide caching to alleviate coalescing problems for non-power-of-two blocksizes, but a texture fetch also allows to sum four adjacent pixels on dedicated hardware at no extra cost (by fetching the interpolated value exactly halfway between the four points).

I would do this with a summed-area table (integral image). Pre-integrate the image using scan (use CUDPP or Thrust), and then you can compute the sum over any block using just 4 lookups. There’s an example of this in CUDPP: