Hi All,
We have a problem that we can't figure out how to map efficiently to a CUDA device. We've spent some time on it and it is painful, and we're wondering if anyone knows a way to solve it efficiently.
We want to find the sum of each N by N block in a 2D image, where the block size is variable, typically from 6x6 to 200x200. We do this by reducing a large image into a smaller image, where each output pixel holds the sum of a block in the input image.
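To pin down exactly what we're computing, here's a plain-C CPU reference of the operation (single channel for simplicity; the names `block_sum_ref`, `bs`, etc. are just for illustration, not from our actual code, and dimensions are assumed to divide evenly):

```c
#include <stddef.h>

/* CPU reference: out is (in_w/bs) x (in_h/bs); each output pixel
 * holds the sum of one bs x bs block of the input image. */
void block_sum_ref(const float *in, float *out,
                   int in_w, int in_h, int bs)
{
    int out_w = in_w / bs, out_h = in_h / bs;
    for (int oy = 0; oy < out_h; ++oy)
        for (int ox = 0; ox < out_w; ++ox) {
            float sum = 0.0f;
            /* Sum the bs x bs input block for this output pixel. */
            for (int y = 0; y < bs; ++y)
                for (int x = 0; x < bs; ++x)
                    sum += in[(oy * bs + y) * in_w + (ox * bs + x)];
            out[oy * out_w + ox] = sum;
        }
}
```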
We’ve tried a few approaches to this and don’t get anywhere near the performance of a straightforward array reduction. For a 1083x1083x3 float image (about 4 million values) with a block size of 100x100, our current approach on a GeForce 330M computes in 8ms; for a block size of 25x25 it takes 22ms; for 6x6 it takes 220ms!
Now the total compute and memory traffic should be roughly the same regardless of image block size, which it obviously isn’t. It should also be roughly comparable to the reduction example shipped with the SDK, which can reduce a 4 million element array in around 0.8ms on the same hardware. The problem is obviously different from a simple array reduction, as you have to compute image addresses and reduce each image block into a separate output location.
The problem is how to map the reduction to the hardware. It probably needs several approaches: for small block sizes, a straightforward mapping of each thread to an output pixel, iterating over the input image block, seems to work best.
Once the image block size gets bigger, however, we allocate a CUDA thread block to each image block and reduce within that thread block. So for a 100x100 image block (10K pixels), we have a 512-thread block do a fairly standard reduction into shared memory and then into the output pixel.
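In CPU-model form, the per-image-block reduction we do looks roughly like this (a sketch, not our actual kernel: `T` stands in for the 512 threads, the outer `tid` loops model what the threads do in parallel, and each pass of the tree phase corresponds to one `__syncthreads()`-separated step in shared memory on the device):

```c
#include <stddef.h>

#define T 512  /* models the 512 threads in one CUDA block */

/* CPU model of one thread block reducing the bs x bs image block
 * at block coordinates (bx, by) of a single-channel image. */
float block_reduce_model(const float *in, int in_w,
                         int bx, int by, int bs)
{
    float partial[T];  /* models the shared-memory array */
    int n = bs * bs;

    /* Phase 1: "thread" tid accumulates pixels tid, tid+T, ...
     * of its image block into a register. */
    for (int tid = 0; tid < T; ++tid) {
        float sum = 0.0f;
        for (int i = tid; i < n; i += T) {
            int y = i / bs, x = i % bs;
            sum += in[(by * bs + y) * in_w + (bx * bs + x)];
        }
        partial[tid] = sum;
    }

    /* Phase 2: standard pairwise tree reduction of the partials. */
    for (int s = T / 2; s > 0; s >>= 1)
        for (int tid = 0; tid < s; ++tid)
            partial[tid] += partial[tid + s];

    return partial[0];  /* goes to the output pixel for (bx, by) */
}
```

For a 6x6 image block this structure clearly wastes almost the whole thread block: only 36 of the 512 "threads" ever touch a pixel.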
As the block size gets smaller this starts falling apart: we end up having to schedule many more thread blocks to cover the increased number of image blocks, there is not enough work per block to keep the threads busy, and we hit coalescing issues as well. So with 25x25 image blocks the same algorithm takes about 22ms, and with 6x6 image blocks it takes 220ms.
I’ve been trying to think of better ways to allocate thread blocks to image blocks without causing problems for coalescing or inter-thread-block communication, and I can’t come up with one. Someone else must have solved this problem — are there any suggestions?
tx
Bruno