I wrote a CUDA program which takes an arbitrarily sized RGBA image and copies it into a smaller image. While it does this, it must apply a sort of “bilinear filter” to average the pixels. The operation is quite simple:
1. Determine the scale (ratio) of the src image to the dst image (e.g. 8x8 to 4x4 is a scale of 2).
2. For each source pixel, divide by scale^2 and add the result to the destination pixel. So in the above example we’d be taking 4 source pixels, scaling each one by 0.25, and adding them together to produce the result pixel.
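Written out as a formula, this is my own shorthand rather than anything from the code: S is the source image, D the destination, and s the scale, assumed here to be an integer.

```latex
D(x, y) = \frac{1}{s^{2}} \sum_{j=0}^{s-1} \sum_{i=0}^{s-1} S(s x + i,\; s y + j)
```

For s = 2, D(0, 0) is the sum of S(0, 0), S(1, 0), S(0, 1) and S(1, 1), each multiplied by 0.25.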
I’m running one thread per source pixel. The problem, of course, is that this is not thread safe at all: many threads are reading from and writing to the same destination pixel. It’s also going to be terrible for memory coalescing.
I think what I need to do is allocate some shared memory and break this up into two passes. I’d like some feedback on what a good approach would be. I’m almost thinking I should rewrite it so that I run one thread per destination pixel and do a gather operation instead. This won’t be well coalesced either, though.
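To make the gather idea concrete, here is a minimal sketch of what I have in mind. It assumes an integer scale, source dimensions that are exact multiples of that scale, and pixels stored as `uchar4`. The names `downscaleGather` and `downscale` are made up for illustration, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per destination pixel. Each thread averages the scale x scale
// block of source pixels that maps onto it, so no two threads ever write
// the same output location.
__global__ void downscaleGather(const uchar4* src, uchar4* dst,
                                int srcWidth, int dstWidth, int dstHeight,
                                int scale)
{
    int dx = blockIdx.x * blockDim.x + threadIdx.x;
    int dy = blockIdx.y * blockDim.y + threadIdx.y;
    if (dx >= dstWidth || dy >= dstHeight) return;

    // Accumulate in float to avoid 8-bit overflow.
    float4 sum = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int j = 0; j < scale; ++j) {
        for (int i = 0; i < scale; ++i) {
            uchar4 p = src[(dy * scale + j) * srcWidth + (dx * scale + i)];
            sum.x += p.x; sum.y += p.y; sum.z += p.z; sum.w += p.w;
        }
    }

    // Divide by scale^2 and round to nearest.
    float inv = 1.0f / (float)(scale * scale);
    dst[dy * dstWidth + dx] = make_uchar4(
        (unsigned char)(sum.x * inv + 0.5f),
        (unsigned char)(sum.y * inv + 0.5f),
        (unsigned char)(sum.z * inv + 0.5f),
        (unsigned char)(sum.w * inv + 0.5f));
}

// Hypothetical host wrapper: copies the image to the device, launches the
// kernel, and copies the result back. CUDA return codes are not checked.
std::vector<uchar4> downscale(const std::vector<uchar4>& src,
                              int srcWidth, int srcHeight, int scale)
{
    int dstWidth = srcWidth / scale;
    int dstHeight = srcHeight / scale;
    std::vector<uchar4> dst(dstWidth * dstHeight);

    uchar4* dSrc = nullptr;
    uchar4* dDst = nullptr;
    cudaMalloc(&dSrc, src.size() * sizeof(uchar4));
    cudaMalloc(&dDst, dst.size() * sizeof(uchar4));
    cudaMemcpy(dSrc, src.data(), src.size() * sizeof(uchar4),
               cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((dstWidth + block.x - 1) / block.x,
              (dstHeight + block.y - 1) / block.y);
    downscaleGather<<<grid, block>>>(dSrc, dDst, srcWidth,
                                     dstWidth, dstHeight, scale);

    cudaMemcpy(dst.data(), dDst, dst.size() * sizeof(uchar4),
               cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    return dst;
}
```

The writes here are coalesced, since neighbouring threads write neighbouring destination pixels. The reads are strided by `scale` within a single loop iteration, but over the whole inner loop the threads in a row together cover a contiguous run of source pixels, so I would expect the cache to absorb some of the cost. I have not measured this.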