Update of a 2x2 region of a grid

Here is a problem I have come across, and while I have a decent solution which works, I was wondering if there is a better way.

Imagine a large grid, for example 1200x2000 (stored in 1D contiguous memory). I essentially need to perform an atomic update (an atomicAdd(), for example) on an adjacent 2x2 region: (x,y), (x+1,y), (x,y+1), (x+1,y+1). The values added to each location in that 2x2 area will all be different, based on some other input values loaded into shared memory.

From input values loaded from an input table I get the base output (x,y) location for a given thread, and then I need to update that region (or the portion of it which is in bounds) with a corresponding set of different values.

Unfortunately it is not feasible to do this in reverse, where I could examine a pixel location in the output buffer and then derive “what values go here?”.
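
To make the access pattern concrete, this is roughly what each thread does now (a sketch only; the names and the value computation are placeholders):

```cpp
// Sketch of the current per-thread update (names hypothetical).
// grid is a 1D row-major buffer of width x height floats.
__global__ void scatter2x2(float* grid, int width, int height,
                           const int2* baseLocs, const float4* vals, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int2   p = baseLocs[i];  // base (x,y) derived from the input table
    float4 v = vals[i];      // four different values for the 2x2 region

    // Update the in-bounds portion of (x,y)..(x+1,y+1).
    if (p.x     < width && p.y     < height) atomicAdd(&grid[ p.y      * width + p.x    ], v.x);
    if (p.x + 1 < width && p.y     < height) atomicAdd(&grid[ p.y      * width + p.x + 1], v.y);
    if (p.x     < width && p.y + 1 < height) atomicAdd(&grid[(p.y + 1) * width + p.x    ], v.z);
    if (p.x + 1 < width && p.y + 1 < height) atomicAdd(&grid[(p.y + 1) * width + p.x + 1], v.w);
}
```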

The input data set can vary quite a bit, and the input data per thread block maps to writes in a rectangular region ranging in size from (0,0) up to, for example, (336,32). Because the upper bound of that write region can be large, using shared memory as a temporary scratch pad is complicated and turned out to be no faster than my current implementation.

Another problem is that the number of memory updates per thread block varies by a large amount, again depending on somewhat random input data that changes in real time.

I have engineered this enough that all the writes performed by a thread block land in the same ‘general’ rectangular region, but those writes are not sufficiently coalesced, because the stride between consecutive base regions can be larger than 2 elements in either direction (x and y).

The current implementation is getting OK memory bandwidth, but there is plenty of room for improvement.

I can’t imagine that I am the first to come across such a problem, so maybe someone can suggest a better approach?

This seems like a tough problem, and having only your description as input, I don’t have any suggestions for improvement.

My usual recommendation when one gets into a tight spot locally while optimizing software (CPU or GPU, doesn’t matter) is to “zoom out” and look at the larger picture of what is to be accomplished. Are there any alternative ways of going about the larger task that allow one to completely avoid the local problem, or at least mitigate it? Such methods may either be known from the literature or freshly invented (patent material!). Sometimes old approaches, discarded and forgotten years ago, may be worth resurrecting in the context of computing on a SIMT platform like CUDA.

Yes, this is one of those situations where I have more control over the less important aspect (the reads of input data), and less control over the more important aspect (the groups of memory updates/writes).

Definitely need to “zoom out” in order to avoid the situation where one “cannot see the forest for the trees”.

What is this - a form of update/scatter, with the possibility of having to update/scatter to the same output destination multiple times?

You seem to have developed a design - one in a set of many - based on the imposed premise of input data (already) in shared memory.
Is it truly mandatory to read the data into shared memory first?
I get the sense that it is limiting other designs that may be able to parallelize the problem more, because the initial read of data into shared memory localizes the data - a case of 1 step forward, 2 steps back.

I would also establish whether the assumptions treated as axioms cannot be removed, one way or another.
Do you know the maximum number of times the same output address is written to?
If this count is low, a hypothetical possibility is to sub-divide the input table into smaller tables, such that each table writes to any given output address only once; the premise I am assuming here is that the order of operations hardly matters (it seems to be merely summations).
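
As an illustration of that subdivision, a host-side sketch (types and names are hypothetical; it assumes the additions commute, so pass order does not matter):

```cpp
// Greedily partition the 2x2 operations into passes such that within a
// pass no output address is touched twice; each pass can then run as a
// kernel without atomics. Hypothetical sketch, not tuned for speed.
#include <vector>
#include <unordered_set>

struct Op { int x, y; /* plus the four values to add */ };

std::vector<std::vector<Op>> makeConflictFreePasses(const std::vector<Op>& ops, int width)
{
    std::vector<std::vector<Op>> passes;
    std::vector<std::unordered_set<long long>> touched;  // addresses used per pass

    for (const Op& op : ops) {
        long long a[4] = { (long long)op.y * width + op.x,
                           (long long)op.y * width + op.x + 1,
                           (long long)(op.y + 1) * width + op.x,
                           (long long)(op.y + 1) * width + op.x + 1 };
        size_t p = 0;
        for (;; ++p) {   // find the first pass with no address conflict
            if (p == passes.size()) { passes.emplace_back(); touched.emplace_back(); }
            bool clash = false;
            for (long long addr : a) clash = clash || touched[p].count(addr) != 0;
            if (!clash) break;
        }
        for (long long addr : a) touched[p].insert(addr);
        passes[p].push_back(op);
    }
    return passes;  // pass count grows with the worst-case address multiplicity
}
```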

Also, if the update part of the update/scatter operation has hardly any inter-dependencies between the 2x2 areas, I would determine whether there would not be benefits in splitting it into 2 separate 2x1 operations, in an attempt to move away from 2D rectangles to 1D lines/intervals.
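
For illustration, a minimal sketch of that split (names hypothetical), assuming the halves are taken along rows so that each half covers two contiguous elements:

```cpp
// One 2x2 operation re-expressed as two row operations. Each half
// touches two contiguous floats, i.e. a 1D interval, which is easier
// to sort and bin than a 2D rectangle.
struct RowOp { long long addr; float v0, v1; };  // addr of the leftmost element

inline void splitIntoRows(int x, int y, int width,
                          float v00, float v10, float v01, float v11,
                          RowOp& top, RowOp& bottom)
{
    top    = { (long long)y * width + x,       v00, v10 };
    bottom = { (long long)(y + 1) * width + x, v01, v11 };
}
```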

If I understood your description correctly, you would like to manage all the scattered += output operations in shared memory (atomicAdd), while the output locations are data-dependent yet limited to a given region?

If so, there might be an approach worth testing:

→ Load all the input data that will potentially map into a given output region (0,0) → (336,32) into registers (consider multiple column elements per thread and several thread groups → loop unrolling).
→ Next, divide and conquer over the 336x32 output region, i.e. only output to, for example, a 36x48 region at a time.
→ Loop over your input data and atomicAdd into the smem sub-region IF this thread has any register data that should go into that region (use lots and lots of ternary ops…).
→ Output the contents of the 36x48 SMEM tile into global memory.
→ Iterate and move on to the next output region, again looping over the registers,
→ checking whether the data belongs to that region → atomicAdd, and so forth…

If you can avoid register spilling, you should be able to loop over your registers very quickly and determine whether they belong to the current output region; meanwhile, since the register file on the GPU is so large, you can keep a large portion of the input data covering the full output region on chip.
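
A minimal sketch of the idea (tile sizes, the Op layout and all names are placeholders; it assumes the input has already been broken down into 1x1 updates with coordinates relative to the block's output region, and that the region lies fully inside the grid):

```cpp
#define TILE_W 48
#define TILE_H 36
#define OPS_PER_THREAD 8

struct Op { short x, y; float v; };  // 1x1 update, relative to the region origin

__global__ void tiledScatter(float* grid, int gridW,
                             const Op* __restrict__ ops, int nOps,
                             int regionX, int regionY, int regionW, int regionH)
{
    __shared__ float tile[TILE_H][TILE_W];

    // 1) Pull this block's operations into registers.
    Op r[OPS_PER_THREAD];
    #pragma unroll
    for (int k = 0; k < OPS_PER_THREAD; ++k) {
        int i = (blockIdx.x * OPS_PER_THREAD + k) * blockDim.x + threadIdx.x;
        r[k] = (i < nOps) ? ops[i] : Op{ -1, -1, 0.0f };  // x == -1 marks inactive
    }

    // 2) Sweep the output region one shared-memory tile at a time.
    for (int ty = 0; ty < regionH; ty += TILE_H)
    for (int tx = 0; tx < regionW; tx += TILE_W) {
        for (int i = threadIdx.x; i < TILE_W * TILE_H; i += blockDim.x)
            tile[i / TILE_W][i % TILE_W] = 0.0f;  // clear the tile
        __syncthreads();

        // 3) Accumulate every register op that lands in this tile.
        #pragma unroll
        for (int k = 0; k < OPS_PER_THREAD; ++k) {
            int lx = r[k].x - tx, ly = r[k].y - ty;
            if (r[k].x >= 0 && lx >= 0 && lx < TILE_W && ly >= 0 && ly < TILE_H)
                atomicAdd(&tile[ly][lx], r[k].v);
        }
        __syncthreads();

        // 4) Flush the tile to global memory with coalesced accesses.
        for (int i = threadIdx.x; i < TILE_W * TILE_H; i += blockDim.x) {
            int ox = tx + i % TILE_W, oy = ty + i / TILE_W;
            if (ox < regionW && oy < regionH)
                atomicAdd(&grid[(regionY + oy) * gridW + regionX + ox],
                          tile[i / TILE_W][i % TILE_W]);
        }
        __syncthreads();
    }
}
```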

Hope this makes sense.

I believe the constraint would still be the possibility that the same output address may be touched multiple times, whether in shared memory or global memory; i.e. the use of atomics.

When no dependencies apply, a 2x2 operation also equates to 2 1x2 operations, or to 4 1x1 operations.

Assuming that it is sensible to allow the host to assist:

given a batch size equal to the thread block size,
given the input table,
given the output array/matrix:

for each 2x2 operation, broken down into multiple 1x1 operations,
for each corresponding output address,
fill the batch based on the following constraints:
- no output address may occur multiple times in the batch (all output addresses in the batch must be unique)
- the distance between output addresses in the batch should be a minimum

create batches until all (1x1) operations pertaining to output addresses are absorbed

Now, by further dividing the primary (1x1) operations pertaining to output addresses into bins, the problem can be hyper-parallelized - a separate thread block can be run per bin, and multiple thread blocks can be run on the problem simultaneously.

given a number of bins,
given the input table,
given the output array/matrix:

for each 2x2 operation, broken down into multiple 1x1 operations,
for each corresponding output address,
move the 1x1 operation to the array associated with the bin whose boundaries contain that address
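
A host-side sketch of that binning step (names hypothetical; bins are contiguous ranges of output addresses, and one thread block would later be launched per bin):

```cpp
#include <vector>

struct Op1x1 { long long addr; float v; };  // one primary 1x1 update

std::vector<std::vector<Op1x1>> binOps(const std::vector<Op1x1>& ops,
                                       long long gridSize, int numBins)
{
    std::vector<std::vector<Op1x1>> bins(numBins);
    long long binSize = (gridSize + numBins - 1) / numBins;  // addresses per bin
    for (const Op1x1& op : ops)
        bins[op.addr / binSize].push_back(op);   // route op to its bin
    return bins;  // bin i owns addresses [i * binSize, (i + 1) * binSize)
}
```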

Sounds vaguely similar to forward warping of an image with a motion field.

Check out Fast GPU-based image warping and inpainting for frame interpolation | hgpu.org