Custom convolution applied to 2D array in CuPy / parallelize double for loop

Hello,

I am trying to apply a function called “compute” to each rectangular window of a 2D array called “heights”. It can be thought of as a customized convolution applied to a 2D array. There is NO dependency between the calls, so in theory it should be highly parallelizable. How can I make the double for loop in the run function run in parallel? Or, equivalently, can I write a kernel to do that?
We can pass “numpy” or “cupy” as the “xp” argument so that the code is compatible with both modules.

def compute(foot_print, xp):
    """
    foot_print is a 2D array
    """
    x, y = foot_print.shape
    foot_print_max = xp.max(foot_print)
    r = xp.sum(xp.array((foot_print == foot_print_max), dtype=float)) / (x * y)
    return r


def run(heights, window_length, window_width, xp):
    """
    heights is a 2D array
    """
    length, width = heights.shape
    stability_array = xp.zeros_like(heights, dtype=float)
    # ------------------------------
    for r in range(length - window_length + 1):
        for c in range(width - window_width + 1):
            foot_print = heights[r:r + window_length, c:c + window_width]
            stability_array[r, c] = compute(foot_print, xp=xp)
    # ------------------------------
    return stability_array

You should be able to get started by putting each series of operations in CuPy Streams (see “Basics of CuPy” in the CuPy 12.2.0 documentation). Remember that operations within a stream run serially, while multiple streams can run in parallel, assuming enough resources are available (e.g., shared memory, registers, number of concurrent streams).

In its current form, your compute function launches at least 5 different kernels. I think what you are trying to do can be accomplished with CuPy’s Elementwise functionality, and you should see good speedups. You can find several examples here.
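Before writing a custom kernel, it may be worth trying to remove the Python loops with a sliding-window view, so that the max, comparison and average each run once over all windows instead of once per window. This is not the Elementwise approach, just a minimal sketch of an alternative. The name run_vectorized is made up for illustration, and it assumes that the array module exposes lib.stride_tricks.sliding_window_view (NumPy 1.20+ does; recent CuPy releases appear to offer the same function, but please check your version).

```python
import numpy as np


def run_vectorized(heights, window_length, window_width, xp=np):
    """Loop-free variant of run(); heights is a 2D array.

    Hypothetical helper, not part of NumPy or CuPy.
    Assumes xp.lib.stride_tricks.sliding_window_view exists.
    """
    # windows[r, c] is a view of heights[r:r + window_length, c:c + window_width]
    windows = xp.lib.stride_tricks.sliding_window_view(
        heights, (window_length, window_width)
    )
    # Maximum of every window, kept 4D so it broadcasts against windows
    window_max = windows.max(axis=(2, 3), keepdims=True)
    # Fraction of cells in each window that are equal to that window's maximum
    fraction = (windows == window_max).mean(axis=(2, 3))
    # Same output layout as run(): zeros where no full window fits
    stability_array = xp.zeros(heights.shape, dtype=float)
    out_h, out_w = fraction.shape
    stability_array[:out_h, :out_w] = fraction
    return stability_array
```

The comparison creates a temporary boolean array with one entry per window cell, so memory use grows with the window size. For large windows a custom kernel is likely the better fit.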

Also, you might submit an issue on CuPy’s GitHub for more help.

Thanks for your response and for pointing out the stream.

But regarding CuPy’s Elementwise: the input to the compute function is a 2D array, not a single number, so how can Elementwise help? Even if I pass only the top-left corner and the shape of that 2D array, how can I access all of its elements inside the compute function?
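One possible way to do this, as far as I understand the ElementwiseKernel interface: declare the input as a raw argument, so that the kernel body can index it freely, and let the output drive the loop. Inside the body the special variable i is the flat index of the output element, which can be turned back into the top-left corner of a window. The sketch below is untested on real hardware here and rests on assumptions: the names run_elementwise, IN_PARAMS and KERNEL_BODY are made up, heights is a CuPy array, and the window fits inside it.

```python
import functools

# Hypothetical names; only cupy.ElementwiseKernel itself is CuPy API.
IN_PARAMS = "raw T heights, int32 width, int32 win_len, int32 win_wid, int32 out_w"
KERNEL_BODY = """
    int r = i / out_w;                 // top-left row of this window
    int c = i % out_w;                 // top-left column of this window
    T best = heights[r * width + c];   // running maximum of the window
    int count = 0;                     // cells equal to the running maximum
    for (int dr = 0; dr < win_len; ++dr) {
        for (int dc = 0; dc < win_wid; ++dc) {
            T v = heights[(r + dr) * width + (c + dc)];
            if (v > best) { best = v; count = 1; }
            else if (v == best) { ++count; }
        }
    }
    out = (double)count / (win_len * win_wid);
"""


@functools.lru_cache(maxsize=None)
def _get_kernel():
    import cupy as cp  # imported lazily so this file loads without a GPU
    return cp.ElementwiseKernel(
        IN_PARAMS, "float64 out", KERNEL_BODY, "window_max_fraction"
    )


def run_elementwise(heights, window_length, window_width):
    """One GPU thread per window; heights is a 2D CuPy array."""
    import cupy as cp
    heights = cp.ascontiguousarray(heights)  # flat indexing assumes C order
    length, width = heights.shape
    out_h = length - window_length + 1
    out_w = width - window_width + 1
    out = cp.empty((out_h, out_w), dtype=cp.float64)
    # The shape of `out` decides how many times the kernel body runs
    _get_kernel()(heights, width, window_length, window_width, out_w, out)
    # Same output layout as run(): zeros where no full window fits
    stability_array = cp.zeros(heights.shape, dtype=float)
    stability_array[:out_h, :out_w] = out
    return stability_array
```

Each thread reads its whole window straight from heights, so the whole computation is a single kernel launch rather than several launches per window.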