Hello,

I am trying to apply a function called “compute” to each rectangular window of a 2D array called “heights”. It can be thought of as a customized convolution applied to a 2D array. There is no dependency between the calls, so in theory it should be highly parallelizable. How can I make the double for loop in the “run” function run in parallel? Or, equivalently, can I write a kernel to do that?

Either “numpy” or “cupy” can be passed as the “xp” argument, so the code is compatible with both modules.

```
def compute(foot_print, xp):
    """
    foot_print is a 2D array
    """
    x, y = foot_print.shape
    foot_print_max = xp.max(foot_print)
    r = xp.sum(xp.array((foot_print == foot_print_max), dtype=float)) / (x * y)
    return r
```
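As a small worked example of what “compute” returns, here is a sketch using NumPy as “xp”. The sample window values are made up for illustration: the maximum 3 appears in 2 of the 4 cells, so the result is 2 / 4 = 0.5.

```python
import numpy as np


def compute(foot_print, xp):
    """Return the fraction of cells in foot_print equal to its maximum."""
    x, y = foot_print.shape
    foot_print_max = xp.max(foot_print)
    return xp.sum(xp.array(foot_print == foot_print_max, dtype=float)) / (x * y)


# Hypothetical 2x2 window, chosen only for illustration.
window = np.array([[1.0, 3.0],
                   [3.0, 2.0]])

result = compute(window, xp=np)
print(result)  # 0.5
```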

```
def run(heights, window_length, window_width, xp):
    """
    heights is a 2D array
    """
    length, width = heights.shape
    stability_array = xp.zeros_like(heights, dtype=float)
    # ------------------------------
    for r in range(length - window_length + 1):
        for c in range(width - window_width + 1):
            foot_print = heights[r:r + window_length, c:c + window_width]
            stability_array[r, c] = compute(foot_print, xp=xp)
    # ------------------------------
    return stability_array
```
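For comparison, the double loop above could probably be expressed without Python-level iteration. The following is a minimal NumPy-only sketch, not a definitive answer: `run_vectorized` is a hypothetical name, and it assumes NumPy 1.20 or later for `sliding_window_view`. Recent CuPy versions appear to offer a counterpart under `cupy.lib.stride_tricks`, but that is an assumption to verify against the installed version. Note that the comparison step materializes one boolean per window cell, so memory use grows with the window size.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def run_vectorized(heights, window_length, window_width):
    """Hypothetical vectorized counterpart of run(), NumPy only."""
    stability_array = np.zeros_like(heights, dtype=float)
    # windows has shape
    # (length - window_length + 1, width - window_width + 1,
    #  window_length, window_width); it is a read-only view, not a copy.
    windows = sliding_window_view(heights, (window_length, window_width))
    # Maximum of each window, kept broadcastable against windows.
    window_max = windows.max(axis=(2, 3), keepdims=True)
    # Fraction of cells in each window that equal that window's maximum.
    fraction = (windows == window_max).mean(axis=(2, 3))
    # Positions where a full window does not fit stay at zero, as in run().
    stability_array[:fraction.shape[0], :fraction.shape[1]] = fraction
    return stability_array


# Made-up input, only to exercise the function.
rng = np.random.default_rng(0)
heights = rng.integers(0, 4, size=(6, 7)).astype(float)
result = run_vectorized(heights, 3, 2)
print(result.shape)  # (6, 7)
```

If this formulation holds up, all windows are reduced in a few array operations, which is the kind of work a GPU array library can parallelize without a hand-written kernel.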