Computing 'local-box' statistics for 3D images

Suppose I have a raw 3D grayscale “image” (i.e. all voxels stored consecutively in memory), and I want to compute, for each voxel, some statistics over a fixed-size cube/box surrounding that voxel: min, max, average, median, quartiles, etc. (not necessarily all at once).
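To pin down what is being computed, here is a minimal CPU-side sketch in plain C++ for a single voxel. It is not a GPU kernel and not tuned in any way. The names `BoxStats` and `box_stats` are hypothetical, and it assumes 8-bit voxels, an x-fastest memory layout, and a box that is simply clipped at the volume borders; a real use case may differ on all three.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: the statistics gathered for one voxel's box.
struct BoxStats {
    std::uint8_t min;
    std::uint8_t max;
    float mean;
    float median;  // upper median when the box holds an even number of voxels
};

// Hypothetical helper: statistics over the cube of half-width r centred on
// (x, y, z). Assumes vol holds nx*ny*nz voxels with x varying fastest, and
// clips the cube to the volume instead of padding.
BoxStats box_stats(const std::vector<std::uint8_t>& vol,
                   int nx, int ny, int nz,
                   int x, int y, int z, int r)
{
    std::vector<std::uint8_t> window;
    for (int k = std::max(z - r, 0); k <= std::min(z + r, nz - 1); ++k)
        for (int j = std::max(y - r, 0); j <= std::min(y + r, ny - 1); ++j)
            for (int i = std::max(x - r, 0); i <= std::min(x + r, nx - 1); ++i)
                window.push_back(vol[(static_cast<std::size_t>(k) * ny + j) * nx + i]);

    // Sorting gives min, max and any order statistic (median, quartiles).
    std::sort(window.begin(), window.end());

    unsigned long sum = 0;
    for (std::uint8_t v : window) sum += v;

    BoxStats s;
    s.min = window.front();
    s.max = window.back();
    s.mean = static_cast<float>(sum) / static_cast<float>(window.size());
    s.median = static_cast<float>(window[window.size() / 2]);
    return s;
}
```

Calling `box_stats` once per voxel gives the full output volume. Doing that naively repeats a lot of work between neighbouring voxels, which is exactly the part I would hope an existing library already handles well.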

Obviously I can write something like this myself and optimize it for my specific scenario and parameters, but I’d rather not reinvent the wheel.

So my question is: Are there libraries with kernels which do this kind of work?

Note: This isn’t quite image processing, since these aren’t images in the usual sense, but perhaps the world of image processing software, which I know little about, has something like this? Any pointers would be welcome.

Are you using CUDA or OpenCL or some other API to do this?

@nikhilj: I’m asking about CUDA, although if there’s something that’s only available in OpenCL, that could be relevant too.