Python Image Transformation Acceleration with CUDA

I inherited a Python script that does a proprietary image transformation of a rectangular Mercator map into an image displaying two side-by-side polar views. The code takes about 2 minutes to transform each image and write it out to a file. I’ve created a multiprocessing Pool to spin up 32 parallel transformations on a system with 32 cores and 128 GB of RAM, but with thousands of images it still takes about 1.5 hours to finish each day.

Before: https://www.oceanic.udel.edu/globaldataview/OmniGlobe/NRL/wind_waves/wind_waves_0001.jpg

After: https://www.oceanic.udel.edu/globaldataview/OmniGlobe/NRL/wind_waves/polarpair/wind_waves_0001.jpg

I’m hoping to leverage NVIDIA’s CUDA technology to drastically speed up this process; this will be my first attempt at doing anything with Python and CUDA.

Note that the output image is 4 pixels narrower than the input image (a requirement of the hardware that will be displaying it) so a 4000x2000 input image results in a 3996x2000 output polar pair image.

Pertinent Python code follows - any assistance in a direction to take would be much appreciated.

@numba.jit
def findSrcX(x, y, height):  # x/y are signed offsets from the pole at (0, 0); returns the x position in the source image
    if (x == 0):
        if (y >= 0):
            theta = np.pi / 2
        if (y < 0):
            theta = 3 * np.pi / 2
    elif (x < 0):
        theta = np.arctan(float(y)/float(x)) + np.pi
    elif (y < 0):
        theta = np.arctan(float(y)/float(x)) + 2 * np.pi
    else:
        theta = np.arctan(float(y)/float(x))
    srcX = theta * height / np.pi
    return srcX

@numba.jit
def findSrcY(x,y,height,outputHeight):
    return  np.sqrt(x**2 + y**2) * 2/outputHeight * height/2

@numba.jit
def findSrcXY(x,y,height,outputHeight):
    x = x - outputHeight / 2
    y = y - outputHeight / 2
    return (findSrcX(x,y,height), findSrcY(x,y,height,outputHeight))

@numba.jit
def EquirectangularToPolar(src_filename, output_directory=None, outputHeight=0, match=False):
    print("Processing " + src_filename)
    srcimg = Image.open(src_filename)
    width, height = srcimg.size
    if (width != 2 * height):
        LOG.warning(src_filename + " is not 2:1")
        return
    if match:
        outputHeight = height
        outputWidth  = width - 4
    else:
        outputWidth = outputHeight * 2 - 4

    LOG.debug("Creating canvas")
    polarimg = Image.new('RGB', (outputWidth, outputHeight))
    LOG.debug("Painting north pole onto canvas")
    for x, y in np.ndindex((outputHeight, outputHeight)):
        srcX, srcY = findSrcXY(x, y, height, outputHeight)
        # getpixel wants integer coordinates; truncate the float mapping
        polarimg.putpixel((outputHeight - x, y), srcimg.getpixel((int(srcX), int(srcY))))
    LOG.debug("Flipping source image")
    srcimg = srcimg.transpose(Image.FLIP_TOP_BOTTOM)  # to read south pole
    LOG.debug("Painting south pole onto canvas")
    for x, y in np.ndindex((outputHeight, outputHeight)):
        srcX, srcY = findSrcXY(x, y, height, outputHeight)
        polarimg.putpixel((x + outputHeight - 4, y), srcimg.getpixel((int(srcX), int(srcY))))

A few comments:

  1. I think this should be readily doable with numba CUDA
  2. It’s not obvious to me there will be a huge speedup. It doesn’t look like very high arithmetic intensity per pixel, but it’s not super low, either, so it may be worth a test
  3. It seems like the mapping can be precomputed for a given (input) image size? By that I mean that if you have a 4000x2000 input image, then for every point (x,y) you should be able to compute mapping indices (x’, y’) once for that image size, then for each image of that size, use the array of mapping indices to do an indexed copy from input to output. Does that sound right? Are you generally processing images of a fixed size? If that sort of precomputation of the mapping indices is possible, then I definitely would follow that approach and leave this as regular numba (host) code, and not bother with a CUDA conversion.
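As a rough sketch of that indexed-copy idea (the function names, the `% (2*pi)` range handling, and the clamping are my assumptions, not your code): build the index arrays once per image size, then each image of that size needs only a fancy-indexed copy, which is essentially a memory operation.

```python
import numpy as np

def build_map(height, outputHeight):
    """Precompute (srcY, srcX) source indices for one polar view.
    Runs once per (input height, output height) combination."""
    # grid of output coordinates, shifted so (0, 0) is the pole
    y, x = np.mgrid[0:outputHeight, 0:outputHeight]
    xc = x - outputHeight / 2
    yc = y - outputHeight / 2
    theta = np.arctan2(yc, xc) % (2 * np.pi)      # angle in [0, 2*pi)
    srcX = (theta * height / np.pi).astype(np.intp)
    r = np.sqrt(xc**2 + yc**2)
    srcY = (r * height / outputHeight).astype(np.intp)
    # clamp so the indexed copy can never read out of bounds
    srcX = np.clip(srcX, 0, 2 * height - 1)
    srcY = np.clip(srcY, 0, height - 1)
    return srcY, srcX

def apply_map(src, srcY, srcX):
    """Indexed copy: src is an (H, W, 3) array, result is one polar view."""
    return src[srcY, srcX]
```

For a 4000x2000 input, `build_map(2000, 2000)` runs once; from then on every image of that size is just `apply_map` on its pixel array.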

The use of np.arctan here is a red flag. If you look closely I think you’ll find that np.arctan2 is the more appropriate tool for the job, as it returns the proper quadrant right away (−π ≤ arctan2(y, x) ≤ π), leading to clearer and faster code.

  1. The image sizes may change - some users report converting 2400x1200 images, for example. That said, my own use is strictly 4000x2000 images at the moment.

It had dawned on me that I could pre-compute the to/from pixel transformation once, store it in an array saved out as a transform file, and load that file at runtime, reverting to the per-pixel calculations only when no such transform file exists yet. Would an array mapping of all the to/from pixel locations benefit from being parallelized across a slew of CUDA cores?
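A minimal sketch of that cache-the-transform idea (the file naming, cache directory, and `builder` callback are my own assumptions): key the cache file on the input and output sizes, and only compute the mapping on a cache miss.

```python
import os
import numpy as np

def load_or_build_map(height, outputHeight, cache_dir=".", builder=None):
    """Load a precomputed (srcY, srcX) index map from disk, building and
    saving it on first use. builder(height, outputHeight) must return the
    two index arrays; it is only called on a cache miss."""
    path = os.path.join(cache_dir, f"polarmap_{height}_{outputHeight}.npz")
    if os.path.exists(path):
        with np.load(path) as f:
            return f["srcY"], f["srcX"]
    srcY, srcX = builder(height, outputHeight)
    np.savez(path, srcY=srcY, srcX=srcX)
    return srcY, srcX
```

Whether building the map itself is worth sending to the GPU is doubtful: it runs once per image size, so even a couple of seconds on the CPU is amortized over thousands of images.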

One other suggestion I had received in the past was to “convert things to vectors” which are supposedly more readily optimized for calculations by Numba?
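I imagine "vectorizing" here would mean replacing the per-pixel putpixel/getpixel loops with whole-array NumPy operations. My own sketch of that, assuming precomputed srcY/srcX index maps of shape (outputHeight, outputHeight) for one polar view (the exact 4-pixel seam handling differs slightly from the putpixel version and off-by-ones are ignored):

```python
import numpy as np

def paint_polar_pair(src, srcY, srcX, outputHeight):
    """Assemble the side-by-side polar pair with whole-array operations
    instead of per-pixel putpixel/getpixel calls. src is the source image
    as an (H, W, 3) array."""
    north = src[srcY, srcX][:, ::-1]   # fancy-indexed copy, then mirror like the original x-flip
    south = src[::-1][srcY, srcX]      # index into the vertically flipped source for the south pole
    out = np.empty((outputHeight, 2 * outputHeight - 4, 3), dtype=src.dtype)
    out[:, :outputHeight] = north          # left (north) view
    out[:, outputHeight - 4:] = south      # right (south) view; overwrites the 4-column overlap
    return out
```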

Thanks for the suggestion - the code I originally inherited used “math.pi” and “math.atan” which I switched to “np.pi” and “np.arctan” in hopes that some optimization would magically occur.

Happy to test and report the results if you can provide an example of what the changes would look like if using “np.arctan2” - I’m assuming it would eliminate some (all?) of the if >= and < 0 checks?

I would expect there will be minimal benefit from using CUDA for this, which is why I said

“If that sort of precomputation of the mapping indices is possible, then I definitely would follow that approach and leave this as regular numba (host) code, and not bother with a CUDA conversion.”

That is my assumption. For what it’s worth, math.atan2 exists as well. Sorry, I am not going to carefully reverse engineer the existing code to find out how exactly it would need to be transformed for np.arctan2. I think that whole chain of if-then-elses boils down to simply theta = np.arctan2(float(y), float(x)). You can set up some quick scaffolding that checks that in isolation.
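For example, scaffolding along these lines (my own sketch; the origin (0, 0) is excluded because the two formulations disagree there, and the `% (2 * np.pi)` shifts arctan2's (−π, π] result onto the old chain's [0, 2π) convention):

```python
import numpy as np

def theta_chain(x, y):
    """The original if/elif chain, reproduced for comparison."""
    if x == 0:
        theta = np.pi / 2 if y >= 0 else 3 * np.pi / 2
    elif x < 0:
        theta = np.arctan(float(y) / float(x)) + np.pi
    elif y < 0:
        theta = np.arctan(float(y) / float(x)) + 2 * np.pi
    else:
        theta = np.arctan(float(y) / float(x))
    return theta

def theta_atan2(x, y):
    """Candidate replacement; the modulo maps arctan2's negative
    angles onto the chain's [0, 2*pi) range."""
    return np.arctan2(float(y), float(x)) % (2 * np.pi)

# spot-check one point in each quadrant and on each axis
for x, y in [(1, 1), (-1, 1), (-1, -1), (1, -1), (0, 1), (0, -1), (1, 0), (-1, 0)]:
    assert abs(theta_chain(x, y) - theta_atan2(x, y)) < 1e-12
```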

You are correct - the entire if/then/elif chain can be replaced by a single np.arctan2 statement:

@numba.jit
def findSrcX(x, y, height):  # x/y are signed offsets from the pole at (0, 0); returns the x position in the source image
    # np.arctan2 returns angles in (-pi, pi], so srcX comes out negative for
    # y < 0, where the old chain added 2*pi. Recent Pillow versions treat a
    # negative getpixel coordinate as counting back from the right edge,
    # which lands on the same pixel; append "% (2 * height)" here if you
    # want strictly non-negative coordinates.
    srcX = np.arctan2(float(y), float(x)) * height / np.pi
    return srcX

No noticeable difference in performance in my before/after testing, but it does make for more readable code.