Is there a way to make camera undistort (cv2 fisheye) happen faster?

I went through the steps to calibrate my camera’s wide angle lens using cv2’s fisheye module.

I basically followed the instructions on https://medium.com/@kennethjiang/calibrate-fisheye-lens-using-opencv-333b05afa0b0 and then added the undistorting function into JetBot’s camera.py

Err… surprise! The frame rate dropped to crap. The CPU load average went to around 5 and top showed python3 taking about 125% CPU. (power mode 10W)

Sooo… I’m hoping that since I’m on a Tegra… there’s a way to tell the GPU to help out OpenCV?

The relevant code snippet looks like this

import numpy as np
import cv2

def undistort_image(img, DIM, K, D, balance=0.0, dim2=None, dim3=None):
    dim1 = img.shape[:2][::-1]  # dim1 is the dimension of the input image to un-distort
    if not dim2:
        dim2 = dim1
    if not dim3:
        dim3 = dim1
    scaled_K = K * dim1[0] / DIM[0]  # The values in K scale with the image dimensions.
    scaled_K[2][2] = 1.0  # Except that K[2][2] is always 1.0
    # This is how scaled_K, dim2 and balance are used to determine the final K used to un-distort the image. The OpenCV documentation doesn't make this clear!
    new_K = cv2.fisheye.estimateNewCameraMatrixForUndistortRectify(scaled_K, D, dim2, np.eye(3), balance=balance)
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(scaled_K, D, np.eye(3), new_K, dim3, cv2.CV_16SC2)
    undistorted_img = cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT)
    return undistorted_img # this returns a np array

Calling cv2.fisheye.undistortImage(…) directly doesn’t work; I just get an all-black image. I’ve tried various small optimizations, but nothing gives a significant increase in frame rate.

Hi frank26080115,

The tutorial you’re referring to is from 2017; we never tested it and can’t provide suggestions.
Other developers may share their experience if they have tried this before.

Thanks

@frank26080115

Most of the calls in your “undistort_image” function do not actually use the image data, so they can be moved out of the function. Depending on how expensive the estimateNewCameraMatrixForUndistortRectify and initUndistortRectifyMap calls are, this may save a bit of performance.

The remap call will likely still account for most of the cost. It might be possible to improve performance by using cv2.INTER_NEAREST, though this will produce a lower-quality image.

Those two calls are SUPER EXPENSIVE and should only be done once. Once you have an undistortion map, save it to disk so you can just load the map data the next time the program starts.
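For example, a minimal sketch along those lines (assuming the DIM, K and D values from the calibration step, that the live frames have the same dimensions as DIM so no K rescaling is needed, and a placeholder file name):

import numpy as np
import cv2

MAP_FILE = "fisheye_maps.npz"  # placeholder file name

def build_undistort_maps(DIM, K, D, balance=0.0):
    # The expensive part: done once, then cached to disk.
    new_K = cv2.fisheye.estimateNewCameraMatrixForUndistortRectify(K, D, DIM, np.eye(3), balance=balance)
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(K, D, np.eye(3), new_K, DIM, cv2.CV_16SC2)
    np.savez(MAP_FILE, map1=map1, map2=map2)
    return map1, map2

def load_undistort_maps():
    data = np.load(MAP_FILE)
    return data["map1"], data["map2"]

def undistort_frame(img, map1, map2):
    # The only per-frame work; cv2.INTER_NEAREST is cheaper still, at some quality cost.
    return cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT)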

The map just stores, for each output pixel, the source pixel coordinates to sample from. The OpenCV implementation isn’t super fast, so you can actually gain some CPU cycles by implementing it in raw C with a cache-aware optimized implementation.

But given that the map is in such a simple format, you should be able to write a simple shader in OpenGL or CUDA to sample the right pixels. If the image is already available on the GPU, this is likely to run even faster. However, if you have to upload the image to the GPU just to do the undistort, that upload will cost more than simply doing the remap on the CPU with a reasonable C implementation. The algorithm is really quite simple and runs fast, even using linear interpolation on the undistort map.
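If OpenCV on the Jetson happens to be built with CUDA support (the cv2.cuda module), the remap can also be pushed to the GPU without writing a custom kernel. A rough sketch, assuming the K, D and DIM values from the calibration step, with float32 maps so they can be uploaded directly; the upload/download copies are exactly the cost mentioned above:

import numpy as np
import cv2

# One-time setup: generate float32 maps (cv2.cuda.remap expects CV_32FC1 maps) and upload them.
new_K = cv2.fisheye.estimateNewCameraMatrixForUndistortRectify(K, D, DIM, np.eye(3), balance=0.0)
map1, map2 = cv2.fisheye.initUndistortRectifyMap(K, D, np.eye(3), new_K, DIM, cv2.CV_32FC1)
gpu_map1 = cv2.cuda_GpuMat(); gpu_map1.upload(map1)
gpu_map2 = cv2.cuda_GpuMat(); gpu_map2.upload(map2)

def undistort_frame_gpu(img):
    gpu_img = cv2.cuda_GpuMat()
    gpu_img.upload(img)  # this host-to-device copy may dominate if the frame isn't already on the GPU
    gpu_out = cv2.cuda.remap(gpu_img, gpu_map1, gpu_map2, interpolation=cv2.INTER_LINEAR)
    return gpu_out.download()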

Thank you! Caching map1 and map2 did the trick and the framerate is back to being good.

On Tegra you can use shared CPU/GPU memory and just pass a pointer around. The Jetson inference library does this to avoid the expensive copy the other dog is referring to.

You can examine how it’s done there, and it works even in Python. You can load an image from disk, pass around a container with a pointer to that memory, and it’s faaaaast.

Found a forum thread:

https://devtalk.nvidia.com/default/topic/932957/zero-copy-and-managedmemory-%20on-jetson/?offset=2
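To illustrate the idea (this is not the jetson-inference API; it’s a hedged sketch using Numba’s mapped allocation, which uses CUDA pinned+mapped memory and so, on Tegra, should land in the same physical memory the GPU sees; the shape and kernel are just placeholders):

import numpy as np
from numba import cuda

@cuda.jit
def invert_kernel(img):
    # Trivial per-pixel kernel, only here to show the GPU touching the shared buffer.
    i, j = cuda.grid(2)
    if i < img.shape[0] and j < img.shape[1]:
        img[i, j] = 255 - img[i, j]

# Zero-copy / mapped allocation: the CPU writes into it, the kernel reads it, no cudaMemcpy.
frame = cuda.mapped_array((720, 1280), dtype=np.uint8)
frame[:] = 128  # e.g. the camera pipeline filling the frame on the CPU side

threads = (16, 16)
blocks = ((frame.shape[0] + 15) // 16, (frame.shape[1] + 15) // 16)
invert_kernel[blocks, threads](frame)
cuda.synchronize()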

do we have to sniff butts now?

Haha. Only if you want to.

Something worth mentioning from that thread is that in this zero-copy mode the CPU and GPU caches are disabled, so if you do end up experimenting with it, you may want to time things. In the case of dusty_nv’s Jetson inference it’s much faster with zero-copy, iirc he mentions it somewhere.

Edit: found a link with some example code.
https://arrayfire.com/zero-copy-on-tegra-k1/

It’s the AGP “USWC” write-combined aperture, come back to haunt us!

I never wrote anything for GPUs during that era, so I guess I missed out on how all that works. The closest I came was getting my laptop’s AGP GPU to work in Linux, which at the time required kernel patching since it wasn’t very well supported. It was an AMD IGP of some sort and not very good. My first AGP desktop didn’t run accelerated X at all (4 MB IGP). I think I ended up using a PCI video card for a while since it had no AGP slots. Hard times. I ended up putting a Voodoo2 in there as well, but I don’t recall if it ever worked in Linux or not. That card was fast af.