I need to apply a 7x7 template to a 640x512 16-bit image several million times and want to speed it up as much as possible. I started with the normalized cross-correlation (NCC) function that Matlab provides, normxcorr2, which was far too slow. I then found a faster implementation, normxcorr2_mex, written by Daniel Eaton, which is basically a Matlab wrapper around the OpenCV library’s implementation of the NCC (it can be found on his website: http://www.cs.ubc.ca/~deaton/remarks_ncc.html).

The speed still wasn’t suitable for real-time or better analysis, so I sought to implement my first CUDA kernel to perform NCC. As a base I took OpenVIDIA’s CUDA Vision Workbench (http://openvidia.sourceforge.net/index.php/OpenVIDIA) and modified their 16-bit 7x7 convolution kernel to do the NCC.
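
For reference, the quantity being computed can be sketched in a few lines of plain Python. This is not the CUDA kernel from the post, only a minimal, slow illustration of the standard zero-mean NCC formula, assuming the image and template are lists of rows. The names ncc_at and ncc_map are made up for this sketch, and it only scores positions where the template fits entirely inside the image, whereas normxcorr2 also returns the padded border.

```python
import math

def ncc_at(image, template, top, left):
    """Zero-mean NCC between the template and the same-sized patch at (top, left)."""
    th, tw = len(template), len(template[0])
    patch = [image[top + r][left + c] for r in range(th) for c in range(tw)]
    tmpl = [v for row in template for v in row]
    n = len(tmpl)
    mean_p = sum(patch) / n
    mean_t = sum(tmpl) / n
    # Numerator: correlation of the mean-subtracted patch and template.
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(patch, tmpl))
    # Denominator: product of the two mean-subtracted norms.
    var_p = sum((p - mean_p) ** 2 for p in patch)
    var_t = sum((t - mean_t) ** 2 for t in tmpl)
    denom = math.sqrt(var_p * var_t)
    # A flat patch or flat template has no defined correlation; use 0 here.
    return num / denom if denom > 0 else 0.0

def ncc_map(image, template):
    """NCC score for every position where the template fits inside the image."""
    th, tw = len(template), len(template[0])
    rows = len(image) - th + 1
    cols = len(image[0]) - tw + 1
    return [[ncc_at(image, template, r, c) for c in range(cols)] for r in range(rows)]

# Tiny synthetic example: cut a 3x3 template out of a 5x5 image at row 1, column 2.
image = [[(r * 7 + c * 13 + r * c * 3) % 17 for c in range(5)] for r in range(5)]
template = [row[2:5] for row in image[1:4]]
scores = ncc_map(image, template)
print(scores[1][2])  # score at the location the template was cut from
```

Every score lies in [-1, 1], and the score at the position the template was cut from is 1 up to floating-point rounding. A GPU version computes the same per-pixel sums, just with one thread per output position.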

The results on an Intel Core i7 965 Extreme and a GTX 295:

normxcorr2 (native Matlab implementation): 83ms

normxcorr2_mex (Matlab wrapper to OpenCV): 25ms

normxcorr2_cuda (CUDA implementation): 1ms

While I’m happy with the results, there may be more optimization to be done. The OpenCV implementation is single-threaded, while the i7 has 4 cores (8 hardware threads with Hyper-Threading), so in theory a multithreaded CPU implementation could get down to roughly 3-6ms.
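
The back-of-the-envelope estimate above can be written out as follows. This is only a sketch that assumes perfect linear scaling from the measured 25ms single-threaded time, which real code rarely achieves because of memory bandwidth and threading overhead. The function name ideal_parallel_ms is made up for illustration.

```python
def ideal_parallel_ms(single_thread_ms, workers):
    """Best-case time if the work splits perfectly across the given number of workers."""
    return single_thread_ms / workers

# 25 ms is the measured normxcorr2_mex time from the post.
for workers in (4, 8):
    print(workers, "workers:", ideal_parallel_ms(25.0, workers), "ms")
```

Under that assumption the estimate is 6.25ms for 4 workers and 3.125ms for 8, so even the ideal multithreaded CPU figure stays several times slower than the 1ms CUDA result.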

To break down the CUDA timing further:

CPU->GPU: 0.225ms

Computation: 0.522ms

GPU->CPU: 0.216ms
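
One way to put these figures in context is to turn them into an effective transfer rate and a share of total time. The sketch below assumes the upload consists only of the 640x512 16-bit image (the 7x7 template is negligible). The size of the result copied back is not stated in the post, so no download rate is computed.

```python
WIDTH, HEIGHT, BYTES_PER_PIXEL = 640, 512, 2  # 16-bit image from the post

# Timings from the post, in milliseconds.
upload_ms, compute_ms, download_ms = 0.225, 0.522, 0.216

def bandwidth_gb_per_s(num_bytes, elapsed_ms):
    """Effective transfer rate in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (elapsed_ms * 1e-3) / 1e9

image_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
total_ms = upload_ms + compute_ms + download_ms
upload_gbps = bandwidth_gb_per_s(image_bytes, upload_ms)
transfer_fraction = (upload_ms + download_ms) / total_ms

print(f"image size: {image_bytes} bytes")
print(f"total: {total_ms:.3f} ms")
print(f"upload bandwidth: {upload_gbps:.2f} GB/s")
print(f"transfer share of total: {transfer_fraction:.0%}")
```

With these inputs the three stages sum to 0.963ms, the upload works out to about 2.9 GB/s, and roughly 46% of the total is spent moving data rather than computing.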

Do these look like good numbers? Thanks for any input.