2D cross correlation


I’m looking to do 2D cross-correlation on some image sets. Typical image resolution is VGA, with maybe a 100x200 template. I’m looking into OpenVIDIA, but it appears to support only small templates. I was hoping somebody could comment on the availability of any libraries or example code for my task and, if there are none, on the suitability of the task for GPU acceleration. I am currently using the Intel IPP ippiCrossCorr function, but I’m hoping to accelerate it with CUDA.


Speaking from personal experience, implementing a 2D normalised cross-correlation is relatively efficient, although I’ve yet to get ‘significant’ performance improvements over IPP: depending on the pixel format of the image/template and the size of the template, I’m anywhere between 20% faster and 20% slower. So yes, you’d likely see some improvement from using CUDA.

As for existing libraries that implement this, I’m not aware of any…

20% over IPP is a bit disappointing. Do you have any idea where the bottleneck is?

FYI, my images are 8-bit greyscale and the correlation doesn’t need to be normalised. I could in fact get away with a sum of absolute differences; perhaps there is code for this, as it’s a common video compression step.

Here is a stereo matching CUDA kernel that uses SSD. It may be of use to you. It is from NVIDIA.

You could use the __usad intrinsic, which computes |x - y| + z on unsigned integers. See the documentation for this. It’s pretty fast.
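To pin down what a SAD search computes, here is a minimal CPU reference in plain Python. It is only a sketch of the arithmetic, not a CUDA kernel: `usad` is a hypothetical stand-in that mirrors the `|x - y| + z` behaviour of `__usad`, and `sad_match` and the toy data are made up for illustration. On the GPU, each (row, col) offset would typically be handled by its own thread.

```python
def usad(x, y, z):
    """CPU stand-in for CUDA's __usad(x, y, z): returns |x - y| + z."""
    return abs(x - y) + z


def sad_match(image, template):
    """Return (row, col, sad) of the offset with the lowest sum of absolute differences.

    Only offsets where the template lies fully inside the image are tested.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            acc = 0
            # Accumulate |pixel - template| exactly as a chain of __usad calls would.
            for y in range(th):
                for x in range(tw):
                    acc = usad(image[r + y][c + x], template[y][x], acc)
            if best is None or acc < best[2]:
                best = (r, c, acc)
    return best


# Toy 8-bit data: the template appears exactly once, at row 1, column 1.
image = [
    [10, 10, 10, 10, 10],
    [10, 50, 60, 10, 10],
    [10, 70, 80, 10, 10],
    [10, 10, 10, 10, 10],
]
template = [[50, 60], [70, 80]]

print(sad_match(image, template))  # (1, 1, 0)
```

Unlike correlation, the best match here is the minimum score, and a perfect match gives a SAD of zero.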

Perhaps look for another algorithm?
I implemented a really fast 2D correlation function with CUDA’s FFT library.
Good old style: iFFT(FFT(image) * conj(FFT(what you’re looking for))). The complex conjugate is what makes this a correlation rather than a convolution.
Do the normalization on the phase, then search for maxima, then run a traditional normalized correlation at those maxima to confirm the result.
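For reference, the frequency-domain recipe can be sketched on the CPU in plain Python. This is not the poster's code and does not use cuFFT: the small radix-2 `fft`, the helper names and the toy data are all assumptions made for illustration. It zero-pads both inputs to a power-of-two size, multiplies one spectrum by the conjugate of the other, and inverts. The phase normalization and the confirming pass mentioned above are left out.

```python
import cmath


def fft(a, inverse=False):
    """Unscaled recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2(m, inverse=False):
    """2D FFT: transform every row, then every column."""
    rows = [fft(row, inverse) for row in m]
    cols = [fft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def next_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p


def xcorr2_fft(image, template):
    """Unnormalised cross-correlation at every fully overlapping offset."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    # Pad to at least image + template - 1 so circular wrap-around cannot alias.
    h, w = next_pow2(ih + th - 1), next_pow2(iw + tw - 1)

    def pad(m):
        return [[m[r][c] if r < len(m) and c < len(m[0]) else 0
                 for c in range(w)] for r in range(h)]

    F, G = fft2(pad(image)), fft2(pad(template))
    # Conjugating the template spectrum gives correlation, not convolution.
    prod = [[F[r][c] * G[r][c].conjugate() for c in range(w)] for r in range(h)]
    corr = fft2(prod, inverse=True)
    # The inverse transform above is unscaled, so divide by h * w here.
    return [[corr[r][c].real / (h * w) for c in range(iw - tw + 1)]
            for r in range(ih - th + 1)]


def peak(scores):
    """(row, col) of the largest score."""
    cells = ((r, c) for r in range(len(scores)) for c in range(len(scores[0])))
    return max(cells, key=lambda rc: scores[rc[0]][rc[1]])


# Toy data: an 8x8 black image with the template pasted at row 2, column 3.
template = [[1, 2], [3, 4]]
image = [[0] * 8 for _ in range(8)]
for y in range(2):
    for x in range(2):
        image[2 + y][3 + x] = template[y][x]

scores = xcorr2_fft(image, template)
print(peak(scores))  # (2, 3)
```

The toy image has a black background on purpose: without normalization, the raw peak drifts towards bright regions, which is why a normalized pass is suggested as a final check.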

I should note that my kernels still have room for optimization. They’re currently compute bound: each thread runs a loop of 9*9 iterations (in my case, a 9x9 template against 30x30 source regions).

That kernel currently reaches only 66% occupancy, and I could use more tricks to share results between threads. Fixing these would (a) give me more processing power to work with and (b) reduce the kernel’s instruction count by up to 9 times in this case.

I’m guessing I could probably get double the speed of IPP if it were a priority for me - and far more than double the speed for larger source/template sizes - however the cross correlation in my application isn’t the largest bottleneck - thus I haven’t given it that much attention.

Hi DaManu, is there any chance of posting your code? If not what sort of performance are you seeing?

Hi, I have implemented NCC on the GPU using cuFFT. For smaller template and image sizes I am getting a 1.5x to 2x speedup, and for larger sizes a speedup of 6x is achievable. In my experience, doing the correlation on the GPU and the normalization on the CPU gives good results.

  • Small: image = 32x32, template = 16x16

  • Larger: image = 512x512, template = 64x64
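The split between a correlation stage and a normalization stage can be sketched as follows, in plain Python on the CPU. This is a guess at the structure, not the poster's implementation: `raw_xcorr` stands in for whatever the GPU stage produces, the plain (non-zero-mean) NCC definition is assumed, and all names and data are hypothetical. The normalization uses an integral image of squared pixels, so each window energy costs four lookups.

```python
import math


def raw_xcorr(image, template):
    """Stand-in for the GPU stage: unnormalised correlation at each valid offset."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    return [[sum(image[r + y][c + x] * template[y][x]
                 for y in range(th) for x in range(tw))
             for c in range(iw - tw + 1)]
            for r in range(ih - th + 1)]


def normalise(raw, image, template):
    """CPU stage: divide each score by sqrt(window energy * template energy)."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    t_energy = sum(v * v for row in template for v in row)
    # Integral image of squared pixels, with a zero border row and column.
    S = [[0] * (iw + 1) for _ in range(ih + 1)]
    for r in range(ih):
        for c in range(iw):
            S[r + 1][c + 1] = image[r][c] ** 2 + S[r][c + 1] + S[r + 1][c] - S[r][c]
    out = []
    for r, row in enumerate(raw):
        out_row = []
        for c, v in enumerate(row):
            w_energy = S[r + th][c + tw] - S[r][c + tw] - S[r + th][c] + S[r][c]
            d = math.sqrt(w_energy * t_energy)
            out_row.append(v / d if d > 0 else 0.0)  # empty windows score 0
        out.append(out_row)
    return out


def peak(scores):
    """(row, col) of the largest score."""
    cells = ((r, c) for r in range(len(scores)) for c in range(len(scores[0])))
    return max(cells, key=lambda rc: scores[rc[0]][rc[1]])


# Toy data: a bright distractor in the top-left, the true match at row 2, column 3.
image = [
    [9, 9, 0, 0, 0],
    [9, 9, 0, 0, 0],
    [0, 0, 0, 1, 2],
    [0, 0, 0, 3, 4],
]
template = [[1, 2], [3, 4]]

raw = raw_xcorr(image, template)
ncc = normalise(raw, image, template)
print(peak(raw), peak(ncc))  # (0, 0) (2, 3)
```

The raw scores pick the bright block, while the normalized scores pick the true match, which is the reason for normalizing at all.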

What is the easiest way to implement the classical iFFT[FFT(image1)*FFT(image2)]?

Another question:

Does anyone have a link to an example where a correlation is done with CUDA?

The typical difference between correlation in the time (or spatial) domain and correlation in the frequency domain is the cost: it goes from O(n^2) to O(n log2(n)), assuming you can zero-pad your data to a size of 2^k.
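To put rough numbers on this for the sizes in the original question, here is a back-of-the-envelope count in plain Python. The cost model is an assumption: one unit per multiply-add in the direct method, and N*log2(N) per 2D FFT with constant factors ignored. The 100x200 template is also assumed to be 100 wide and 200 tall. Treat the result as an order-of-magnitude guide only.

```python
def next_pow2(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p


def direct_ops(iw, ih, tw, th):
    """Multiply-adds for direct correlation over all fully overlapping offsets."""
    return (iw - tw + 1) * (ih - th + 1) * tw * th


def fft_ops(iw, ih, tw, th):
    """Rough count: three 2D FFTs at N*log2(N) each, plus N pointwise products."""
    pw = next_pow2(iw + tw - 1)  # padded width
    ph = next_pow2(ih + th - 1)  # padded height
    n = pw * ph
    log2_n = n.bit_length() - 1  # exact, because n is a power of two
    return 3 * n * log2_n + n


# VGA image (640x480) with a template assumed to be 100 wide and 200 tall.
d = direct_ops(640, 480, 100, 200)
f = fft_ops(640, 480, 100, 200)
print(d, f, round(d / f, 1))
```

Under this model the direct method needs about 3.0e9 operations and the FFT route about 6.4e7 after padding to 1024x1024. For a small template such as 9x9 the ordering flips, which fits the earlier posts where direct kernels were used for small templates.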