I’m looking to do 2D cross correlation on some image sets. Typical image resolution is VGA, with a template of maybe 100x200. I’m looking into OpenVIDIA, but it appears to support only small templates. I was hoping somebody could comment on the availability of any libraries/example code for my task and, if there are none, on the suitability of the task for GPU acceleration. I am currently using the Intel IPP ippiCrossCorr function, but I’m hoping to accelerate it with CUDA.

Speaking from personal experience, implementing a 2D normalised cross-correlation is relatively efficient, although I’ve yet to get ‘significant’ performance improvements over IPP: depending on the pixel format of the image/template and the size of the template, I’m anywhere from 20% faster to 20% slower. So yes, you’d likely see some speed-up from using CUDA.

As for existing libraries that implement this, I’m not aware of any…

20% over IPP is a bit disappointing. Do you have any idea where the bottleneck is?

FYI, my images are 8-bit greyscale and the correlation doesn’t need to be normalised. I could in fact get away with sum of absolute differences; perhaps there is existing code for this, as it’s a common video compression step.
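To make the sum-of-absolute-differences idea concrete, here is a minimal brute-force sketch in plain Python (not CUDA, and not taken from any library). The function name `sad_search` and the tiny test image are made up for illustration; it assumes 2D lists of integer grey levels and only tries placements where the template fits entirely inside the image.

```python
def sad_search(image, template):
    """Return (best_sad, row, col) for the placement with the lowest SAD.

    Hypothetical helper: image and template are 2D lists of grey levels.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for r in range(ih - th + 1):          # every row offset where the template fits
        for c in range(iw - tw + 1):      # every column offset where the template fits
            sad = sum(
                abs(image[r + y][c + x] - template[y][x])
                for y in range(th)
                for x in range(tw)
            )
            if best is None or sad < best[0]:
                best = (sad, r, c)
    return best


image = [
    [10, 10, 10, 10, 10],
    [10, 10, 50, 60, 10],
    [10, 10, 70, 80, 10],
    [10, 10, 10, 10, 10],
]
template = [[50, 60], [70, 80]]
print(sad_search(image, template))  # (0, 1, 2): exact match at row 1, column 2
```

On a GPU the two outer loops would presumably map to threads, with each thread running the inner sum for one offset.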

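Perhaps look at a different algorithm?
I implemented a really fast 2D correlation function with CUDA’s FFT library (CUFFT).
Good old style: iFFT(FFT(image) * conj(FFT(what you’re looking for))). The conjugate is what makes it a correlation rather than a convolution.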
Normalize the product so that only the phase remains, then search for maxima, then run a traditional normalized correlation at the candidate positions to confirm the match?
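For what it’s worth, here is a minimal sketch of the FFT route in plain Python rather than CUDA/CUFFT, just to pin down the arithmetic. The names (`next_pow2`, `fft`, `fft2`, `fft_correlate`) are made up for illustration, the FFT is a naive recursive radix-2, and I’m assuming the template spectrum is conjugated so the result is a correlation rather than a convolution. The phase normalization and the verification pass are left out.

```python
import cmath


def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def fft(x, inverse=False):
    """Unscaled recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2(a, inverse=False):
    """2D transform: 1D FFT over every row, then over every column."""
    rows = [fft(row, inverse) for row in a]
    cols = [fft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def pad(a, h, w):
    """Zero-pad a 2D list to h rows by w columns."""
    return [[a[y][x] if y < len(a) and x < len(a[0]) else 0
             for x in range(w)] for y in range(h)]


def fft_correlate(image, template):
    """Unnormalised cross-correlation at every offset where the template fits."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    ph, pw = next_pow2(ih), next_pow2(iw)
    f = fft2(pad(image, ph, pw))
    g = fft2(pad(template, ph, pw))
    prod = [[f[y][x] * g[y][x].conjugate() for x in range(pw)] for y in range(ph)]
    c = fft2(prod, inverse=True)
    scale = ph * pw  # the inverse transform above is unscaled
    return [[c[y][x].real / scale for x in range(iw - tw + 1)]
            for y in range(ih - th + 1)]


template = [[1, 2], [3, 4]]
image = [[0] * 5 for _ in range(6)]
image[2][1], image[2][2], image[3][1], image[3][2] = 1, 2, 3, 4
scores = fft_correlate(image, template)
print(max((v, y, x) for y, row in enumerate(scores) for x, v in enumerate(row)))
```

Only offsets where the template lies fully inside the image are returned, so padding up to the next power of two at or above the image size is enough to keep wrap-around out of the result.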

I should note that my kernels still have room for optimization. They’re currently compute bound: each thread runs a loop that iterates 9*9 times in my case (a 9x9 template against 30x30 source regions).
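As a rough picture of that per-thread loop, here is a plain-Python sketch (not the actual kernel, which isn’t shown in this thread). `correlate_at` stands in for the work of one thread and `correlate_valid` for the grid of threads; both names are hypothetical, and normalisation is left out to keep it short.

```python
def correlate_at(source, template, oy, ox):
    """One output value: th * tw multiply-adds (9 * 9 = 81 for a 9x9 template)."""
    acc = 0
    for y in range(len(template)):
        for x in range(len(template[0])):
            acc += source[oy + y][ox + x] * template[y][x]
    return acc


def correlate_valid(source, template):
    """Evaluate every offset where the template fits inside the source region."""
    sh, sw = len(source), len(source[0])
    th, tw = len(template), len(template[0])
    return [[correlate_at(source, template, oy, ox)
             for ox in range(sw - tw + 1)]
            for oy in range(sh - th + 1)]


source = [[1] * 30 for _ in range(30)]
template = [[1] * 9 for _ in range(9)]
result = correlate_valid(source, template)
print(len(result), len(result[0]), result[0][0])  # 22 22 81
```

Neighbouring offsets read almost the same source pixels, which is the overlap that sharing results between threads would exploit.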

That kernel currently has only 66% occupancy, and I could do more tricks to share results between threads: a) giving me more processing power to work with, and b) reducing the kernel’s instruction count by up to 9 times in this case.

I’m guessing I could probably get double the speed of IPP if it were a priority for me, and far more than double for larger source/template sizes. However, the cross correlation isn’t the largest bottleneck in my application, so I haven’t given it that much attention.

The typical difference between correlation in the time domain and correlation in the frequency domain is going from O(n^2) to O(n log2(n)), assuming you can zero-pad your data to a size of 2^k.
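To put rough numbers on that for the sizes in the original question, here is a back-of-the-envelope sketch in plain Python. The helper names are made up, I’m assuming the template is 100 wide by 200 tall, and the FFT figure is just N*log2(N) per transform for three transforms (two forward, one inverse) with all constant factors and memory traffic ignored.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def direct_ops(img_w, img_h, tpl_w, tpl_h):
    """Multiply-adds for a direct correlation over every fully overlapping offset."""
    positions = (img_w - tpl_w + 1) * (img_h - tpl_h + 1)
    return positions * tpl_w * tpl_h


def fft_ops(img_w, img_h, transforms=3):
    """Rough N * log2(N) count per transform on the zero-padded image."""
    n = next_pow2(img_w) * next_pow2(img_h)
    log2_n = n.bit_length() - 1  # exact because n is a power of two
    return transforms * n * log2_n


print(next_pow2(640), next_pow2(480))   # 1024 512
print(direct_ops(640, 480, 100, 200))   # 3040420000
print(fft_ops(640, 480))                # 29884416
```

On paper that is about two orders of magnitude in favour of the FFT for a template this large; for small templates such as 9x9 the direct count is far lower and the gap can close or reverse.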