Please comment on the timings for my normalized cross-correlation kernel; I'd like to know if these are reasonable numbers.

I need to apply a 7x7 template to a 640x512 16-bit image several million times, and need to speed this up as much as possible. I started with the normalized cross correlation (NCC) function that Matlab provides, normxcorr2, which was painfully slow. I then found a faster implementation, normxcorr2_mex, written by Daniel Eaton; it is essentially a Matlab wrapper around the OpenCV library's NCC implementation (it can be found on his website here: http://www.cs.ubc.ca/~deaton/remarks_ncc.html).

The speed still wasn't suitable for real-time analysis or better, so I set out to write my first CUDA kernel to perform NCC. As a base I took OpenVIDIA's CUDA Vision Workbench (http://openvidia.sourceforge.net/index.php/OpenVIDIA) and modified its 16-bit 7x7 convolution kernel to do the NCC.

The results on an Intel Core i7 965 Extreme and a GTX 295:
normxcorr2 (native Matlab implementation): 83ms
normxcorr2_mex (Matlab wrapper to OpenCV): 25ms
normxcorr2_cuda (CUDA implementation): 1ms

While I’m happy with the results, perhaps there is more optimization to be done. The OpenCV implementation is only single threaded, while the i7 has 8 cores, so theoretically the CPU implementation could get down to 3-4ms.

To break down the CUDA timing further:
CPU–>GPU: 0.225ms
Computation: 0.522ms
GPU–>CPU: 0.216ms
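As a sanity check on the breakdown above, the three stages should sum to roughly the 1ms end-to-end figure. A trivial host-side helper (my own sketch, not part of the kernel code) makes the arithmetic explicit:

```c
#include <math.h>

/* Per-call cost of the NCC pipeline: sum of the three stage
 * timings (host->device copy, kernel, device->host copy), in ms. */
static double ncc_call_ms(double h2d_ms, double compute_ms, double d2h_ms)
{
    return h2d_ms + compute_ms + d2h_ms;
}
```

With the figures above, 0.225 + 0.522 + 0.216 ≈ 0.963 ms per call, i.e. roughly a thousand correlations per second end to end. Note that the two transfers together cost almost as much as the computation itself, so overlapping transfers with compute (if the workload allows it) is a natural place to look for further gains.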

Do these look like good numbers? Thanks for any input.

The i7 has four cores but can do two threads per core. You won’t get an 8x speedup, though.

Ah you’re right, 4 physical cores but with hyperthreading 8 logical cores. I wonder if it might go faster if I disabled hyperthreading since the CPU implementations aren’t multithreaded?

Any opinions on the speed of my kernel?

Yeah, I’ve seen almost a 50% benefit from Hyperthreading on the Core i7. Running 8 copies of my code gives about 6x the throughput of 1 copy (ignoring TurboBoost). My working set fits quite well into the L2/L3 cache, though, so other code probably won’t do better than this.

i7 HT is quite a bit better than P4 HT, which is nice.

So no comments on the actual run times of my kernel? I guess it’s at least better than anything I’ve seen myself. Accelereyes Jacket and GPUmat have no conv2, let alone normxcorr2, so at the moment my implementation is the only one I’ve seen on the GPU.

Kind of hard to say anything meaningful without code, profiler statistics, etc.

Assuming you’re talking about 8-bit grayscale source and template images, 1ms sounds about right (depending on the GPU)…

Our NCC kernel runs (on average) 50 NCCs with 9x9 templates against 30x30 source images in about 900us-2ms (depending on system load, scheduler saturation, etc) - including calculation of the peak value coordinates (with subpixel correction) for all 50 NCC results - and our implementation isn’t perfect (though optimizing it further is a difficult task, to say the least).

Which is also roughly the same speed you can achieve with a single core on your average Core2 Duo processor (using SSE of course).

So I’d say you’ve probably reached the best performance you’re going to get, without spending far too much effort/time optimizing it further.

I’d paste the code, except it is much too long. I suppose I have a multipart question. Since much of the code is borrowed from the CUDA Vision Workbench (http://openvidia.sourceforge.net/index.php/OpenVIDIA), I’d like an opinion on the quality of that implementation. However, in another post no one offered one.

Assuming they handled the memory well, only my modifications are of interest. The following is what I do for each pixel, reduced to a 3x3 kernel so I can write the code out in its entirety (the 7x7 case is ~6x as long):

// compute local image mean
uint32 sum=0;
uint16 *p=localAreaAddr;
sum+=*p++; sum+=*p++; sum+=*p; p+=pitch;   // row 0
sum+=*p++; sum+=*p++; sum+=*p; p+=pitch;   // row 1
sum+=*p++; sum+=*p++; sum+=*p;             // row 2
const float mean=sum/9.0f;   // 9.0f, not 9.0: avoid a double-precision division

// compute mean-adjusted pixels, adjusted squared sum, and kernel-adjusted product sum
float i2sum=0;
float fsum=0;
float adj;
float *k=kernelAddr;
p=localAreaAddr;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p   - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
p+=pitch;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p   - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
p+=pitch;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p   - mean; i2sum+=adj*adj; fsum+=*k++ * adj;

// compute result and write to shared memory
// (rsqrtf takes a single argument; the NCC denominator is sqrt(i2sum*kernel2sum))
*sharedMem = fsum*rsqrtf(i2sum*kernel2sum);

CUDA Vision Workbench devotes a lot of code to memory handling, and it seems very fast, so I'm inclined to trust it. But please do download the code and see for yourself.