Please comment on the timings for my normalized cross-correlation kernel; I'd like to know if these are reasonable numbers.

I need to apply a 7x7 template to a 640x512 16-bit image several million times, and need to speed this up as much as possible. I started with the normalized cross correlation (NCC) function that Matlab provides, normxcorr2, which was painfully slow. I then found a faster implementation, normxcorr2_mex, written by Daniel Eaton; it is essentially a Matlab wrapper around the OpenCV library's NCC implementation (it can be found on his website here: http://www.cs.ubc.ca/~deaton/remarks_ncc.html).

The speed still wasn't suitable for real-time analysis or better, so I set out to write my first CUDA kernel to perform NCC. As a base I took OpenVIDIA's CUDA Vision Workbench (http://openvidia.sourceforge.net/index.php/OpenVIDIA) and modified its 16-bit 7x7 convolution kernel to do the NCC.

The results on an Intel Core i7 965 Extreme and a GTX 295:
normxcorr2 (native Matlab implementation): 83ms
normxcorr2_mex (Matlab wrapper to OpenCV): 25ms
normxcorr2_cuda (CUDA implementation): 1ms

While I’m happy with the results, perhaps there is more optimization to be done. The OpenCV implementation is only single threaded, while the i7 has 8 cores, so theoretically the CPU implementation could get down to 3-4ms.

To break down the CUDA timing further:
CPU–>GPU: 0.225ms
Computation: 0.522ms
GPU–>CPU: 0.216ms
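As a sanity check on the breakdown above, the three stages should sum to roughly the 1ms end-to-end figure. A trivial host-side helper (my own sketch, not part of the kernel code) makes the arithmetic explicit:

```c
#include <math.h>

/* Per-call cost of the NCC pipeline: sum of the three stage
 * timings (host->device copy, kernel, device->host copy), in ms. */
static double ncc_call_ms(double h2d_ms, double compute_ms, double d2h_ms)
{
    return h2d_ms + compute_ms + d2h_ms;
}
```

With the figures above, 0.225 + 0.522 + 0.216 ≈ 0.963 ms per call, i.e. roughly a thousand correlations per second end to end. Note that the two transfers together cost almost as much as the computation itself, so overlapping transfers with compute (if the workload allows it) is a natural place to look for further gains.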

Do these look like good numbers? Thanks for any input.

The i7 has four cores but can do two threads per core. You won’t get an 8x speedup, though.

Ah you’re right, 4 physical cores but with hyperthreading 8 logical cores. I wonder if it might go faster if I disabled hyperthreading since the CPU implementations aren’t multithreaded?

Any opinions on the speed of my kernel?

Yeah, I’ve seen almost a 50% benefit from Hyperthreading on the Core i7. Running 8 copies of my code gives about 6x the throughput of 1 copy (ignoring TurboBoost). My working set fits quite well into the L2/L3 cache, though, so other code probably won’t do better than this.

i7 HT is quite a bit better than P4 HT, which is nice.

So no comments on the actual run times of my kernel? I guess it’s at least better than anything I’ve seen myself. Accelereyes Jacket and GPUmat have no conv2, let alone normxcorr2, so at the moment my implementation is the only one I’ve seen on the GPU.

Kind of hard to say anything meaningful without code, profiler statistics, etc.

Assuming you’re talking about 8-bit grayscale source and template images, 1ms sounds about right (depending on the GPU)…

Our NCC kernel runs (on average) 50 NCCs with 9x9 templates against 30x30 source images in about 900us-2ms (depending on system load, scheduler saturation, etc) - including calculation of the peak value coordinates (with subpixel correction) for all 50 NCC results - and our implementation isn’t perfect (though optimizing it further is a difficult task, to say the least).

Which is also roughly the same speed you can achieve with a single core on your average Core2 Duo processor (using SSE of course).

So I’d say you’ve probably reached the best performance you’re going to get, without spending far too much effort/time optimizing it further.

I’d paste the code, except it is much too long. I suppose I have a multipart question. Since much of the code is borrowed from the CUDA Vision Workbench (http://openvidia.sourceforge.net/index.php/OpenVIDIA), I’d like an opinion on the quality of that implementation. However, in another post no one offered one.

Assuming they handled the memory well, only my modifications are of interest. The following is what I do for each pixel, reduced to a 3x3 kernel so I can write the code out in its entirety (the 7x7 case is ~6x as long):

// compute local image mean
uint32 sum=0;
uint16 *p=localAreaAddr;
sum+=*p++; sum+=*p++; sum+=*p; p+=pitch;   // row 0
sum+=*p++; sum+=*p++; sum+=*p; p+=pitch;   // row 1
sum+=*p++; sum+=*p++; sum+=*p;             // row 2
const float mean=sum/9.0f;   // 9.0f, not 9.0: avoid a double-precision division

// compute mean-adjusted pixels, adjusted squared sum, and kernel-adjusted product sum
float i2sum=0;
float fsum=0;
float adj;
float *k=kernelAddr;
p=localAreaAddr;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p   - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
p+=pitch;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p   - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
p+=pitch;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p++ - mean; i2sum+=adj*adj; fsum+=*k++ * adj;
adj=*p   - mean; i2sum+=adj*adj; fsum+=*k++ * adj;

// compute result and write to shared memory
// (rsqrtf takes a single argument; the NCC denominator is sqrt(i2sum*kernel2sum))
*sharedMem = fsum*rsqrtf(i2sum*kernel2sum);

CUDA Vision Workbench devotes a lot of code to memory handling, and it seems very fast, so I'm inclined to trust it. But please do download the code and see for yourself.