I’m implementing the template matching algorithm described here:


The amount of data bandwidth is quite large and that is my main concern for performance. For each pixel in the result, I need to access all the pixels in the template image and their corresponding pixels in the source image.

My images are grayscale, so that at least reduces the image memory by a third. Assuming my template images are small enough to fit in constant memory is the best strategy as follows:

  1. Keep template in constant memory.
  2. Load corresponding pixels in the source image into shared memory?

If my thread group size is 16x16, and my template image is say 128x128, there will still be a bunch of redundant global memory read overlap across thread groups.

Is there a better strategy? I can get 1080ti if needed.

Image processing is not my area of expertise. You might want to check whether the desired operation is already implemented by NPP.

It is not clear that constant memory is the best place to keep the template image T. Constant memory is optimized for uniform access (single access with broadcast), where each thread in a warp accesses the same location in constant memory. Any deviation from uniform access will incur additional replays, until all threads have received their data. In that case, shared memory would be a better location.

Loading the currently active patch of the of the source image I into shared memory seems the way to go. Since the patch is moved only one pixel at a time, there should be plenty of data re-use while the patch is “cached” in shared memory ring-buffer style.

How high do you estimate the percentage of redundant global memory reads? Maybe I am missing something, but I wouldn’t expect that percentage to be very large, overlap should only at the boundary between image areas worked on by different thread blocks, and caching may mitigate that.

You have a pretty good approach, all the threads in a warp may access the same template image element each clock cycle making constant memory a good fit…


A template size of 128x128 is very large! Would it be possible to reformulate your problem into a convolution and do:

FFT --> Apply template kernel --> IFFT ?

Otherwise you might want to consider have multiple blocks cooperating on a single data block of say 16x16 with a global reduction in the end. With such an approach you would let a single block handle a portion of the template matching operation.


One concern with FFT. I think I can do the CV_TM_CCORR method for matching, but I am not sure how to do the more complicated ones like CV_TM_CCORR_NORMED in the frequency space.

Also, my FFT is a little rusty. If I have two images A and B, and I transform to frequency space FFT(A) and FFT(B), then the convolution is done in frequency space just by multiplying pixels in FFT(A) with corresponding pixels in FFT(B)?

Pretty much.

The simpleCUFFT cuda sample demonstrates this exactly, for 1D convolution.

Hi @PabloBot, I am also trying to implement template matching for GPU, but speed is same as CPU. If you implemented would you mind share some of your code or give some advice how to implement it.

@Pablo: Normalized cross correlation can be done also via FFT and an additional step using integral images. See paper “Fast Template matching”, Lewis, http://citeseerx.ist.psu.edu/viewdoc/dowSonload?doi=