Finding a small image in a large image C#, Cudafy, compute capability 1.2


I'd like to implement a fine matching algorithm in Cudafy.

I have a GeForce 210 graphics card with compute capability 1.2.

I copy the large and the small image into device memory. After that, I would like to find the small picture inside the large picture.

For example: small picture: 50*50, large picture: 150*150

In the first step I take coordinate (0,0) of the large picture and coordinate (0,0) of the small picture.

I select a 50*50 area in the large picture and compare the two same-sized pictures.

Then I step to the next pixel of the large picture, which is (1,0), and run the same comparison, then (2,0), (3,0)…

The images are in 24bit format (RGB without Alpha).

I need to know how many pixels match at each position (x,y) of the large picture, and I try to store that count in my histogram. For example: at coordinate (10,10) we had 2465 matching pixels (the maximum is 2500, because the small picture has 50*50 pixels).
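A plain CPU reference of the search described above can help when checking the GPU histogram. The following is a minimal sketch, not the original poster's code: the names `FineMatchReference` and `CountMatches` are made up for illustration, and it assumes both images are tightly packed 24-bit buffers with no row padding. It counts `large width - small width + 1` offsets per row so that the last valid position is included.

```csharp
using System;

public static class FineMatchReference
{
    // For every offset (x, y) of the small image inside the large one, counts
    // the pixels whose three channels all differ by less than `threshold`.
    // Assumption: 3 bytes per pixel, rows stored back to back (no padding).
    public static int[] CountMatches(byte[] small, int smallW, int smallH,
                                     byte[] large, int largeW, int largeH,
                                     int threshold)
    {
        int posW = largeW - smallW + 1; // number of valid horizontal offsets
        int posH = largeH - smallH + 1; // number of valid vertical offsets
        var histogram = new int[posW * posH];

        for (int y = 0; y < posH; y++)
        {
            for (int x = 0; x < posW; x++)
            {
                int same = 0;
                for (int sy = 0; sy < smallH; sy++)
                {
                    for (int sx = 0; sx < smallW; sx++)
                    {
                        int s = (sy * smallW + sx) * 3;
                        int l = ((y + sy) * largeW + (x + sx)) * 3;
                        if (Math.Abs(small[s] - large[l]) < threshold &&
                            Math.Abs(small[s + 1] - large[l + 1]) < threshold &&
                            Math.Abs(small[s + 2] - large[l + 2]) < threshold)
                        {
                            same++;
                        }
                    }
                }
                histogram[y * posW + x] = same;
            }
        }
        return histogram;
    }
}
```

For the 150*150 / 50*50 example this produces a 101*101 histogram whose largest possible entry is 2500.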

My launch looks like this:

int width = inputLarge.Width - inputSmall.Width;    // large picture width - small picture width (pixels)
int height = inputLarge.Height - inputSmall.Height; // large picture height - small picture height (pixels)

dim3 gridDim = new dim3(width, height * inputSmall.Height);
dim3 blockDim = new dim3(inputSmall.Width);

gpu.Launch(gridDim, blockDim, "gpuFineMatchLinear", device_small, device_large, device_histogram, width, height, treshold, inputLarge.Width, inputSmall.Width, inputSmall.Height);

public static void gpuFineMatch(GThread thread, byte[] small, byte[] large, int[] histogram,
    int width, int height, int treshold, int largeWidth, int smallWidth, int smallHeight)
{
    //--> position of histogram and large picture
    int hx = thread.blockIdx.x / smallHeight;
    int hy = thread.blockIdx.y;

    //--> position of small picture
    int kx = thread.threadIdx.x;
    int ky = thread.blockIdx.x % smallHeight;

    //--> offsets
    int histoOFF = hy * width + hx;
    int largePicOFF = (hy * 3 * largeWidth) + hx * 3;
    int smallPicOFF = (ky * 3 * smallWidth) + kx * 3;

    int same = 0;

    // ertek = per-channel difference between the two pictures
    int ertek = (int)small[smallPicOFF] - large[largePicOFF];
    same += (ertek < treshold && ertek > -treshold) ? 1 : 0;

    ertek = (int)small[smallPicOFF + 1] - large[largePicOFF + 1];
    same += (ertek < treshold && ertek > -treshold) ? 1 : 0;

    ertek = (int)small[smallPicOFF + 2] - large[largePicOFF + 2];
    same += (ertek < treshold && ertek > -treshold) ? 1 : 0;

    // count the pixel only if all three channels are within the threshold
    if (same == 3) thread.atomicAdd(ref histogram[histoOFF], 1);
}


I hope you can answer me.

I do have a working solution, but it is CPU-heavy: for every pixel position in the large image I Clone() a sub-picture and only then use the GPU, so it is very slow.

Sorry for my bad English; I am writing from Hungary.

Thank you!


Your problem description is somewhat similar to a motion vector search in video coding. An optical flow algorithm might also give you usable results. There are several existing CUDA implementations of both.

Some general advice without looking at your specific code:

From an engineering point of view, what you need to do is compute the correlation coefficient of the image snippet with the complete image for all (x,y) coordinate offsets that produce some overlap. For an ideal match this correlation is 1. For a “good enough” match it is close to 1.
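To make the correlation idea concrete, here is a small CPU sketch of the coefficient for a single offset. It is only an illustration under stated assumptions: the images are 8-bit grayscale and tightly packed, the window lies fully inside the large image, and the names `PatchCorrelation` and `At` are hypothetical. It uses the Pearson correlation (means subtracted, normalized by both variances), which is one common way to get a value that reaches 1 for a perfect match.

```csharp
using System;

public static class PatchCorrelation
{
    // Pearson correlation between `small` and the same-sized window of
    // `large` whose top-left corner is at (x, y).
    // Returns 0 when either patch is completely flat (zero variance).
    public static double At(byte[] small, int smallW, int smallH,
                            byte[] large, int largeW, int x, int y)
    {
        int n = smallW * smallH;
        double sumS = 0, sumL = 0;
        for (int sy = 0; sy < smallH; sy++)
        {
            for (int sx = 0; sx < smallW; sx++)
            {
                sumS += small[sy * smallW + sx];
                sumL += large[(y + sy) * largeW + (x + sx)];
            }
        }
        double meanS = sumS / n;
        double meanL = sumL / n;

        double cov = 0, varS = 0, varL = 0;
        for (int sy = 0; sy < smallH; sy++)
        {
            for (int sx = 0; sx < smallW; sx++)
            {
                double ds = small[sy * smallW + sx] - meanS;
                double dl = large[(y + sy) * largeW + (x + sx)] - meanL;
                cov += ds * dl;
                varS += ds * ds;
                varL += dl * dl;
            }
        }
        double denom = Math.Sqrt(varS * varL);
        return denom == 0 ? 0 : cov / denom;
    }
}
```

Because the means are subtracted, a window that is only brighter or darker than the snippet by a constant still scores 1, which a per-pixel threshold test would not tolerate.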

For best speed you can do this on a grayscale version of the image (the luminance or Y channel). For even more speed you can do this on a downsampled (low-res) version of the image first. Then, for the best matches (or candidates), you perform another refinement on the next higher resolution version. You successively refine the coordinates on higher-resolution images until you have the exact coordinates. You could even upsample the images to achieve sub-pixel accuracy.
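The two preprocessing steps can be sketched on the CPU as follows. This is a rough illustration, not a tuned implementation: the names `Pyramid`, `ToGray` and `Downsample2x` are invented here, the input is assumed to be tightly packed 24-bit data in B, G, R byte order, and the luminance uses integer approximations of the Rec. 601 weights.

```csharp
public static class Pyramid
{
    // 24-bit BGR (assumed byte order, no row padding) -> 8-bit luminance.
    // The weights 77/150/29 sum to 256, so white stays at 255.
    public static byte[] ToGray(byte[] bgr, int width, int height)
    {
        var gray = new byte[width * height];
        for (int i = 0; i < gray.Length; i++)
        {
            int b = bgr[3 * i];
            int g = bgr[3 * i + 1];
            int r = bgr[3 * i + 2];
            gray[i] = (byte)((77 * r + 150 * g + 29 * b) >> 8);
        }
        return gray;
    }

    // Halves the resolution by averaging 2x2 blocks (rounded).
    // An odd trailing row or column is dropped.
    public static byte[] Downsample2x(byte[] gray, int width, int height)
    {
        int outW = width / 2;
        int outH = height / 2;
        var result = new byte[outW * outH];
        for (int y = 0; y < outH; y++)
        {
            for (int x = 0; x < outW; x++)
            {
                int i = (2 * y) * width + 2 * x;
                int sum = gray[i] + gray[i + 1] + gray[i + width] + gray[i + width + 1];
                result[y * outW + x] = (byte)((sum + 2) >> 2);
            }
        }
        return result;
    }
}
```

A candidate found at (x, y) on a downsampled level corresponds to roughly (2x, 2y) on the next finer level, so the refinement only needs to search a small neighbourhood around that point.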


another thread about your problem:

Have you considered using ArrayFire (a CUDA GPU library) for pulling out the small image from the large image? Its functions for subscripting and image processing are very good.

Thank You for the answer! :)

Sorry, but it is a special project; I have to use C# and Cudafy.

I know there are other CUDA GPU libraries (Thrust, ArrayFire, OpenVidia, GpuCV…).

I have written some algorithms in Cudafy, for example: convolution matrix (3x3, 5x5), subtraction, fine matching (which uses the CPU for cropping the image), motion detection…

Except for fine matching, they all run in real time with a camera (~25 fps).

I only need to know how to index the two pictures.

I can only use grid dimensions X and Y (65535*65535) and block dimensions X and Y. My graphics card does not support a third grid dimension (Z).

Grid dimension X is (large picture width - small picture width) = width (blockIdx.x).

Grid dimension Y is (large picture height - small picture height) = height, multiplied by the small picture height (blockIdx.y). I had to fold the small picture height in here because I don't have the third dimension, Z.

Block dimension X is the small picture width (threadIdx.x).

The maximum block dimension is 512, so the small picture can be at most 512 pixels wide.

So the height of the small picture is folded into grid dimension Y.

I believe this is why I can't get the right indexes into the pictures. :(

Once it has the indexes, the program subtracts the two pictures, fills the histogram array using atomic adds, and “returns” the histogram.

The histogram array contains values, but not the correct ones.

I changed the CUDA code:


public static void gpuFineMatch(GThread thread, byte[] small, byte[] large, int[] histogram,
    int width  /* large picture width - small picture width (pixels) */,
    int height /* large picture height - small picture height (pixels) */,
    int treshold, int largeWidth, int smallWidth, int smallHeight)
{
    //--> Position of histogram and large picture
    int hx = thread.blockIdx.x;
    int hy = thread.blockIdx.y / smallHeight;

    //--> Position of small picture
    int sx = thread.threadIdx.x;
    int sy = thread.blockIdx.y % smallHeight;

    //--> Actual position of large picture
    int lx = hx + sx;
    int ly = hy + sy;

    //--> Offsets
    int histoOFF = hy * width + hx;
    int largePicOFF = ly * 3 * largeWidth + lx * 3;
    int smallPicOFF = sy * 3 * smallWidth + sx * 3;

    int b = (int)GMath.Abs((int)small[smallPicOFF] - (int)large[largePicOFF]);
    int g = (int)GMath.Abs((int)small[smallPicOFF + 1] - (int)large[largePicOFF + 1]);
    int r = (int)GMath.Abs((int)small[smallPicOFF + 2] - (int)large[largePicOFF + 2]);

    if (b < treshold && g < treshold && r < treshold)
    {
        thread.atomicAdd(ref histogram[histoOFF], 1);
    }
}
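The index arithmetic of this second kernel can be checked on the CPU without a GPU. The sketch below is only a test harness with made-up names (`IndexMapping`, `Decode`, `CoversEveryPairOnce`, `PaddedStride24`): it replays every (blockIdx.x, blockIdx.y, threadIdx.x) combination of a launch and confirms that each (offset, small-picture pixel) pair is visited exactly once. It also includes a helper for one possible cause of wrong counts that the post does not confirm: if the byte arrays were copied straight from a `System.Drawing.Bitmap` via `LockBits`, each row is padded to a multiple of 4 bytes, so the row offset would have to use that stride rather than `3 * width`.

```csharp
public static class IndexMapping
{
    // Mirrors the kernel's decomposition of the launch indices.
    public static void Decode(int blockIdxX, int blockIdxY, int threadIdxX, int smallHeight,
                              out int hx, out int hy, out int sx, out int sy)
    {
        hx = blockIdxX;               // histogram / large picture column
        hy = blockIdxY / smallHeight; // histogram / large picture row
        sx = threadIdxX;              // column inside the small picture
        sy = blockIdxY % smallHeight; // row inside the small picture
    }

    // Simulates a launch with grid (posW, posH * smallH) and block (smallW)
    // and reports whether every (hx, hy, sx, sy) is produced exactly once.
    public static bool CoversEveryPairOnce(int posW, int posH, int smallW, int smallH)
    {
        var visits = new int[posW * posH * smallW * smallH];
        for (int by = 0; by < posH * smallH; by++)
        {
            for (int bx = 0; bx < posW; bx++)
            {
                for (int tx = 0; tx < smallW; tx++)
                {
                    int hx, hy, sx, sy;
                    Decode(bx, by, tx, smallH, out hx, out hy, out sx, out sy);
                    visits[((hy * posW + hx) * smallH + sy) * smallW + sx]++;
                }
            }
        }
        foreach (int v in visits)
        {
            if (v != 1) return false;
        }
        return true;
    }

    // Row length in bytes of a 24-bit bitmap row padded to a 4-byte boundary.
    public static int PaddedStride24(int widthInPixels)
    {
        return (widthInPixels * 3 + 3) & ~3;
    }
}
```

For the example sizes, a 50-pixel row is 150 bytes of pixel data but 152 bytes with padding, and a 150-pixel row is 450 versus 452, so an offset computed from `3 * width` drifts by two bytes per row on padded data.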




Once I find the correct answer to my problem, I am going to try it with grayscale too and reduce the resolution of the pictures. I know that will be more efficient.

Thank You!

Atomics are slow (unless you have a new Kepler GPU device), so it’s best to use a parallel reduction to sum (or logically combine) the per-pixel results.

For parallel reduction the SDK contains sample code at various optimization levels, all of which use shared memory. The sample code operates on local 1D arrays of 512 elements, but your 2D arrays (images) could be flattened into a single 1D array for the reduction (e.g. a 32x16 block of image pixels maps onto 512 flat elements).
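The access pattern of such a reduction can be illustrated on the CPU. The following is a sequential simulation, not GPU code and not the SDK sample: the name `BlockReduction` is hypothetical, the `cache` array stands in for shared memory, and each pass of the inner loop represents what all threads of a block would do in parallel between two barriers.

```csharp
using System;

public static class BlockReduction
{
    // Sums per-pixel match flags (0 or 1) with a tree-shaped reduction:
    // in every step, "thread" i adds element i + stride to element i,
    // then the stride is halved. The length must be a power of two.
    public static int Sum(int[] flags)
    {
        int n = flags.Length;
        if (n == 0 || (n & (n - 1)) != 0)
        {
            throw new ArgumentException("length must be a power of two");
        }

        var cache = (int[])flags.Clone(); // stands in for the shared-memory array
        for (int stride = n / 2; stride > 0; stride /= 2)
        {
            // On the GPU these additions run in parallel, followed by a barrier.
            for (int i = 0; i < stride; i++)
            {
                cache[i] += cache[i + stride];
            }
        }
        return cache[0];
    }
}
```

In a kernel, each thread would write its own 0/1 result into the shared array, the block would reduce it this way, and only thread 0 would then add the block total to the histogram, which replaces one atomic add per matching pixel with one per block. A block width that is not a power of two (such as 50) would need the shared array padded with zeros up to the next power of two.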

Here’s a solution for your problem from Roger Dahl. It comes with source code and win64 executables.

It computes the sum of absolute differences for all pixel offsets on a grayscale version and returns the position where this sum is minimized. Maybe you can learn from the code.
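For readers without access to that code, the technique itself is easy to sketch on the CPU. The example below is an independent illustration, not Roger Dahl's implementation: the names `SadSearch` and `FindBest` are invented, and the images are assumed to be 8-bit grayscale and tightly packed.

```csharp
using System;

public static class SadSearch
{
    // Exhaustive search: computes the sum of absolute differences (SAD)
    // between `small` and every same-sized window of `large`, and returns
    // the top-left corner of the window with the lowest SAD.
    // On ties the first offset in row-major order wins.
    public static void FindBest(byte[] small, int smallW, int smallH,
                                byte[] large, int largeW, int largeH,
                                out int bestX, out int bestY)
    {
        long best = long.MaxValue;
        bestX = -1;
        bestY = -1;
        for (int y = 0; y <= largeH - smallH; y++)
        {
            for (int x = 0; x <= largeW - smallW; x++)
            {
                long sad = 0;
                for (int sy = 0; sy < smallH; sy++)
                {
                    for (int sx = 0; sx < smallW; sx++)
                    {
                        sad += Math.Abs(small[sy * smallW + sx]
                                        - large[(y + sy) * largeW + (x + sx)]);
                    }
                }
                if (sad < best)
                {
                    best = sad;
                    bestX = x;
                    bestY = y;
                }
            }
        }
    }
}
```

Unlike the threshold-based histogram, SAD needs no tuning parameter: an exact copy gives 0, and the best position is simply the minimum.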


That link does not work. I am looking for an answer to the same general problem. Do you have a different link Christian?