Is there a Map-Reduce feature/SDK in CUDA? Say I want to compute the sum of differences between two matrices

I am interested in calculating the PSNR between 2 images, which is practically the “sum” of differences between the 2 images. Rather than computing the difference matrix and then serially going over it to calculate the sum, it would be nice to let the GPU do this in stages: each block calculates its part of the difference, the partial results form a new, smaller matrix, and the process repeats again and again until there is one element, which is the total sum.
Does anybody know of such a feature/sdk?

Ehh, a kernel that subtracts the two followed by a reduction would do just that?

The first one is just matrix addition (with negative sign)

The second one is in the SDK.

Where exactly in the sdk is the reduction located?

And for the first part (matrix subtraction), did you mean that all I need to do is write a kernel that subtracts the matrices?

Thanks for the help,


The reduction example is in: \NVIDIA CUDA SDK\projects\reduction

It’s templated so you can easily incorporate it into whatever you need - just copy-paste or simply include.

Yes, you write a simple kernel to get the difference. One thread = one pixel.

If your images are allocated as 1D arrays of pixels and they are single channel, it might be as easy as

__global__ void getDifference(float imageA[], float imageB[], float difference[])
{
	unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
	difference[tid] = (imageA[tid] - imageB[tid]) * (imageA[tid] - imageB[tid]); // squared difference
}
Colour images are trivial as well; you just do the above for each channel, obviously.

Then you take the difference array, do a reduction over it (using the summation operator) and finally divide by the total number of pixels - that’s your MSE and it’s downhill from there.

Would you believe: in the directory “reduction”. It even comes with a nice pdf (in the reduction/doc directory) that explains everything in detail.

I’m sure he meant that you can subtract and then reduce in the same kernel. There is no need to dump just a simple subtraction to slow global memory and then launch a new kernel only to reload those same values from memory.
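A minimal sketch of that fused approach (kernel name and structure are my own; it assumes single-channel float images and a power-of-two block size, and the grid-stride loop means it works for any image size):

```cuda
// Each thread accumulates squared differences into a register on the fly
// (no intermediate difference array in global memory), then the block
// reduces them in shared memory. One partial sum per block is written out,
// to be reduced again or simply summed on the host.
__global__ void diffAndReduce(const float *imageA, const float *imageB,
                              float *blockSums, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    float sum = 0.0f;

    // grid-stride loop: handles any n, not just power-of-two sizes
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        float d = imageA[i] - imageB[i];
        sum += d * d;
    }
    sdata[tid] = sum;
    __syncthreads();

    // standard shared-memory tree reduction (blockDim.x a power of two)
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}
```

Launch it with shared memory sized to the block, e.g. `diffAndReduce<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(...)`, then finish the partial sums however you like.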

10x a lot.
Surprisingly, the CUDA forums are some of the best I’ve seen.
People here are a great help.

Well, make sure you don’t read from global memory twice. Read imageX[tid] once into a register and square afterwards.

Sorry to reanimate this old thread… I’ve just started to look at the reduce sample for use in my code.

Has anyone improved the SDK reduction sample so it can deal with non-power-of-two input size, or is that still left as an exercise for the reader?

Also the CPU side of the reduce sample is a little convoluted; it’s buried in the runTest function rather than being directly callable. I wonder if anyone has “productized” it a little more.

You might want to take a look at thrust. It is geared at being more developer-friendly than some of the SDK examples.
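To give a flavour of what that looks like here (my own sketch, not from the Thrust examples): `thrust::inner_product` with a custom binary op computes the whole sum of squared differences in one call, no hand-written kernel at all.

```cuda
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>

// Binary functor returning the squared difference of two pixels
struct squared_diff
{
    __host__ __device__
    float operator()(float a, float b) const
    {
        float d = a - b;
        return d * d;
    }
};

// inner_product applies squared_diff element-wise and reduces with plus,
// i.e. the "map" and the "reduce" fused into a single call.
float sumSquaredDiff(const thrust::device_vector<float> &imageA,
                     const thrust::device_vector<float> &imageB)
{
    return thrust::inner_product(imageA.begin(), imageA.end(),
                                 imageB.begin(), 0.0f,
                                 thrust::plus<float>(), squared_diff());
}
```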

Thanks, Gregory! Thrust looks really easy to use. I’m checking it out now. Is the reduce performance supposed to be comparable? Are there any benchmarks around? (I know it’s early days for this kind of thing yet, but one never knows…)

I wrote a map-reduce kernel based upon the SDK reduction example and the THRUST version was just as fast and WAY easier to use. I highly recommend looking into THRUST.