Paralel Reduction With less than 8000 values

Good Morning,
I have finished porting a few functions to CUDA and have found the following:
A section of code that is dedicated to perform a paralel reduction is consuming 47% of the total execution time according to the visual cuda Profiler.
The problem is as follows: I have a 6000x65 Matrix, and I have to sum the values of all the columns. I have done this extremely naively because I have seen that the paralel reduction with few values (6000) doesnt perform well, so my extremely naive solution consists of the following:
65 threads that perform f(i=0;i<6000;i++)sum+=value[i];result=sum;
I have tried integrating cudpp, but cant get it to work with my matlab mexfile with cuda.
Any advice, or way to accelerate things?
Thanks in advance, David


What are your kernel configuration???

runKernel<<< 1, 65 >>( … ) ???

this is way under-utilizing the GPU. more over you’re probably not coalesced if you have this for in your kernel code.

I suggest you re-read the Programming guide regarding coallescing and how to access memory efficiently on the GPU.

The SDK also contains a very detailed reduction sample + a PDF with it.

hope that helps…


Good afternoon,

I have used the reduction example given in the SDK but do not get a speedup because values are few (6000x65).

The times on each function are:

0.18 [salidaSumaAcumuladaC]= sumaAcumuladaC(betaPc);

1.72 [salidaSumaAcumuladaCUDA]= sumaAcumuladaCUDA(betaPcuda);

2.09 [salidaSumaAcumuladaCUDAMatriz]= sumaAcumuladaReduccionMatriz(betaPcuda);

2.30 [salidaSumaAcumuladaReductionCUDAStream]= sumaAcumuladaReduccionStream(betaPcuda);

sumaAcumuladaC is a Mex File written in C.

sumaAcumuladaCUDA is a Mex File in CUDA written as mentioned above

sumaAcumuladaReduccionMatrizm & sumaAcumuladaReduccionStream are both written using the reduction example in the cuda SDK, one using streams and the other not.

What I do is:

for (int i=0; i<numColumns;i++){

     do reduction on each row.


But as can be seen, there is no speedup. That is why the first version in cuda is better, although totaly unoptimized…

When tested on vectors of 6000000, the speedup is marvelous on reduction cases, but as i said before, I have few elements…

I will upload the code in case anyone sees anything wrong.

Thanks in advance, and please, if anyone can help to perform paralel reduction on matrixes of 6000x65 for example, I would be extremely greatful :)

Thanks again, David (9.99 KB)

Please, anyone? It just seems a pity that this function as is, takes 48 % of all computation time…

There’s an interesting explanation of how to optimize a pararalel reduction here. Maybe you should change it to your particular needs.

Goodafternoon Timon,

my solution is based on this example using reduction version 6. It works perfectly, and is extremely fast with large data sets, but as Ive said before, with only 6000 elements, it is far from optimal. The source code provided shows how reduction6 kernel has been used.

Has anyone had any experience performing reduction with less than 6000 elements with any speedup???

Thanks again to everyone, David

The obvious thing to do is use more than a single block and 65 threads (which is probably about the worst possible choice). You want columnwise sums. So why not use a block with a sensible number of threads (say 128) per column rather than a single thread and perform a “proper” parallel reduction on each column that way? Memory reads can be coalesced and occupancy will be much better. I would be surprised if it took more than a millisecond.

I’m not sure if that’s the best solution, but I would call 65 blocks that reduce 6000 elements each one.
For doing that you just have to write and call a single kernel.
At the beginning it would be serial reduction for each thread.

I think this is basically what avidday is saying

Hi avidday, as indicated in previous thread, i have used the reduction example provided, which launches 128 threads in 6000/128 blocks to perform the paralel reduction as reduction6 kernel in sdk indications, but as indicated, still seems slow compared with CPU version. With 1000000 elements, the paralel reduction is impressive, but with 6000, i get the following times:

CPU time: Elapsed time is 0.000075 seconds.

Naive time: Elapsed time is 0.016455 seconds.

Paralel reduction sdk example time: Elapsed time is 0.005071 seconds.

Yes, i have a speedup of x3, over my original code, but i still need to get the whole matrix in, and get results right. Once I get everything ok, i will keep everyone informed. Thanks in advance, David

what i proposed (and i think avidday too) is not just calling more threads, but calling more blocks.

What you’re doing (and correct me if I’m wrong) is calling the kernel 65 times (one per column).

But what you should do is call the kernel just once, with 65 blocks. You don’t have to know the number of threads while writing the kernel.

The kernel would by something like this (i didn’t test it nor compile it):

template <unsigned int blockSize>

__global__ void matrixReduction(double** inputMatrix, double* outputVector, unsigned columnSize)


	extern __shared__ double sdata[];

	unsigned tid = threadIdx.x;

	unsigned column = blockIdx.x;

	double result = 0;

	unsigned i = tid;

	while (i < columnSize){

		result += inputMatrix[column][i];

		i += blockSize;


	sdata[tid] = result;


	if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }

	if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }

	if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

	if (tid < 32) {

		if (blockSize >= 64) sdata[tid] += sdata[tid + 32];

		if (blockSize >= 32) sdata[tid] += sdata[tid + 16];

		if (blockSize >= 16) sdata[tid] += sdata[tid + 8];

		if (blockSize >= 8) sdata[tid] += sdata[tid + 4];

		if (blockSize >= 4) sdata[tid] += sdata[tid + 2];

		if (blockSize >= 2) sdata[tid] += sdata[tid + 1];


	if (tid == 0) outputVector[column] = sdata[tid]


Then you call the kernel with 65 blocks and you can try different number of threads per block (512, 128 …).

I’m working with neural nets and I’m doing something similar (one Block per output neuron), so I’m interested in knowing If that’s a good solution.

I had a topic here, but nobody seems to be interested.

I hope this helps you.

I don’t think you understand what I am suggesting, or how the parallel reduction can be used in this case.

This is (I am guessing) your “naïve” approach, which you launch with 1 block:

__global__ void columnsumnaive(const int m, const int lda, 

				double *in, double *out)


	unsigned int idx = __mul24(threadIdx.x, lda);

	double sum	   = in[idx];

	// Loop through the column and sum

	// all the entries. (ssslllloooowwww.....)

	for(int i=1; i<m; i++)

		sum += in[idx+i];

	out[threadIdx.x] = sum;


and this is a shared memory parallel summation, with in-kernel reduction in shared memory which should be launched with one block per column:

__global__ void columnsumreduction(const int m, const int lda, 

				double *in, double *out)


	extern __shared__  double buff[];

	unsigned int tidx   = threadIdx.x;

	unsigned int idx	= tidx + __mul24(blockIdx.x, lda);

	unsigned int idxmax = m + __mul24(blockIdx.x, lda);

	buff[tidx] = 0.;

	// Compute partial sums down the column until we

	// have covered the whole column

	while (idx < idxmax) {

		buff[tidx] += in[idx];

		idx += blockDim.x;



	// Parallel reduction of the partial sum 

	// in shared memory

	if (blockDim.x == 512) {

		if (tidx < 256) 

			buff[tidx] += buff[tidx + 256];



	if (blockDim.x >= 256) {

		if (tidx < 128) 

			buff[tidx] += buff[tidx + 128];



	if (blockDim.x >= 128) {

		if (tidx < 64) 

			buff[tidx] += buff[tidx + 64];




	if (tidx < 32) {

		if (blockDim.x >=  64) {

			buff[tidx] += buff[tidx + 32];


		if (blockDim.x >=  32) {

			buff[tidx] += buff[tidx + 16];


		if (blockDim.x >=  16) {

			buff[tidx] += buff[tidx + 8];


		if (blockDim.x >=   8) {

			buff[tidx] += buff[tidx + 4];


		if (blockDim.x >=   4) {

			buff[tidx] += buff[tidx + 2];


		if (blockDim.x >=   2) {

			buff[tidx] += buff[tidx + 1];




	// write result for this block to global mem 

	if (tidx == 0)

		out[blockIdx.x] = buff[0];



the difference in execution time is between the two of them is more than 40x on my GTX275 (one block with 65 threads for the naïve case, and 65 blocks with 128 threads for the reduction):

avid@cuda:~/code/columnsum$ ./columnsum 

Column summation on a random double precision array

For m=6000 n=65

Reduction summation: time=6.681600e-02 ms, deviation from cpu=0.000000e+00

Naive summation:	 time=2.820704e+00 ms, deviation from cpu=0.000000e+00

That is it takes less than 0.07ms to do your 6000 row x 65 column example in double precision for column-major ordered storage (and I should add there are probably pretty terrible shared memory bank conflicts and serialization in the reduction code I posted).

avidday’s solution is very similar to what i proposed.

Using templates the if conditions are evaluated at compile time and not at running time.

Another difference is that I store the partial sums in a local variable result and he uses the shared memory directly.

I don’t know (and i’m very interested in knowing it, but I’m stuck with my own experiments) if this result variable is stored in a register or in local (global) memory. If it’s stored in global memory, then it’s better to store the partial results directly in shared memory.

I don’t understand these two lines pretty well, avidday:

unsigned int idx	= tidx + __mul24(blockIdx.x, lda);

	unsigned int idxmax = m + __mul24(blockIdx.x, lda);

I assume they’re just related with selecting the column.

In my version, I use a matrix instead of a vector.

I think it’s important to chose between

result += inputMatrix[column][i];


result += inputMatrix[i][column];

I think the first one will get more coalesced accesses. Specially if you declare your matrix dynamically (in the host code) like this:

double** inputMatrix = new double*[65];

for (unsigned i=0; i<65; i++){

inputMatrix[i] = new double[6000];


I’ll wait for your tests and answer, david.

In your version, the partial sum variable should be a register. You can compile with nvcc to ptx and check the output, but there isn’t a reason why it wouldn’t be a register that I can see. The two lines of code you asked about are just the indexing scheme for a given column in pitched linear memory. In practice, I doubt anybody would use 2D C arrays in preference to indexing into 1D memory in linear algebra codes (please don’t call them “vectors” and “matrices” because they aren’t). There are too many overheads associated with allocation and copying (and potentially a second layer of pointer indirection which can be very slow on the GPU).

Good afternoon Avidday,

I have tried your solution in matlab, but the results are not the same, when i increment the number of threads…

Only if I launch 65 blocks with 1 thread, do I get the same results… I am sorry to be a pest, but is it possible to indicate whith what parameters did you launch the kernel?

columnsumreduction<<< numBlocks, numThreads >>>(m,lda,matrizA_gpu, matrizE_gpu);

I Only get the same result when i Lauch 65 blocks with 1 thread, when i try 65 blocks with 128 threads, i dont get the same results. If it where possible to indicate which numBlocks, numThreads, lda and m you are using, I would be greatful, since I have tried the logical values, and they havent worked for me. Thanks avidday for your code, David

Perhaps you missed the fact that there is dynamic shared memory in the kernel. I don’t see any declaration of that in you code snippet above. It should be size of the type (double in this case) times the number of threads per block. Also, the number of threads per block must be a power of 2 for the shared memory reduction to work correctly.

Hi avidday, I have done as said, and it now works with the followong code:

int numThreads=128;

int smemSize = numThreads* sizeof(double);

columnsumreduction<<< nMatrizA, numThreads,smemSize >>>(mMatrizA,mMatrizA,matrizA_gpu, matrizE_gpu);


But the results are slightly different:


salidaC =


Elapsed time is 0.002042 seconds.

salidaCUDAMatrizAvidday =


Elapsed time is 0.002683 seconds.

As you can see, the results differ slightly compared with C version…Even though my numThreads = 128, mMatriz=6000 is not a power of two. So that could be one of the reasons why the result is not exact. Unfortunately the paralel sum is in between various calculations, and I cannot perform padding, which is why i presume you reference lda instead of just working with m. I am sorry to be such a pest, but I just wanted to make sure all presumptions were correct. Thanks a lor for all the help given to respond, David

Hi Timon, I just wanted to say thanks for your help. As avidday said, I only use 1D linear arrrays because it has faster, and is just as simple as using 2D array indexes. Also I am working with matlab, so I get the mxArray pointer from the matlab runtime, and copy that directly to GPU, thus avoiding having to convert to 2D arrays. And the “other” problem why I cant use our coding is because this is part of a series of calculations performed before hand, and aviddays solution is just better suited for my particular needs, although both versions are perfectly valid :)

Thanks a lot for all your help Timon, David

I don’t understand how you are getting different results. I wrote a simple test harness which verified the results were identical between the two kernels I posted and a cpu implementation of the same calculation when I used a matrix filled with random numbers on [-1,1] (the cpu version using x86_64 with sse2 so that the computation uses proper IEEE-754 double precision). If you are getting a different answer, then something is wrong. Try testing the kernels on their own against some known data.

I also question your timings, if they are supposed to be the kernel execution times of the two versions. I used CUDA events to time the kernels (which are supposed to be accurate to about half a micro second), and I get a very repeatable 45 times speed difference between the two.

The power of two requirement has nothing to do with the size of the matrix, only the number of threads per block. If you look at the code, you will see that the partial summation section is capable of processing any length column with any pitch/padding (which is all the lda parameter is for). The power of two requirement only comes into play when the kernel performs the reduction from the completed partial sums in shared memory.

You should use avidday’s version.

But there are two things that are better in mine’s and you can use too.

  1. use templates (blockSize) instead of blockDim.x (just as they do in the reduction pdf and I do in my version) and the if statements will be processed at compile time instead of running time.

  2. use a variable result instead of buff[tidx] directly and you will be using register memory instead of shared for all the loop.

Change this:

buff[tidx] = 0.;

	// Compute partial sums down the column until we

	// have covered the whole column

	while (idx < idxmax) {

		buff[tidx] += in[idx];

		idx += blockDim.x;


for this:

double result = 0;

	// Compute partial sums down the column until we

	// have covered the whole column

	while (idx < idxmax) {

		result += in[idx];

		idx += blockDim.x;


	buff[tidx] = result;

Good luck

Good Morning avvidday, I am working with matlab, and it is where i perform the timing, that is why it takes a lot more, as i am not only timing kernel execution time, but also copies to and from the gpu, and matlab mexfile invocation times, so it is comprehensible that the timing should be much larger.

As for the different results, i am using format long with Matlab, which shows me 16 digits of the results. I am working on Win XP 32 bit, with Matlab 7.5. Yo can see the following results that show the difference. But considering that there may be a last digit rounding (double precision sum: 0 (IEEE-754 round-to-nearest-even)), that may the reason…And when the sum is acumulative, the error may increase. If not, I do not understand the reason for the error…

With your naive version, results are exact. Using the reduction example, results are exact if 65 blocks are launched with 1 thread.

However, When I launch with the following numbers, the error appears:

int numThreads=128;

int smemSize = numThreads* sizeof(double);

columnsumreduction<<< 65, numThreads,smemSize >>>(6000,6000,matrizA_gpu, matrizE_gpu);


If any of the params are wrong, please advice me!!!

On a different note, I have launched (changing above params obviously) with this matrix (3x1)

A=[1.123321312314323423 ;3.32133454653323434245345 ;2.7845612311234234225654];

And results are:


salidaSumaAcumuladaC =


Elapsed time is 0.000288 seconds.

salidaSumaAcumuladaCUDA =


Elapsed time is 0.000645 seconds.

salidaSumaAcumuladaReduccion =


Elapsed time is 0.000158 seconds.

A last push would be great !!! Thanks again avidday, David