Parallel sum reduction 2D

Hello everyone!

I have a kernel with these parameters:
blocksize (64, 1)

I also have a shared memory:
__shared__ float vals[64][36];

What I want to accomplish is very simple: I want to sum all the rows for each column.
So:
thread0 will sum vals[0][0] + vals[1][0] + … + vals[63][0]
thread1 will sum vals[0][1] + vals[1][1] + … + vals[63][1]

You have probably noticed that there are 36 columns instead of 64. That's because I wanted to reduce my shared memory usage for my specific problem, so I have a "mapping" array that maps threadIdx.x to an index in the range [0, 35]. My code actually looks like this:

 float sum = 0.0f;
 for (int j = 0; j < 64; j++)
     sum += vals[j][mapping[localId]]; // ignore the "mapping" array for now
 global_mem[outputIndex] = sum;

Each thread sums all the rows of its corresponding column and writes the result to global memory (outputIndex is a transposed index: (y * width) + x).
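Putting it together, here is a self-contained sketch of what I described. The fill step and the exact output indexing are simplified placeholders (columnSumKernel and its signature are illustrative, not my real code):

__constant__ int mapping[64];  // the symmetric-index table shown further down

__global__ void columnSumKernel(float* global_mem, int width)
{
    __shared__ float vals[64][36];
    int localId = threadIdx.x;  // blocksize is (64, 1)

    // ... vals[][] is filled by the earlier part of the kernel ...
    __syncthreads();

    // each thread sums its (mapped) column over all 64 rows
    float sum = 0.0f;
    for (int j = 0; j < 64; j++)
        sum += vals[j][mapping[localId]];

    // transposed output index: (y * width) + x
    int outputIndex = localId * width + blockIdx.x;
    global_mem[outputIndex] = sum;
}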

I was thinking of optimizing this. I cannot use warp shuffling because I think it would be way too heavy for this problem, correct? I had implemented it, and as expected my compute throughput went up to almost 100%! I also tried libraries like "cub", with the same result. In this case, is it possible that a simple non-reduction sum is faster? I keep thinking there must be another way to do this, but I can't figure it out!

Ignoring the "mapping" array, what you have depicted is a column sum per thread.

That is generally done most efficiently with a simple loop per thread, more or less as you have depicted. I'm not surprised that you can't do better than a simple loop.

Certainly the mapping array may affect things (compared to the case where you are actually summing columns), but with no further information there isn't any reason to assume you can do better than the naive column sum even in that case. Summing with indirect or "mapped" access to shared memory like this can introduce bank conflicts, but that problem is data-dependent, with no obvious way to sort it out in the general case.


Thanks for the quick reply! I was really disappointed because I thought there MUST be a way to make this faster, haha. So I guess I am already efficient then.

For reference, this is the "mapping" array:

// for each of the 64 (8x8) thread positions, the packed index of the
// unique symmetric entry (36 values: lower triangle plus diagonal)
__constant__ int mapping[64] =
{
	0,  1,  2,  3,  4,  5,  6,  7,
	1,  8,  9,  10, 11, 12, 13, 14,
	2,  9,  15, 16, 17, 18, 19, 20,
	3,  10, 16, 21, 22, 23, 24, 25,
	4,  11, 17, 22, 26, 27, 28, 29,
	5,  12, 18, 23, 27, 30, 31, 32,
	6,  13, 19, 24, 28, 31, 33, 34,
	7,  14, 20, 25, 29, 32, 34, 35
};

What it does (assuming column-major ordering) is map the index of an "upper triangle" value to its corresponding "mirrored" lower-triangle index. The reason is that I reduced the shared memory from [64][64] to [64][36]: in my problem the upper-triangle values are identical to their mirrored counterparts, so there is no need to store them.
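If it is easier to read programmatically: the same table comes out of the standard packed-triangle index formula. A sketch (treating the thread index as (row, col) = (localId / 8, localId % 8); packedIndex is just an illustrative helper):

#include <cstdio>

// packed index of the unique symmetric entry for position (r, c) of an
// 8x8 symmetric matrix: for r <= c it is r*8 - r*(r+1)/2 + c; for r > c,
// swap r and c (the mirrored entry holds the same value)
int packedIndex(int r, int c)
{
    if (r > c) { int t = r; r = c; c = t; }
    return r * 8 - r * (r + 1) / 2 + c;
}

int main()
{
    // prints the same 8x8 table as the __constant__ array above
    for (int i = 0; i < 64; i++)
        printf("%2d%s", packedIndex(i / 8, i % 8), (i % 8 == 7) ? ",\n" : ", ");
    return 0;
}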

So with a fixed constant pattern in the mapping array, or stated alternatively, "summing the columns of a symmetric matrix", someone may come up with a clever optimization.
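One thing that is easy to check now that the pattern is fixed is whether the mapped access conflicts in shared memory at all. A host-side sketch of such a check (assuming 32 four-byte banks; within a warp every thread uses the same row j, so only the column determines the bank, and lanes reading the same column are served by broadcast):

#include <cstdio>

int main()
{
    const int mapping[64] = {   // the table posted above
        0, 1,  2,  3,  4,  5,  6,  7,    1, 8,  9,  10, 11, 12, 13, 14,
        2, 9,  15, 16, 17, 18, 19, 20,   3, 10, 16, 21, 22, 23, 24, 25,
        4, 11, 17, 22, 26, 27, 28, 29,   5, 12, 18, 23, 27, 30, 31, 32,
        6, 13, 19, 24, 28, 31, 33, 34,   7, 14, 20, 25, 29, 32, 34, 35 };

    for (int warp = 0; warp < 2; warp++) {
        int distinctPerBank[32] = {0};
        bool seenCol[36] = {false};
        int worst = 0;
        for (int lane = 0; lane < 32; lane++) {
            int col = mapping[warp * 32 + lane];
            if (seenCol[col]) continue;   // duplicate column: broadcast, no conflict
            seenCol[col] = true;
            int bank = col % 32;          // the warp-uniform row offset only rotates banks
            if (++distinctPerBank[bank] > worst) worst = distinctPerBank[bank];
        }
        printf("warp %d: worst case is a %d-way access\n", warp, worst);
    }
    return 0;
}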


Hmm, I get it. I may try to optimize it if I can. Thank you for the answers, Robert!

I would calculate the sums with 36 threads and then redistribute the results afterwards, instead of doing the work with 64 threads. Of course, only if you can actually make use of that improvement, since 64 = 2 * 32 with 32 threads per warp (and 36 threads still occupy two warps).

Wouldn't we lose parallelism though?

You only have 36 unique calculations?

Yeah, you are right, I have 36 unique calculations. Even though 64 threads are looping, some of the sums (28 of them) are calculated twice. Hmm.

You can either accept it (perhaps that is not relevant to your final performance) or try to rearrange computations.

E.g. if several blocks are running per SM, say 8 blocks:

Combine them into one block of 8 * 64 = 512 threads.

Do the 8 * 36 = 288 = 9 * 32 summations with 9 warps (instead of 8 * 64 = 512 summations with 16 warps).

Of course, you have to put additional effort into it to prevent shared memory bank conflicts.

The performance gains would be better use of the arithmetic units (for the summation) and better use of shared memory bandwidth. Whether those are your bottlenecks, you know better.
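A rough sketch of the single-block version of the idea (columnSum36 and the output indexing are illustrative): the first 36 threads compute the unique sums into a small shared buffer, then all 64 threads fan the results back out through the mapping.

__constant__ int mapping[64];  // the table posted above

__global__ void columnSum36(float* global_mem, int width)
{
    __shared__ float vals[64][36];
    __shared__ float colSum[36];
    int localId = threadIdx.x;

    // ... vals[][] filled as before ...
    __syncthreads();

    if (localId < 36) {                 // only the 36 unique sums are computed
        float sum = 0.0f;
        for (int j = 0; j < 64; j++)
            sum += vals[j][localId];    // direct column access, no mapping needed
        colSum[localId] = sum;
    }
    __syncthreads();

    // all 64 threads pick up their (possibly mirrored) result
    int outputIndex = localId * width + blockIdx.x;  // illustrative, as before
    global_mem[outputIndex] = colSum[mapping[localId]];
}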

Thank you for the detailed explanation. I think that would be too much effort, so I am going to stick with the simple summation. Maybe I will find a way to avoid recalculating values; the optimization with the mapping is good enough for now, and I will try to improve it.