Reduce choice

In my case, I will have an array A (length N) to store my result. However, for some reason, each element of A needs to be computed by K (not fixed, but less than 6) thread blocks, as the sum of one value per block. I just wonder which way makes this kernel fastest?

  1. Use atomicAdd
  2. Use a very large buffer (length K * N) to store the results without any conflict. After the first kernel finishes, launch another reduction kernel to sum them.

If the second way is better, should I write the kernel manually or use some library? Which library can optimize this case best (as K is relatively small)?

So your final result is N values?

  3. In theory you also have a third possibility: have one actual thread block (possibly in a loop) compute the work of all K thread blocks and keep the intermediate sum in registers or shared memory (those memory accesses are more performant, and so are atomic operations in shared memory).

Best try it out for your architecture.

  1. atomicAdd is reasonably fast on recent architectures.
  2. This should be very easy and short to do, even manually; otherwise you could e.g. use Thrust (see the sketch at the end of this post).

It also depends on your requirements: latency vs. bandwidth.

You can probably program both summation solutions in 10 minutes altogether.

2b) If you need the result only once in another kernel, you could move the final addition there instead.
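
For completeness, here is a sketch of how the second reduction could look with Thrust instead of a hand-written kernel. This is only an illustration under assumptions not stated above: the K * N buffer is taken to hold float partial sums laid out as [N][K] (each element's K partials contiguous), and all names are made up.

#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

struct SegmentKey {                       // maps flat index i to its output element i / K
    int K;
    __host__ __device__ int operator()(int i) const { return i / K; }
};

void reduce_partials(const thrust::device_vector<float>& partials,  // size K * N
                     thrust::device_vector<float>& A,               // size N
                     int N, int K)
{
    auto keys = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                SegmentKey{K});
    thrust::reduce_by_key(keys, keys + N * K,                // segment id = i / K
                          partials.begin(),                  // partial sums
                          thrust::make_discard_iterator(),   // segment ids not needed
                          A.begin());                        // N summed results
}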

Yes, the final result is N values. I run my code on an A100.
As N can range from 5e4 to 1e7, I don't know how to make a reasonable split of all N elements and assign each part to a thread block. Or, in short, how can I easily decide how many elements a thread block should deal with? Just tune for every size?

I don't understand why atomicAdd might be faster than using another kernel to reduce. Doesn't atomicAdd need to fetch data from gmem and write it back? The number of memory accesses and computations is the same as for a reduction, and it even has the atomic overhead on top.

Solution 1

atomicAdd (for supported data types) is executed inside the memory architecture. There are small addition engines integrated. So the data is not transferred to the threads and back.

The difference from some of the reduce instructions is that you do not use a return value in the thread; you just send the result into the memory architecture.
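
As a minimal, hypothetical sketch of how solution 1 could end the first kernel (it assumes each thread contributes one float, each block produces one partial sum, every K consecutive blocks feed the same output element, and out is zeroed beforehand; your actual mapping may differ):

__global__ void partial_sum_atomic(const float* in, float* out, int K)
{
    __shared__ float blockSum;                      // per-block partial sum
    if (threadIdx.x == 0) blockSum = 0.0f;
    __syncthreads();

    float v = in[blockIdx.x * blockDim.x + threadIdx.x];
    atomicAdd(&blockSum, v);                        // shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(&out[blockIdx.x / K], blockSum);  // return value is ignored
}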

Solution 2

You could access the array as:
int[N/32][K][32]

That is good for storing and reading. Both are coalesced. And with larger N, the array just gets larger. You could just allocate a sufficiently large amount, e.g. 1 GiB.

int[K][N] would also work, but be more cumbersome for changing N.

You can use any type other than int as well, even 8-byte types.
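
To illustrate the index math for that layout, flattened into a single 1D allocation (names are illustrative; lane n % 32 varies fastest, so 32 consecutive n for a fixed k form one contiguous 128-byte line):

__device__ __forceinline__ size_t buf_index(size_t n, int k, int K)
{
    size_t chunk = n / 32;                        // which group of 32 elements
    int    lane  = (int)(n % 32);                 // position inside the group
    return (chunk * (size_t)K + k) * 32 + lane;   // [N/32][K][32] order
}

// first kernel:   buffer[buf_index(n, k, K)] = partial;
// reduce kernel:  sum += buffer[buf_index(n, k, K)];   // for k = 0 .. K-1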

The data type is half. In my first kernel, each thread block deals with 128 * 128 B of data (within each 128 B chunk the data is consecutive, but there is no guarantee across different 128 B chunks).

atomicAdd supports half2. That optimizes the memory accesses compared to half.
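
For illustration, a minimal hypothetical use of the half2 overload (compute capability 6.x or newer, cuda_fp16.h included; it assumes a block's two partial sums belong to two adjacent output elements, which may or may not match your mapping):

#include <cuda_fp16.h>

// pairSum holds a block's partial sums for output elements 2*j and 2*j+1
__device__ void add_pair(half2* A2, int j, half2 pairSum)
{
    atomicAdd(&A2[j], pairSum);   // updates both half lanes with one call
}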

If you do not know beforehand where the 128 B blocks go (especially to which other 128 B blocks they are added), solution 2 could be more complicated. Just start with solution 1. To determine the optimization potential without atomics, you could implement solution 2 as if the blocks were consecutive and deterministic. That would be an upper bound for the speed of solution 2.

Important

Also consider whether you want to add up half values. The numerical errors of floating-point data types will accumulate. You could convert to (at least) float.

Normally to avoid numerical errors, one could add up numbers like a tree. That is possible with solution 2.
(Even better would be a sorted tree. But let’s not overdo it.)
atomicAdd would add up linearly instead.

I don't understand the meaning of int[N/32][K][32]. Do you mean I should make each warp just sum up 32*K elements?

A warp could store 32 half2 elements (=64 half elements) with one instruction. That would be the 128 bytes you mentioned.

It could store additional data in further instructions or loop iterations.

(It is not the only way to partition the memory array.)

For summing up, each warp could do 32 half2 sums, each over K values. Or it could do more in a (e.g. grid-stride) loop.
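
A hypothetical grid-stride version of that summation, assuming the buffer holds half2 values in the [numHalf2/32][K][32] order from above (flattened to 1D) and accumulating in float:

#include <cuda_fp16.h>

__global__ void sum_partials(const half2* buf, half2* out, size_t numHalf2, int K)
{
    // each thread owns one half2 output; the grid-stride loop covers any N
    for (size_t n = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         n < numHalf2;
         n += (size_t)gridDim.x * blockDim.x)
    {
        float sx = 0.0f, sy = 0.0f;
        for (int k = 0; k < K; k++) {
            half2 t = buf[(n / 32 * K + k) * 32 + n % 32];   // [.][K][32] layout
            sx += __half2float(t.x);
            sy += __half2float(t.y);
        }
        out[n] = __floats2half2_rn(sx, sy);
    }
}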

I mean, I can't decide how many threads a thread block should have for this operation. Also, how many elements a single thread block should deal with is a problem. I think this might be a very simple question, but I don't have a tutorial for it.

You mean for solution 2 and the adding up part?
Each addition of K elements is independent.
You could use the maximum possible number of threads, e.g. 2 thread blocks per SM and 1024 threads per block. I think the A100 has a maximum of 2048 threads per SM?

But the problem is probably memory bound, and fewer threads would give the same performance. Use fewer threads if you can do other compute work (other kernels) in parallel.
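
As a rough sketch of such a launch configuration, reusing the hypothetical sum_partials kernel from above (2 blocks per SM is just the starting point mentioned here, not a tuned value):

int device = 0, numSMs = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

int threads = 1024;          // 2 blocks x 1024 threads = 2048 threads per SM
int blocks  = 2 * numSMs;    // the grid-stride loop absorbs any N, no per-size tuning
sum_partials<<<blocks, threads>>>(buf, out, numHalf2, K);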

Yes, I mean solution 2. I don't have other compute work, as the reduction has to be executed after the first kernel completes. Do I still need to use shared memory? It seems to have no use.

No, for the simplest solution you do not need shared memory.

Within a loop each thread adds up its half2 values.

That is done in a single summation register.

Very trivial.

I would recommend converting to float after each load for the addition and converting back at the end of the loop.

More involved:

If you want to do the tree-bracket addition (with half or with float), you could either #pragma unroll the loop and use a few more registers, perhaps 7 or 8 per thread, managed as a small local array, or use shared memory.

It's like in sports with the quarterfinals, semifinals, and final. You can compute the indices in memory and in the local array. After unrolling, all indices are resolved at compile time.

This tree bracket is officially called pairwise summation or cascade summation. I could not remember the term at first.
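
A hypothetical sketch of that bracket for one half2 column, padding K up to a compile-time power of two KP so that full unrolling turns the small local array into registers (layout and names as in the earlier sketches; this is the float-accumulation variant):

constexpr int KP = 8;                        // next power of two >= K (here K < 6)

float vx[KP], vy[KP];
#pragma unroll
for (int k = 0; k < KP; k++) {               // load the "entrants", zero-pad past K
    half2 t = (k < K) ? memory[n1][k][n0] : __float2half2_rn(0.0f);
    vx[k] = __half2float(t.x);
    vy[k] = __half2float(t.y);
}

#pragma unroll
for (int stride = KP / 2; stride > 0; stride /= 2) {   // quarterfinal, semifinal, final
    #pragma unroll
    for (int k = 0; k < stride; k++) {
        vx[k] += vx[k + stride];
        vy[k] += vy[k + stride];
    }
}
// vx[0] / vy[0] now hold the pairwise sums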

But as K is relatively small, can tree-bracket addition reduce any computation?

Cascade Summation

It is not for reducing computation, but for increasing accuracy (reducing rounding errors). Your sum is larger than the single elements, so the mantissa bits of the summation (11 bits for FP16) have to serve both as high bits (for the current sum) and as low bits (for the new value to be added), and the low bits are dropped. For example, in FP16, 2048 + 1 rounds back to 2048. In cascade summation, by contrast, roughly equally large values are summed up.

Cascade summation has the same computational effort as linear summation, but higher precision.

Recommendation: Linear Float Summation

I would just linearly sum up with float values and conversion.

That could increase the computational effort, as you need conversions. Float summation could be faster (native 32-bit size) or slower (more bits) than half summation.

But it is simple and effective. FP32 has 24 mantissa bits, FP16 has 11 mantissa bits (each time including the implicit 1 before the decimal point). That is more than enough for small K. (At least as long as your values do not typically cancel out: 2^30 + 1 - 2^30 = 1, but you would need 31 bits to compute that correctly.)

If you say numerical precision is no problem, you can just sum up linearly with half2.
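
In that case a purely half2 linear loop is enough, e.g. with the packed add intrinsic from cuda_fp16.h (same hypothetical layout and names as before):

half2 sum = __float2half2_rn(0.0f);
for (int k = 0; k < K; k++)
    sum = __hadd2(sum, memory[n1][k][n0]);   // adds both half lanes per iteration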

I think precision is not a problem. I just need to match the precision of

output.sum(0);

which is a way to treat the result array as a 2-dim tensor and use the tensor API.

Is this Python? From Wikipedia:

Pairwise summation is the default summation algorithm in NumPy[9] and the Julia technical-computing language

Yes, it's an API from PyTorch.

You can always optimize precision later.

However, why not convert to float during the (linear) summation?

Your summation speed is likely memory bound, and the local conversions will probably not make a difference.

So you load the half values, convert them, add them, convert the sum back to half, and store it.

float sumx = 0.0f;
float sumy = 0.0f;
int n0 = threadIdx.x;                    // 0..31 (one warp per 128-byte chunk)
// n1 = index of the 128-byte chunk this warp handles (derived from blockIdx/threadIdx)

for (int k = 0; k < K; k++) {            // K partial sums per element (K < 6 here)
    half2 temp = memory[n1][k][n0];      // coalesced load from the [.][K][32] buffer
    sumx += __half2float(temp.x);        // accumulate in float
    sumy += __half2float(temp.y);
}

half2 result = __floats2half2_rn(sumx, sumy);   // convert back to half2 before storing