Reduce choice

In my case, I will have an array A (length N) to store my result. However, for some reason, each element of A needs to be computed by K (not fixed, but less than 6) thread blocks, as the sum of one value per block. I just wonder which way makes this kernel fastest?

  1. Use atomicAdd
  2. Use a very large buffer (length K * N) to store the results without any conflict. After the first kernel finishes, launch another reduction kernel to sum them.

If the second way is better, should I write the kernel manually or use some library? Which library can optimize this case best (as K is relatively small)?

So your final result is N values?

  3. In theory you also have a third possibility: have one actual thread block (possibly in a loop) compute the work of all K thread blocks and keep the intermediate sum in registers or shared memory (those memory accesses are more performant, and so are atomic operations in shared memory).

Best try it out for your architecture.

  1. atomicAdd is reasonably fast on recent architectures.
  2. This should be very easy and short to do, even manually; otherwise you could e.g. use Thrust (see the sketch at the end of this post).

It also depends on your requirements: latency vs. bandwidth.

You can probably program both summation solutions in 10 minutes altogether.

2b) If you need the result only once in another kernel, you could move the final addition there instead.
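
For completeness, here is a sketch of how the second reduction could look with Thrust instead of a hand-written kernel. This is only an illustration under assumptions not stated above: the K * N buffer is taken to hold float partial sums laid out as [N][K] (each element's K partials contiguous), and all names are made up.

#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

struct SegmentKey {                       // maps flat index i to its output element i / K
    int K;
    __host__ __device__ int operator()(int i) const { return i / K; }
};

void reduce_partials(const thrust::device_vector<float>& partials,  // size K * N
                     thrust::device_vector<float>& A,               // size N
                     int N, int K)
{
    auto keys = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                SegmentKey{K});
    thrust::reduce_by_key(keys, keys + N * K,                // segment id = i / K
                          partials.begin(),                  // partial sums
                          thrust::make_discard_iterator(),   // segment ids not needed
                          A.begin());                        // N summed results
}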

Yes, the final result is N values. I run my code on an A100.
As N can range from 5e4 to 1e7, I don't know how to make a reasonable split of all N elements and assign each part to a thread block. Or, in short, how can I easily decide how many elements a thread block should deal with? Just tune for every size?

I don't understand why atomicAdd might be faster than using another kernel to reduce. Doesn't atomicAdd need to fetch data from gmem and write it back? The number of memory accesses and computations is the same as for a reduction, and it even has the atomic overhead on top.

Solution 1

atomicAdd (for supported data types) is executed inside the memory architecture. There are small addition engines integrated. So the data is not transferred to the threads and back.

The difference from some of the reduce instructions is that you do not use a return value in the thread; you just send the result into the memory architecture.
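
As a minimal, hypothetical sketch of how solution 1 could end the first kernel (it assumes each thread contributes one float, each block produces one partial sum, every K consecutive blocks feed the same output element, and out is zeroed beforehand; your actual mapping may differ):

__global__ void partial_sum_atomic(const float* in, float* out, int K)
{
    __shared__ float blockSum;                      // per-block partial sum
    if (threadIdx.x == 0) blockSum = 0.0f;
    __syncthreads();

    float v = in[blockIdx.x * blockDim.x + threadIdx.x];
    atomicAdd(&blockSum, v);                        // shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(&out[blockIdx.x / K], blockSum);  // return value is ignored
}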

Solution 2

You could access the array as:
int[N/32][K][32]

That is good for storing and reading. Both are coalesced. And with larger N, the array just gets larger. You could just allocate a sufficiently large amount, e.g. 1 GiB.

int[K][N] would also work, but be more cumbersome for changing N.

You can use any type other than int as well, even 8-byte types.
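
To illustrate the index math for that layout, flattened into a single 1D allocation (names are illustrative; lane n % 32 varies fastest, so 32 consecutive n for a fixed k form one contiguous 128-byte line):

__device__ __forceinline__ size_t buf_index(size_t n, int k, int K)
{
    size_t chunk = n / 32;                        // which group of 32 elements
    int    lane  = (int)(n % 32);                 // position inside the group
    return (chunk * (size_t)K + k) * 32 + lane;   // [N/32][K][32] order
}

// first kernel:   buffer[buf_index(n, k, K)] = partial;
// reduce kernel:  sum += buffer[buf_index(n, k, K)];   // for k = 0 .. K-1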

The data type is half. In my first kernel, each thread block deals with 128 * 128 B of data (within each 128 B chunk the data is consecutive, but there is no guarantee across different 128 B chunks).

atomicAdd supports half2. That optimizes the memory accesses compared to half.
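
For illustration, a minimal hypothetical use of the half2 overload (compute capability 6.x or newer, cuda_fp16.h included; it assumes a block's two partial sums belong to two adjacent output elements, which may or may not match your mapping):

#include <cuda_fp16.h>

// pairSum holds a block's partial sums for output elements 2*j and 2*j+1
__device__ void add_pair(half2* A2, int j, half2 pairSum)
{
    atomicAdd(&A2[j], pairSum);   // updates both half lanes with one call
}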

If you do not know beforehand where the 128 B blocks go (especially to which other 128 B blocks they are added), solution 2 could be more complicated. Just start with solution 1. To determine the optimization potential without atomics, you could implement solution 2 as if the blocks were consecutive and deterministic. That would be an upper bound for the speed of solution 2.

Important

Also consider whether you want to add up half values. The numerical errors of floating-point data types will accumulate. You could convert to (at least) float.

Normally to avoid numerical errors, one could add up numbers like a tree. That is possible with solution 2.
(Even better would be a sorted tree. But let’s not overdo it.)
atomicAdd would add up linearly instead.

I don't understand the meaning of int[N/32][K][32]. Do you mean I should make each warp just sum up 32*K elements?

A warp could store 32 half2 elements (=64 half elements) with one instruction. That would be the 128 bytes you mentioned.

It could store additional data in further instructions or loop iterations.

(It is not the only way to partition the memory array.)

For summing up, each warp could do 32 half2 sums, each over K values. Or it could do more in a (e.g. grid-stride) loop.
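
A hypothetical grid-stride version of that summation, assuming the buffer holds half2 values in the [numHalf2/32][K][32] order from above (flattened to 1D) and accumulating in float:

#include <cuda_fp16.h>

__global__ void sum_partials(const half2* buf, half2* out, size_t numHalf2, int K)
{
    // each thread owns one half2 output; the grid-stride loop covers any N
    for (size_t n = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         n < numHalf2;
         n += (size_t)gridDim.x * blockDim.x)
    {
        float sx = 0.0f, sy = 0.0f;
        for (int k = 0; k < K; k++) {
            half2 t = buf[(n / 32 * K + k) * 32 + n % 32];   // [.][K][32] layout
            sx += __half2float(t.x);
            sy += __half2float(t.y);
        }
        out[n] = __floats2half2_rn(sx, sy);
    }
}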

I mean, I can't decide how many threads a thread block should have for this operation. Also, how many elements a single thread block should deal with is a problem. I think this might be a very simple question, but I don't have a tutorial for it.

You mean for solution 2 and the adding up part?
Each addition of K elements is independent.
You could use the maximum possible number of threads, e.g. 2 thread blocks per SM and 1024 threads per block. I think the A100 has a maximum of 2048 threads per SM?

But the problem is probably memory bound, and fewer threads would give the same performance. Use fewer threads if you can do other compute work (other kernels) in parallel.
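
As a rough sketch of such a launch configuration, reusing the hypothetical sum_partials kernel from above (2 blocks per SM is just the starting point mentioned here, not a tuned value):

int device = 0, numSMs = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

int threads = 1024;          // 2 blocks x 1024 threads = 2048 threads per SM
int blocks  = 2 * numSMs;    // the grid-stride loop absorbs any N, no per-size tuning
sum_partials<<<blocks, threads>>>(buf, out, numHalf2, K);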

Yes, I mean solution 2. I don't have other compute work, as the reduction has to be executed after the first kernel completes. Do I still need to use shared memory? It seems to have no use.

No, for the simplest solution you do not need shared memory.

Within a loop each thread adds up its half2 values.

That is done in a single summation register.

Very trivial.

I would recommend converting to float after each load for the addition and converting back at the end of the loop.

More involved:

If you want to do the tree-bracket addition (with half or with float), you could either #pragma unroll the loop and use a few more registers, perhaps 7 or 8 per thread, managed as a small local array, or use shared memory.

It's like in sports with the quarterfinals, semifinals, and final. You can compute the indices in memory and in the local array. After unrolling, all indices are resolved at compile time.

This tree bracket is officially called pairwise summation or cascade summation. I could not remember the term at first.
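
A hypothetical sketch of that bracket for one half2 column, padding K up to a compile-time power of two KP so that full unrolling turns the small local array into registers (layout and names as in the earlier sketches; this is the float-accumulation variant):

constexpr int KP = 8;                        // next power of two >= K (here K < 6)

float vx[KP], vy[KP];
#pragma unroll
for (int k = 0; k < KP; k++) {               // load the "entrants", zero-pad past K
    half2 t = (k < K) ? memory[n1][k][n0] : __float2half2_rn(0.0f);
    vx[k] = __half2float(t.x);
    vy[k] = __half2float(t.y);
}

#pragma unroll
for (int stride = KP / 2; stride > 0; stride /= 2) {   // quarterfinal, semifinal, final
    #pragma unroll
    for (int k = 0; k < stride; k++) {
        vx[k] += vx[k + stride];
        vy[k] += vy[k + stride];
    }
}
// vx[0] / vy[0] now hold the pairwise sums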

But as K is relatively small, can tree-bracket addition reduce any computation?

Cascade Summation

It is not for reducing computation, but for increasing accuracy (reducing rounding errors). Your sum is larger than the single elements, so the mantissa bits of the summation (11 bits for FP16) have to serve both as high bits (for the current sum) and as low bits (for the new value to be added), and the low bits are dropped. For example, in FP16, 2048 + 1 rounds back to 2048. In cascade summation, by contrast, roughly equally large values are summed up.

Cascade summation has the same computational effort as linear summation, but higher precision.

Recommendation: Linear Float Summation

I would just linearly sum up with float values and conversion.

That could increase the computational effort, as you need conversions. Float summation could be faster (native 32-bit size) or slower (more bits) than half summation.

But it is simple and effective. FP32 has 24 mantissa bits, FP16 has 11 mantissa bits (each time including the implicit 1 before the decimal point). That is more than enough for small K. (At least as long as your values do not typically cancel out: 2^30 + 1 - 2^30 = 1, but you would need 31 bits to compute that correctly.)

If you say numerical precision is no problem, you can just sum up linearly with half2.
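
In that case a purely half2 linear loop is enough, e.g. with the packed add intrinsic from cuda_fp16.h (same hypothetical layout and names as before):

half2 sum = __float2half2_rn(0.0f);
for (int k = 0; k < K; k++)
    sum = __hadd2(sum, memory[n1][k][n0]);   // adds both half lanes per iteration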

I think precision is not a problem. I just need to match the precision of

output.sum(0);

which is a way to treat the result array as a 2-dim tensor and use the tensor API.

Is this Python? From Wikipedia:

Pairwise summation is the default summation algorithm in NumPy[9] and the Julia technical-computing language

Yes, it's an API from PyTorch.

You can always optimize precision later.

However, why not convert to float during the (linear) summation?

Your summation speed is likely memory bound, and the local conversions will probably not make a difference.

So you load the half values, convert them, add them, convert the sum back to half, and store it.

float sumx = 0.0f;
float sumy = 0.0f;
int n0 = threadIdx.x;                    // 0..31 (one warp per 128-byte chunk)
// n1 = index of the 128-byte chunk this warp handles (derived from blockIdx/threadIdx)

for (int k = 0; k < K; k++) {            // K partial sums per element (K < 6 here)
    half2 temp = memory[n1][k][n0];      // coalesced load from the [.][K][32] buffer
    sumx += __half2float(temp.x);        // accumulate in float
    sumy += __half2float(temp.y);
}

half2 result = __floats2half2_rn(sumx, sumy);   // convert back to half2 before storing