Perform uncoalesced write or perform in-warp stream compaction before write?


As a result of some heavy computation (a matrix inversion), we’re computing a lot of 32-element-wide vectors that are expected to have only a few significant values each. To save storage space and memory bandwidth, we intend to store a sparse representation in a large result buffer, with the vector elements indexed and also quantized.

Each warp of a thread block basically computes one such vector. In the example given below, the complex elements A, B, C, D, E have a magnitude larger than a given threshold, while the remaining elements of the vector would be considered negligible (i.e. 0):

0 0 A 0 0 0 0 B 0 0 0 0 0 0 0 C 0 0 0 0 0 0 0 0 0 D 0 0 0 0 0 E

After a stream compaction this vector could be sparsely described as:

2:A 7:B 15:C 25:D 31:E

(each entry is neatly packed into a uint32_t type, with the complex elements being separated into magnitude and phase and quantized)
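As an illustration, one possible layout could look like the following — the bit widths here are pure assumptions for the sketch, not a statement of the actual format:

```cuda
#include <cstdint>

// Hypothetical packing layout (bit widths are assumptions, not from the post):
//   bits 27..31 : 5-bit element index within the 32-wide vector
//   bits 13..26 : 14-bit quantized magnitude
//   bits  0..12 : 13-bit quantized phase
__host__ __device__ inline uint32_t pack_entry(uint32_t index,
                                               uint32_t mag_q,
                                               uint32_t phase_q)
{
    return (index << 27) | ((mag_q & 0x3FFFu) << 13) | (phase_q & 0x1FFFu);
}
```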

Now the question is whether we can afford to bite the bullet and perform an uncoalesced write into global memory (thread 2 writes 2:A, thread 7 writes 7:B, thread 15 writes 15:C, and so on…), because the kernel is heavily compute bound anyway.

Would it even be worth the effort to do a stream compaction within the warp, followed by a coalesced write? Implementing the compaction isn’t entirely trivial, and I don’t want to go through with it just to find out that it isn’t any faster.

What’s your gut feeling about this?

Also, would anyone here have a code example of how to do in-warp stream compaction entirely with warp shuffles? So far I’ve only ever implemented this with shared memory.
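For what it’s worth, the index computation for an in-warp compaction doesn’t strictly need shuffles at all: `__ballot_sync` gathers the per-lane predicate into a 32-bit mask, and `__popc` over the lower lanes gives each significant lane its densely packed output slot. A minimal sketch, assuming a simple threshold predicate and an already-packed uint32_t value per lane:

```cuda
#include <cstdint>

// In-warp stream compaction without shared memory. Each lane decides whether
// its element is significant; __ballot_sync collects these decisions into a
// mask, and __popc over the lanes below gives the compacted output index.
__device__ void warp_compact_store(float magnitude, float threshold,
                                   uint32_t packed, uint32_t *warp_out)
{
    const unsigned lane = threadIdx.x & 31u;
    const bool keep = magnitude > threshold;              // assumed predicate
    const unsigned mask = __ballot_sync(0xffffffffu, keep);
    if (keep) {
        // Number of significant lanes below this one = this lane's slot.
        const unsigned slot = __popc(mask & ((1u << lane) - 1u));
        warp_out[slot] = packed;                          // coalesced write
    }
    // __popc(mask) is the total number of entries this warp produced.
}
```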

Another interesting issue is the necessary concatenation of the per-warp and per-thread-block results into the result buffer. As each warp generates a variable number of significant values, the position at which later warps and thread blocks have to write their results depends on all previous warps’ results. This is a tricky problem that seems to require some pretty delicate buffer locking mechanism using global atomics.
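If the order of the per-warp blocks in the result buffer doesn’t need to be deterministic, no locking is required at all: a single global atomicAdd per warp reserves a contiguous region. A sketch under that assumption (`g_write_pos` is assumed to be zeroed before the kernel launch):

```cuda
// One atomicAdd per warp, issued by lane 0 and broadcast to the other lanes
// with a shuffle, reserves `count` contiguous slots in the result buffer.
// The per-warp results land in arbitrary order, but no locking is needed.
__device__ unsigned int g_write_pos;   // zeroed before launch (assumption)

__device__ unsigned int warp_reserve(unsigned int count)
{
    const unsigned lane = threadIdx.x & 31u;
    unsigned int base = 0;
    if (lane == 0)
        base = atomicAdd(&g_write_pos, count);
    // Broadcast lane 0's base offset to the whole warp.
    return __shfl_sync(0xffffffffu, base, 0);
}
```

If a deterministic layout is required, the usual alternative is a separate pass that prefix-scans the per-warp counts into fixed offsets.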


Writes are “fire and forget” operations, so coalescing for writes is much less critical than for reads. Since this solution is trivial to code, I would propose giving that a try and looking into the compaction approach only if the performance at app level isn’t acceptable.

Thanks for the input regarding coalescing for writes.

Slight change of plan. I think I will perform an in-warp bitonic sort of the coefficients by magnitude.

This allows me to choose a power based cutoff criterion where I can preserve a given percentage of the total power in the vector (this vector is a precoding vector for an antenna system, by the way). The bitonic sorting step also performs the intended stream compaction.

For the summation of the powers I can use a parallel scan primitive, and the cutoff is simply made at the lane ID where the total power exceeds a given percentage of the sum power in the last lane (the total).
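The warp-level inclusive scan of the per-lane powers can be done with `__shfl_up_sync`, and the cutoff test then compares each lane’s running sum against a fraction of the lane-31 total. A sketch of that idea — the `fraction` parameter and the descending-magnitude pre-sort are assumptions taken from the description above:

```cuda
// Warp-wide inclusive prefix sum of per-lane power via __shfl_up_sync,
// followed by the power-based cutoff: keep every lane whose running sum has
// not yet reached the given fraction of the total power (lane 31's sum).
__device__ bool power_cutoff_keep(float power, float fraction)
{
    const unsigned lane = threadIdx.x & 31u;
    float sum = power;
    for (int offset = 1; offset < 32; offset <<= 1) {
        const float n = __shfl_up_sync(0xffffffffu, sum, offset);
        if (lane >= offset) sum += n;
    }
    const float total = __shfl_sync(0xffffffffu, sum, 31);
    // After sorting by descending magnitude, the leading lanes carry the
    // most power; keep lanes up to and including the one that crosses the
    // target fraction.
    return (sum - power) < fraction * total;
}
```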

I had to make a trip to Wikipedia to find out that “bitonic sort” ~= sorting network. Yes, sorting networks are great for sorting short vectors in parallel and thus a good match for the GPU. A sorting network for 32 elements seems a bit daunting to me, however, as I have only used them for up to 10 items or so, using known optimal sorting networks. I guess the regularity of construction in the bitonic approach makes sorting 32-tuples feasible?

Sorting 32 elements on a sequential processor (or in hardware) with Batcher’s odd-even merge sort requires only 191 compare-exchanges across 15 stages.

Unfortunately, optimal sorting networks aren’t always regular and easily mappable to SIMT.

A symmetric SIMT-friendly sorting network is a Bitonic sort. It’s 240 comparisons and also requires only 15 stages.

In this case, the number of stages is what matters for a warp-centric implementation, since it determines how many consecutive warp-wide SHFL operations will be needed.
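For reference, a warp-wide bitonic sort of one key per lane can be written with `__shfl_xor_sync` alone; the five outer rounds contain 1+2+3+4+5 = 15 butterfly stages, i.e. one SHFL each, matching the stage count above. This sketch sorts ascending; swap fminf/fmaxf (or sort on negated magnitude) for descending order:

```cuda
// Warp-wide bitonic sort of one float per lane using only __shfl_xor_sync.
// 15 stages total, one warp shuffle per stage.
__device__ float warp_bitonic_sort(float x)
{
    const unsigned lane = threadIdx.x & 31u;
    for (unsigned k = 2; k <= 32; k <<= 1) {          // merge block size
        for (unsigned j = k >> 1; j > 0; j >>= 1) {   // butterfly distance
            const float partner = __shfl_xor_sync(0xffffffffu, x, j);
            const bool ascending = ((lane & k) == 0); // block sort direction
            const bool lower     = ((lane & j) == 0); // lower half of pair
            x = (lower == ascending) ? fminf(x, partner)
                                     : fmaxf(x, partner);
        }
    }
    return x;   // lane i now holds the i-th smallest value in the warp
}
```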