As a result of some heavy computation (a matrix inversion), we're producing a lot of 32-element-wide vectors that are expected to have only a few significant values each. To save storage space and memory bandwidth, we intend to store a sparse representation in a large result buffer, with the vector elements being indexed and also quantized.
Each warp of a thread block basically computes one such vector. In the example below, the complex elements A, B, C, D, E have a magnitude larger than a given threshold, and the remaining elements of the vector are considered negligible (i.e. 0):
0 0 A 0 0 0 0 B 0 0 0 0 0 0 0 C 0 0 0 0 0 0 0 0 0 D 0 0 0 0 0 E
After a stream compaction, this vector could be sparsely described as:
2:A 7:B 15:C 25:D 31:E
(each entry is neatly packed into a uint32_t type, with the complex elements being separated into magnitude and phase and quantized)
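For concreteness, here is one way such a packing could look. The bit widths (5-bit index, 14-bit magnitude, 13-bit phase) and the scaling are assumptions for illustration, not the actual format:

```cpp
#include <cstdint>
#include <cmath>

// Hypothetical packed layout: [31:27] index | [26:13] magnitude | [12:0] phase.
// mag_scale maps the float magnitude into the 14-bit quantization range.
static uint32_t pack_entry(uint32_t index, float magnitude, float phase,
                           float mag_scale)
{
    const float PI = 3.14159265358979f;
    // quantize magnitude to 14 bits (saturating)
    uint32_t qmag = (uint32_t)fminf(magnitude * mag_scale, 16383.0f);
    // quantize phase in [-pi, pi) to 13 bits
    uint32_t qphase = (uint32_t)((phase + PI) * (8192.0f / (2.0f * PI))) & 0x1fffu;
    return (index << 27) | (qmag << 13) | qphase;
}
```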
Now the question is whether we can afford to bite the bullet and perform an uncoalesced write into global memory (thread 2 writes 2:A, thread 7 writes 7:B, thread 15 writes 15:C, and so on…), because the kernel is heavily compute-bound anyway.
Would it even be worth the effort to do a stream compaction within a warp, followed by a coalesced write? Implementing the stream compaction isn't entirely trivial, and I don't really want to implement it just to find out that it isn't any faster.
What’s your gut feeling about this?
Also, would anyone here have a code example of how to do in-warp stream compaction entirely with warp shuffles? So far I’ve only ever implemented this with shared memory.
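(For what it's worth, the simplest in-warp compaction doesn't actually need shuffles at all: `__ballot_sync` plus `__popc` gives each lane its output rank directly, which replaces the shuffle-based prefix sum. A sketch, with names and the base-offset convention assumed for illustration:)

```cuda
#include <cstdint>

// In-warp stream compaction without shared memory.
// Each lane holds one packed value; significant lanes write their values
// contiguously into warp_out. __ballot_sync collects a bitmask of the
// significant lanes, and __popc over the lanes below mine yields my
// output slot, so no shuffle loop or shared-memory scan is required.
__device__ void warp_compact_store(uint32_t packed, bool significant,
                                   uint32_t *warp_out)
{
    const unsigned lane   = threadIdx.x & 31u;
    // one bit per lane that has a significant value
    const unsigned ballot = __ballot_sync(0xffffffffu, significant);
    // exclusive count of significant lanes below me = my output index
    const unsigned rank   = __popc(ballot & ((1u << lane) - 1u));
    if (significant)
        warp_out[rank] = packed;   // contiguous, coalesced write
    // the warp's total count is __popc(ballot)
}
```

A shuffle-only variant (a `__shfl_up_sync` prefix sum over the 0/1 predicate) works too, but it costs five shuffle steps where the ballot/popc version costs two instructions.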
Another interesting issue is the necessary concatenation of per-warp and per-thread-block results into the result buffer. As each warp generates a variable number of significant values, the position at which later warps and thread blocks have to write their results depends on all previous warps' results. This is a tricky problem that requires some fairly delicate buffer locking mechanism using global atomics.
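(One common pattern avoids locking entirely: a single `atomicAdd` on a global counter reserves a contiguous slice per warp, and the base offset is broadcast to the other lanes with a shuffle. A sketch; `g_count` and `g_result` are illustrative names, not part of the original code:)

```cuda
#include <cstdint>

// Reserve a contiguous region of the result buffer for one warp's output.
// Lane 0 atomically bumps the global element counter by this warp's count;
// the returned old value is the warp's base offset, broadcast to all lanes.
// No lock is needed -- atomicAdd makes the reservation itself atomic.
__device__ uint32_t *reserve_warp_slice(uint32_t *g_result,
                                        unsigned int *g_count,
                                        unsigned int warp_total)
{
    const unsigned lane = threadIdx.x & 31u;
    unsigned int base = 0;
    if (lane == 0)
        base = atomicAdd(g_count, warp_total);  // one atomic per warp
    base = __shfl_sync(0xffffffffu, base, 0);   // broadcast lane 0's offset
    return g_result + base;
}
```

Note that the order of the slices in the buffer then depends on the (nondeterministic) order in which warps reach the atomic; if the output order matters, each entry would additionally need to carry its source vector's index.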