Will the Shader Execution Reordering (SER) feature be supported for CUDA kernels?

For example, when decoding Huffman-encoded data, the threads of a block have different numbers of bits to decode, which causes warp divergence. To fix the divergence, the data needs to be reordered, but then coalesced access is lost. To keep both coalesced access and low divergence, the threads themselves could be reordered, something like this (see the sketch after the list):

  • Get binary data
  • Reorder
  • Decode with less warp divergence
  • Reorder
  • Write result coalesced
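
A minimal sketch of what I mean, assuming one fixed-size item per thread, a 256-thread block, and an input length that is a multiple of 256; `bucket_of()` and `decode_item()` are placeholder stand-ins for the real (undisclosed) decoder, where the key would be something cheap like a code-length class:

```cpp
// Hypothetical sketch: stage items in shared memory, permute lanes into
// "buckets" of similar work, decode, then restore the original order so
// the final write stays coalesced.
__device__ int bucket_of(unsigned int v) { return v & 1; }            // stand-in key
__device__ unsigned int decode_item(unsigned int v) { return v ^ 1; } // stand-in decoder

__global__ void reorder_decode(const unsigned int* in, unsigned int* out, int n)
{
    __shared__ unsigned int items[256];    // staged input
    __shared__ unsigned int results[256];  // staged output
    __shared__ unsigned int perm[256];     // slot -> original lane
    __shared__ int counters[2];            // next free slot per bucket

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;   // assumes n % 256 == 0
    if (tid < 2) counters[tid] = 0;
    __syncthreads();

    items[tid] = in[gid];                  // 1. coalesced read
    int b = bucket_of(items[tid]);
    int slot = (b == 0) ? atomicAdd(&counters[0], 1)                   // bucket 0 fills
                        : blockDim.x - 1 - atomicAdd(&counters[1], 1); // from the front,
    perm[slot] = tid;                                                  // bucket 1 from the back
    __syncthreads();

    unsigned int src = perm[tid];           // 2.-3. reorder + decode: adjacent lanes
    results[src] = decode_item(items[src]); // now hold similar work, less divergence
    __syncthreads();

    out[gid] = results[tid];               // 4.-5. reorder back + coalesced write
}
```

The `perm[]` bookkeeping is exactly what I hope a SER-style call could replace. Note the sketch also dodges the real complication that Huffman items are variable-length, so partitioning the input is a separate problem.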

Is this possible without using any explicit sorting such as CUB? For some decoder code, on data with a high number of duplicated characters, decoding reaches only 50% of VRAM bandwidth. To approach 100% of VRAM bandwidth, I think the SER feature is required.
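
For comparison, the explicit-sort route I want to avoid would look roughly like this with `cub::BlockRadixSort`, sorting per-block (code length, original index) pairs; `d_len` and `d_perm` are hypothetical buffers, and the `TempStorage` is the shared-memory/occupancy cost I mention below:

```cpp
#include <cub/block/block_radix_sort.cuh>

// Hypothetical CUB-based reorder: sort (length, index) pairs per block so
// items with equal code lengths land in adjacent threads.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void block_sort_by_length(const unsigned char* d_len,
                                     unsigned int* d_perm, int n)
{
    using Sort = cub::BlockRadixSort<unsigned char, BLOCK_THREADS,
                                     ITEMS_PER_THREAD, unsigned int>;
    __shared__ typename Sort::TempStorage temp;    // the occupancy cost

    unsigned char keys[ITEMS_PER_THREAD];
    unsigned int  vals[ITEMS_PER_THREAD];
    int base = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {   // blocked arrangement
        int idx = base + threadIdx.x * ITEMS_PER_THREAD + i;
        keys[i] = (idx < n) ? d_len[idx] : 0xFF;   // pad the tail with a max key
        vals[i] = idx;
    }

    Sort(temp).Sort(keys, vals);  // vals becomes the permutation: sorted -> original

    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = base + threadIdx.x * ITEMS_PER_THREAD + i;
        if (idx < n) d_perm[idx] = vals[i];
    }
}
```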

Can you give some numbers? What is the range of bit sizes? How many codes are there? Is the code table hard-coded or dynamic?

Loading the data from global memory into the SMs should be done in a coalesced way (or with an asynchronous copy to shared memory), so VRAM bandwidth should not be compromised.
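
For example, a sketch (assuming a 256-thread block and a tile-aligned input) with `cooperative_groups::memcpy_async`, which compiles to `cp.async` on Ampere and newer:

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// One coalesced asynchronous copy per block stages the encoded input into
// shared memory, so the divergent decode never touches global memory
// directly. Assumes n is a multiple of the 256-word tile.
__global__ void staged_copy(const unsigned int* in, unsigned int* out, int n)
{
    __shared__ unsigned int tile[256];
    auto block = cg::this_thread_block();

    int base = blockIdx.x * 256;
    cg::memcpy_async(block, tile, in + base, sizeof(unsigned int) * 256);
    cg::wait(block);   // all 256 words are now in shared memory

    // ... the divergent decoding would read from tile[] here ...
    out[base + threadIdx.x] = tile[threadIdx.x];   // coalesced write-back
}
```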

How do you decode? With look-up tables?
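
If it is look-up tables, the usual single-lookup scheme (just a sketch of the common technique, not your code) peeks the maximum code length's worth of bits and fetches (symbol, length) in one access:

```cpp
constexpr int MAX_BITS = 7;   // longest code; 7 matches the web-text case below

struct LutEntry {
    unsigned char symbol;     // decoded character
    unsigned char length;     // bits actually consumed (1..MAX_BITS)
};

// Decode up to maxSyms symbols from a 64-bit, MSB-first bit buffer and
// return the number of bits consumed. The table has 1 << MAX_BITS entries;
// entry i covers every code that is a prefix of the bit pattern i.
__device__ int decode_run(unsigned long long bits, const LutEntry* lut,
                          unsigned char* outSym, int maxSyms)
{
    int consumed = 0;
    for (int i = 0; i < maxSyms && consumed + MAX_BITS <= 64; ++i) {
        unsigned int window =
            (unsigned int)(bits >> (64 - MAX_BITS - consumed)) & ((1u << MAX_BITS) - 1);
        LutEntry e = lut[window];   // one lookup instead of a bit-by-bit tree walk
        outSym[i] = e.symbol;
        consumed += e.length;       // variable advance: the source of the divergence
    }
    return consumed;
}
```

The variable advance by `e.length` is exactly where the lanes fall out of step, which would match the divergence you describe.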

Even though it's a very simple decoding algorithm (~30 lines of code), I can't openly post it here, as it would then become Nvidia's intellectual property, but the performance is this:

  • 50% bandwidth for 500 MB consisting of a few fully duplicated characters.
  • 25% bandwidth for a 4-character alphabet, like a genome sequence.
  • 12% bandwidth for normal text from a web site, duplicated up to 500 MB.

This is with coalesced reads/writes of the inputs/outputs, so the problem is not the memory operations but the overall latency added by warp divergence before the memory operations.

After checking Nsight Compute, the lowest-hanging fruit is warp divergence, and solving it with sorting adds coalescing problems and sorting latency (plus the shared-memory consumption of a sorting algorithm such as CUB's radix sort, which adds occupancy issues). SER would evade all of the occupancy/latency/coalescing issues, imho.

The test device is an RTX 4070, so normal text decodes at close to 60 GB/s, which is a result of the divergence problem. I hope to keep the code very short (30-40 lines) without losing performance. SER would add performance for only one line of code (or two, for reordering twice).

It's the simplest Huffman tree, with a depth of no more than 7 for text from the web, but the depth may increase to 32 for unbalanced input (assuming a maximum string size of 1-2 GB to decode).
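
For scale, if the decoder used a flat table with 2-byte entries like the sketch above, that depth bound decides where the table can live:

```cpp
#include <cstddef>

// Flat-LUT footprint as a function of tree depth, with 2-byte entries:
//   depth  7 -> 1 <<  7 =    128 entries ->  256 B  (trivially fits in shared memory)
//   depth 12 -> 1 << 12 =   4096 entries ->    8 KB (still fits)
//   depth 32 -> 1 << 32 = ~4.3e9 entries ->    8 GB (global memory only)
constexpr std::size_t lut_bytes(int depth) { return (std::size_t{1} << depth) * 2; }
static_assert(lut_bytes(7) == 256, "the web-text case stays tiny");
```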

Thank you.

Do you have to use global or local memory for the decoding itself, or only for reading the input and writing the decoded string?

If the depth goes up to 32, could the tree then no longer reside in registers or shared memory?

I'm reading 150 MB of encoded input and writing 500-600 MB of decoded text output. These have to be in global memory simply because of their size. Tree depth doesn't change the size requirement for registers/shared memory: I always allocate for the maximum possible amount required (i.e., depth 256), which fits fine into on-chip memories.
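
A sketch of the kind of layout I mean, where storage grows with the depth rather than with 2^depth (canonical Huffman shown as an illustration; my actual decoder differs): the per-length tables total well under 1 KB even at depth 256.

```cpp
constexpr int MAX_DEPTH = 256;

// Canonical-Huffman decode with per-length tables: counts[len] is the number
// of codes of length len; symbols[] lists the symbols sorted by (length,
// code). On-chip storage is ~(257 * 2 + 256) bytes no matter how unbalanced
// the tree gets. Codes longer than 64 bits would need multi-word arithmetic,
// omitted here.
__device__ unsigned char decode_one(const unsigned short* counts,   // [0..MAX_DEPTH]
                                    const unsigned char* symbols,
                                    const unsigned char* bitstream,
                                    unsigned long long& bitpos)
{
    unsigned long long code = 0, first = 0;
    unsigned int index = 0;
    for (int len = 1; len <= MAX_DEPTH; ++len) {
        code = (code << 1) | ((bitstream[bitpos >> 3] >> (7 - (bitpos & 7))) & 1u);
        ++bitpos;
        unsigned int cnt = counts[len];
        if (code - first < cnt)                         // one of the length-len codes
            return symbols[index + (unsigned int)(code - first)];
        index += cnt;
        first = (first + cnt) << 1;
    }
    return 0;   // malformed stream
}
```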

By on-chip memories, do you mean memory inside the SM (= registers and shared memory)? Or global device memory?

Or does depth 256 mean you reserve space for 2^256 entries?