Performance penalty due to warp divergence

Hello,

I am in the process of porting a data encryption/decryption algorithm to CUDA. For this purpose, I have successfully ported Chacha20 and Poly1305 to CUDA; both fuzz correctly against their reference implementations and already perform quite well. Now to the tricky part: I’m encrypting/decrypting variable-length packets, where both the length and the choice of encryption or decryption (a flag) are data-dependent.

__device__ void ProcessPacket(uint8_t* __restrict__ data) {
    if (df) {   // df set: decrypt path (MAC first, then cipher)
        poly1305_update(&polyState, &data[64], pHeader + pSize);
        chacha20_ietf_xor(data, data, pSize, data, 1, &data[16]);
    }
    if (!df) {  // df not set: encrypt path (cipher first, then MAC)
        chacha20_ietf_xor(data, data, pSize, data, 1, &data[16]);
        poly1305_update(&polyState, &data[64], pHeader + pSize);
    }
}

The divergence here clearly comes from both the (possible) branching on df (the decrypt flag) and from the variable pSize (packet size).

To validate my assumption of a warp-divergence issue, I benchmarked the code with a coherent df flag within one block, which reduced execution time for 100M iterations by nearly 30%. After also running with a coherent pSize, I saw an additional whopping 60% performance gain. Lastly, I profiled the code with Nsight Compute, which is very unhappy with my code:

This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device.

Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 4.8 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 12 warps per scheduler, this kernel allocates an average of 4.12 active warps per scheduler, but only an average of 0.21 warps were eligible per cycle

The warp state graph shows nearly 13.6 cycles just on Stall Wait.

The card in question is a 4090.

I’m new to CUDA, so I could really use some help, tips, or tricks on how to improve performance. Right now, warp divergence bottlenecks the code so much that a high-end CPU performs about equally. It’s worth noting that I can’t change the underlying algorithm, so the order of poly1305/chacha20 depending on df is forced (unless I use a buffer that is later passed to poly1305_update).

I compiled with optimization flags, obviously. I’m currently launching 1024 blocks with 256 threads each. I cannot “sort” the data in any way that would give me a coherent df or pSize within a thread group. The only thing I can think of is collecting the work across all blocks and threads and then grouping it somehow. Naturally, pSize and df are known before processing.

FWIW, I might be missing some CUDA basics here and the solution might be trivial. I appreciate any pointers in the right direction!

You don’t necessarily have to physically sort the data in order to arrange for reduced divergence. Since you already have flags that indicate the divergent paths, you could do a prefix sum on the flags array to arrange the threads at one end of your grid to handle flag case A and the threads at the other end of your grid to handle flag case B (in fact this idea is a component of a radix sort, after a fashion).

This may (badly) disrupt coalescing behavior, so it may not end up being useful for performance. But it squeezes the balloon in a different place, so it is worth trying. Yes, the prefix sum itself will cost something, so you would also have to see whether that added cost is offset by the gains.
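For illustration, here is a rough sketch of that idea using Thrust for the scan. Everything here (dfFlags, buildPermutation, the layout of one flag per packet) is invented for the example and is not taken from your code; the packet data itself is left untouched and only an index permutation is produced:

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Scatter packet indices so flagged packets occupy the front of the
// permutation and unflagged packets the back (a stable partition).
__global__ void scatterByFlag(const int* dfFlags, const int* prefix,
                              int numPackets, int numFlagged, int* perm)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPackets) return;
    if (dfFlags[i])
        perm[prefix[i]] = i;                      // prefix[i] = #flagged packets before i
    else
        perm[numFlagged + (i - prefix[i])] = i;   // remaining packets fill the back half
}

// Build the permutation on the device.
void buildPermutation(const thrust::device_vector<int>& dfFlags,
                      thrust::device_vector<int>& perm)
{
    const int n = dfFlags.size();
    thrust::device_vector<int> prefix(n);
    thrust::exclusive_scan(dfFlags.begin(), dfFlags.end(), prefix.begin());
    const int numFlagged = prefix[n - 1] + dfFlags[n - 1];   // total flagged packets

    perm.resize(n);
    const int block = 256;
    scatterByFlag<<<(n + block - 1) / block, block>>>(
        thrust::raw_pointer_cast(dfFlags.data()),
        thrust::raw_pointer_cast(prefix.data()),
        n, numFlagged,
        thrust::raw_pointer_cast(perm.data()));
}

Each thread would then look up its packet via perm[] instead of using its own index directly, so one end of the grid sees only one flag value and the other end sees the other.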

I guess you could also try this to see if it improves anything:

__device__ void ProcessPacket(uint8_t* __restrict__ data) {
    if (df) {
        poly1305_update(&polyState, &data[64], pHeader + pSize);
    }
    chacha20_ietf_xor(data, data, pSize, data, 1, &data[16]);  // unconditional: both paths run it
    if (!df) {
        poly1305_update(&polyState, &data[64], pHeader + pSize);
    }
}

(1) Nothing in the quoted profiler output indicates issues with warp divergence. If I interpret the quoted language correctly, the code suffers from poor memory access pattern(s), which in turn has the compute units spend much time waiting for data to arrive.

(2) The code snippet shows data as the only input to ProcessPacket. Yet ProcessPacket operates on a bunch of other data, such as df, polyState, pHeader, and pSize. Where are these coming from? Is this an extract from the actual code?

(3)

After running with coherent pSize

If by “coherent” you mean something like a compile-time (#defined?) constant value it wouldn’t be surprising for the compiler to be able to apply additional optimizations (e.g. full loop unrolling followed by improved load scheduling).

Generally speaking, for things like a binary flag as a function input, a good strategy can be to templatize the function, making the binary flag a template argument. If there are a handful of discrete packet sizes, the same approach can be applied as well, i.e. packet size becomes a template argument. This is a form of function specialization, which may enable improved code optimization. At run time, select the appropriate specialized instantiation to run, possibly via a function pointer.
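A minimal sketch of that idea, assuming C++17 (if constexpr) and guessing at the parameter list and the state type, since the original snippet doesn’t show where df, polyState, pHeader, and pSize come from:

// Flag resolved at compile time: each instantiation contains only one path,
// so there is no runtime branch on df left inside the function body.
template <bool DF>
__device__ void ProcessPacketT(uint8_t* __restrict__ data,
                               poly1305_state* polyState,   // hypothetical state type
                               int pHeader, int pSize)
{
    if constexpr (DF) {
        poly1305_update(polyState, &data[64], pHeader + pSize);
        chacha20_ietf_xor(data, data, pSize, data, 1, &data[16]);
    } else {
        chacha20_ietf_xor(data, data, pSize, data, 1, &data[16]);
        poly1305_update(polyState, &data[64], pHeader + pSize);
    }
}

// Runtime dispatch to the matching specialization.
__device__ void ProcessPacket(uint8_t* __restrict__ data, bool df,
                              poly1305_state* polyState, int pHeader, int pSize)
{
    if (df) ProcessPacketT<true>(data, polyState, pHeader, pSize);
    else    ProcessPacketT<false>(data, polyState, pHeader, pSize);
}

Within one warp this dispatcher still diverges on df, of course; the payoff comes when it is combined with some form of grouping so that a whole warp (or a separate kernel launch) runs a single instantiation.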


Please see my previous reply. I hope that answers your questions.

Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 4.8 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 12 warps per scheduler, this kernel allocates an average of 4.12 active warps per scheduler, but only an average of 0.21 warps were eligible per cycle

The warp state graph shows nearly 13.6 cycles just on Stall Wait.

Each scheduler is issuing 1 instruction every 4.8 cycles. This means the sum of the Warp State table (across all reasons) should be 4.8; however, you report that the warp state graph shows 13.6 cycles for Stall Wait. This doesn’t add up. Moreover, a Wait Stall of 13.6 is nearly impossible to generate with optimized builds.

Please check that you have -O2 or -O3 specified. You should also be able to tell by looking at the SASS in the NCU Source View: a debug build will contain a lot of STL (store to local) instructions and register zeroing used to extend variable state to the scope of the variable.

There are a few issues with divergence.

  1. Active warps can drop below Theoretical Warps Per Scheduler if warps in a thread block complete earlier. This is a form of tail effect. If there is a big difference between the Scheduler Statistics’ Theoretical Warps Per Scheduler and Active Warps Per Scheduler, the kernel has one of two issues:
    A. Warps in a thread block complete early. A new thread block cannot be launched until all warps of the resident block have completed (true for Ada and earlier).
    B. There is a tail effect at the end of the kernel, either due to the number of waves of thread blocks or simply due to the longer duration of some warps. The Launch Statistics’ Waves Per SM value (grid blocks / (SM count x blocks per SM)) will help identify whether the tail effect is caused by the grid launch.

  2. Thread divergence in a warp can be handled either by a branch or by predication. Generally, instruction predication is used for a small block of instructions. In terms of efficiency the goal is to have all 32 threads active and predicated on for each instruction. In the case of a branch, both Avg. Active Threads Per Warp and Avg. Not Predicated Off Threads Per Warp will be less than 32. If predication is used heavily, then Avg. Not Predicated Off Threads Per Warp will be less than Avg. Active Threads Per Warp. The Source View can be used to see these statistics per source line or per SASS instruction.

  3. The Eligible Warps Per Scheduler can be increased by increasing occupancy. The first step would be to resolve tail effects. With 4.12 Avg. Active Warps Per Scheduler the grid must be able to fit at least 3 thread blocks per SM (256 threads per block).
    If 3 blocks fit per SM, waves per SM = 1024 blocks / (3 x 128 SMs) ≈ 2.67
    If 4 blocks fit per SM, waves per SM = 1024 blocks / (4 x 128 SMs) = 2
    If the Occupancy section does not show that you can fit 4 blocks (1024 threads), then you may want to optimize the kernel (shared memory or register usage) so that 4 thread blocks fit. The benefit will be (a) increased eligible warps and (b) a reduced tail effect due to waves. A quick way to check the blocks-per-SM and waves numbers is sketched below.
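A host-side check of those numbers might look like this (ProcessPacketsKernel is just a stand-in name for your kernel; the 1024-block grid and 256-thread block size are taken from your description):

#include <cstdint>
#include <cstdio>

__global__ void ProcessPacketsKernel(uint8_t* data);   // stand-in for the real kernel

void reportWaves()
{
    // How many 256-thread blocks of this kernel fit on one SM?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, ProcessPacketsKernel, /*blockSize=*/256, /*dynamicSmem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Waves = grid blocks / (blocks per SM x SM count)
    const int gridBlocks = 1024;
    float waves = (float)gridBlocks / (blocksPerSM * prop.multiProcessorCount);
    printf("blocks per SM: %d, waves: %.2f\n", blocksPerSM, waves);
}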


You’re right. It appears I’ve made a mistake: when I profiled the code in Nsight Compute, I compiled without optimization flags and with -G. I’ve profiled the code again with -Xptxas -O3 and can determine the following:

The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device.

The memory access pattern for global loads in L1TEX might not be optimal. On average, this kernel accesses 1.7 bytes per thread per memory request; but the address pattern, possibly caused by the stride between threads, results in 12.4 sectors per request, or 12.4*32 = 395.5 bytes of cache data transfers per request. The optimal thread address pattern for 1.7 byte accesses would result in 1.7*32 = 55.7 bytes of cache data transfers per request, to maximize L1TEX cache performance.

The memory access pattern for local loads in L1TEX might not be optimal. On average, this kernel accesses 1.7 bytes per thread per memory request; but the address pattern, possibly caused by the stride between threads, results in 5.7 sectors per request, or 5.7*32 = 182.1 bytes of cache data transfers per request. The optimal thread address pattern for 1.7 byte accesses would result in 1.7*32 = 55.7 bytes of cache data transfers per request, to maximize L1TEX cache performance.

The memory access pattern for local stores in L1TEX might not be optimal. On average, this kernel accesses 1.7 bytes per thread per memory request; but the address pattern, possibly caused by the stride between threads, results in 5.4 sectors per request, or 5.4*32 = 172.8 bytes of cache data transfers per request. The optimal thread address pattern for 1.7 byte accesses would result in 1.7*32 = 55.7 bytes of cache data transfers per request, to maximize L1TEX cache performance.

Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 6.6 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 12 warps per scheduler, this kernel allocates an average of 1.95 active warps per scheduler, but only an average of 0.17 warps were eligible per cycle.

The 2.00 theoretical warps per scheduler this kernel can issue according to its occupancy are below the hardware maximum of 12

On average, each warp of this kernel spends 6.2 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. Find the instruction producing the data being waited upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality (coalescing), or by changing the cache configuration. Consider moving frequently used data to shared memory… This stall type represents about 48.3% of the total average of 12.9 cycles between issuing two instructions.

The majority of the time now seems to be spent on “Stall Long Scoreboard”. I’m trying to make sense of why this happens, especially considering that pretty much everything is supposedly in registers. Also, the global memory access with the random index isn’t the bottleneck, since commenting out everything but the memcpy onto data gives me a 20-fold speedup.

Why would memory access be such an issue when pSize and pHeader are random per thread, but not when they are static? I’m afraid that I might be entering territory where the compiler optimizes the Chacha20 implementation for the static size…

I will get a reference profile now where the parameters are static and see how that impacts things.

You’ve mentioned registers a few times now. I wouldn’t get hung up on that. The register keyword probably has no effect, and drawing conclusions from assumed register usage won’t be productive for you. If you really need to know about register utilization (which I doubt), you need to study the actual SASS code generated by the compiler. It’s nearly impossible to make guesses or determinations from source code or general expectations. For example, supposing that some particular variable “is stored in registers” is not sensible without looking at the SASS.

We don’t need to get wrapped around the axle on this; I’m not going to argue the point. The discovery that you’ve been profiling debug code is an important step forward, and it directly contradicts your earlier statement that you “compiled with optimization flags obviously”.

So I would clean the slate and start over, and not use preconceived notions about register usage.


You should not have to specify -Xptxas -O3 explicitly, since that is the nvcc default. In general, I would advise against manually modifying ptxas behavior with -Xptxas -O{n} unless you are investigating a specific compiler bug or code-generation issue.


I’m currently comparing a static size versus a variable size, both compiled with proper optimization flags. The static version reaches roughly 400 GB/s memory throughput compared to 200 GB/s and runs around twice as fast (I have not removed the decrypt flag yet). First observation: in the static-size version there are no warp stalls on STORE32, which is heavily used in the Chacha20 implementation (it’s a modified version of the reference implementation). What also stands out is that the “Access size” in the static-size version is 64 bytes and upwards, whereas in the dynamic-size version it is much smaller.
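To make concrete what I mean by access size: the difference is roughly between the following two ways of XORing a keystream into a buffer (purely illustrative, this is not my actual Chacha20 code):

// Byte-wise XOR: each thread issues 1-byte loads and stores.
__device__ void xorBytes(uint8_t* out, const uint8_t* in,
                         const uint8_t* keystream, int len)
{
    for (int i = 0; i < len; ++i)
        out[i] = in[i] ^ keystream[i];
}

// Vectorized XOR: each thread issues 16-byte loads and stores
// (assumes the pointers are 16-byte aligned and len is a multiple of 16).
__device__ void xorVec4(uint8_t* out, const uint8_t* in,
                        const uint8_t* keystream, int len)
{
    const uint4* in4  = reinterpret_cast<const uint4*>(in);
    const uint4* ks4  = reinterpret_cast<const uint4*>(keystream);
    uint4*       out4 = reinterpret_cast<uint4*>(out);
    for (int i = 0; i < len / 16; ++i) {
        uint4 a = in4[i], k = ks4[i];
        out4[i] = make_uint4(a.x ^ k.x, a.y ^ k.y, a.z ^ k.z, a.w ^ k.w);
    }
}

Whether something like the wider form is what the compiler actually emits for my code when the size is fixed is exactly what I’m unsure about.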

I guess that makes sense – if memory accesses are larger and thus coalesced, performance is much higher. Before digging further, I’d like to float the idea that the compiler optimized my Chacha20 implementation for processing exactly the number of bytes passed into ProcessPacket, even though I don’t see how it could have done so, considering the algorithm operates on 64-byte blocks…

May I ask if someone can chime in and tell me whether this is likely? It appears the culprit is memory access after all, although my accesses aren’t particularly complex; it’s just Chacha20 being run with a random size. One thing I should probably mention is that chacha20_ietf_xor is actually called as below:

__device__ void ProcessPacket(uint8_t* __restrict__ data) {
    if (df) {
        poly1305_update(&polyState, &data[64], pHeader + pSize);
        chacha20_ietf_xor(&data[pHeader], &data[pHeader], pSize, data, 1, &data[16]);
    }
    if (!df) {
        chacha20_ietf_xor(&data[pHeader], &data[pHeader], pSize, data, 1, &data[16]);
        poly1305_update(&polyState, &data[64], pHeader + pSize);
    }
}

Here, pHeader is not known at compile time. Perhaps this (random) offset is a contributing factor in why the compiler is not optimizing as much?

I figure that it might be hard for anyone to help me considering I’m only sharing specific parts. If I can add anything that would allow anyone to contribute better, please let me know.

Again, thanks for everyone’s help. It’s really appreciated and I hope to be able to give back to the community once I’m better versed in CUDA development.

Does nvcc consider asserts for optimization in any way?

Sorry to over-reply; I hope this doesn’t break any rules. If it does, please let me know.

I now proceeded to floor the variables:

    int pHeaderParse = data[8];
    int pSizeParse = data[4] * 16;
    // clamp both values to a fixed lower bound
    int pSize = (pSizeParse > 512) ? pSizeParse : 512;
    int pHeader = (pHeaderParse > 16) ? pHeaderParse : 16;

in a silly attempt to trick the compiler into optimizing more.

The result is equally silly, it’s about 10% faster.

I figure the right way to proceed is to rewrite and profile the code until it uses the larger memory accesses, as the fixed-size version does. Unfortunately, I lack the experience to pinpoint where to start; I’m pretty much doing black-box engineering here.

When pondering such issues, speculating how code may have been transformed under the hood is rarely useful. There is really no substitute for learning to decipher the generated machine code (SASS). cuobjdump --dump-sass provides a simple disassembly; you can also use nvdisasm for a somewhat more comprehensive set of functionality. It will take some time to familiarize oneself with SASS, as NVIDIA provides only a superficial description of the hardware instructions. I would suggest starting by looking at the translation of small functions.

Only an NVIDIA compiler specialist can answer that authoritatively. In my observation, no. One key technique enabling optimizations is to turn as much information as possible into compile-time constants. This could be #defined constants in the simplest case, or the use of template arguments, as I mentioned earlier. The CUDA compiler propagates constants aggressively, which in turn opens up opportunities for other optimizations.

Appropriate use of const and __restrict__ can also convey additional information to the compiler that may allow it to generate better code. See the Best Practice Guide for further ideas.
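For example (an illustrative signature only; the parameter names are not from the code shown above):

// const + __restrict__ tells the compiler these inputs are read-only and the
// pointers don't alias, which can enable the read-only data path and more
// aggressive load scheduling.
__global__ void ProcessPackets(uint8_t* __restrict__ packets,
                               const int*     __restrict__ packetSizes,
                               const uint8_t* __restrict__ decryptFlags)
{
    // ... per-packet work ...
}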

The CUDA profiler can help you pinpoint the location of bottlenecks. It’s a powerful tool, and as with any sophisticated code analyzer, you will need to spend some quality time with it to reap the full benefits. Generally speaking, experience grows with the amount of hands-on work one does on a given platform. Most people see attractive increases in application performance after “playing” with CUDA for a week or two (and might stop there if that’s all they need).

To get to ninja level will require months of intensive work and constant skills refreshment as CUDA continues to develop. That is not much different from, say, becoming an expert x86-64 SIMD performance tuner. From having worked with both traditional explicit-width SIMD and CUDA for many years, I would claim the CUDA way is more flexible and doesn’t come with the burden of re-learning a new set of primitives with each new SIMD generation (SSE, AVX, AVX-512), the CUDA compiler abstracting away most of that.
