SASS divergence prevention of non-divergent code

Am currently porting an ADS-B decoder (aircraft transponder receiver) from optimized CPU code to CUDA. Input is 2.5M I/Q samples/sec, converted on the CPU to 2.5M 8-bit magnitude samples/sec. Unlike my CPU versions, the GPU version leverages the plethora of registers to implement a true 320-byte sliding window across 16 threads (20 bytes of magnitude data, corresponding to one byte of receiver data, per thread). The code has multiple short-circuit tests to quickly determine when the current window holds invalid data and shift forward in as few cycles as possible. My design pattern always warp-shuffle-broadcasts the short-circuit result to all threads to prevent divergence. SASS appears unaware of this and inserts code to compensate for divergence that cannot happen. Current performance is dismal, and while I am unsure the divergence code is the culprit, it is certainly not improving things and it complicates code analysis. Adding explicit __syncwarp() calls at strategic locations makes no difference.
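
For context, the broadcast pattern looks roughly like this -- a minimal sketch with illustrative names (quickReject() stands in for my actual short-circuit tests, and the window load is elided):

    #include <cstdint>

    // Sketch only, not the actual decoder. Each of the 16 threads holds
    // 20 bytes of the 320-byte sliding window; a cheap validity test is
    // computed and lane 0's verdict is broadcast so every lane takes the
    // same branch.
    __device__ __forceinline__ int quickReject(const uint8_t w[20])
    {
        return w[0] < w[1];   // placeholder test (e.g. preamble pulse vs. noise)
    }

    __global__ void GPUWorker(uint8_t *out, uint32_t outlen,
                              const uint8_t *in, uint32_t inplen)
    {
        uint8_t window[20];   // 20 bytes/thread -> 320-byte warp-wide window
        // ... load window from 'in', then for each candidate position:
        int reject = quickReject(window);

        // Broadcast lane 0's verdict to all 16 lanes; the branch below is
        // then taken uniformly, so divergence should be impossible.
        reject = __shfl_sync(0x0000ffffu, reject, 0, 16);

        if (reject) {
            // shift the window forward in as few cycles as possible
        } else {
            // run the full decode on the current window
        }
    }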

Question-- is there any way to tell SASS that divergence cannot happen for a particular branch (or for all branches) so that it does not add reconvergence code?

Question-- do CALL/RET use device memory? Nsight Compute shows the expected global and L2->L1 traffic (256K samples per run) but shows 53 MB(!) of device-memory access (local/texture/surface/shared accesses are all zero). The PTX call/ret description glosses over where the “address of the next instruction” is actually stored.

For whatever reason, a number of my shfl.sync.idx instructions are translated into a weird abomination of branch and call instructions whose out-of-line path appears to duplicate the inline shfl instruction that follows. Example:

    GPUWorker<<<1, 16, 0, stream>>>(outbuf, outlen, inpbuf, inplen);

     <useful code here>
00b90d40:
     BRA.DIV 0x800b91550                      // this code path appears to duplicate the next line
     SHFL.IDX PT, R4, R52, 0x2, 0x101f // all code shown besides this could be removed
00b90d60:
     <useful code here>
     ...

(subroutine city)
00b91550:
     IMAD.MOV.U32 R4, RZ, RZ, R52 
     MOV R34, 0x16b0 
     IMAD.MOV.U32 R5, RZ, RZ, 0x2 
     IMAD.MOV.U32 R33, RZ, RZ, 0x101f 
     IMAD.MOV.U32 R36, RZ, RZ, -0x1 
     CALL.REL.NOINC 0x800b91770 
     IMAD.MOV.U32 R4, RZ, RZ, R33 
     BRA 0x800b90d60 

0x800b91770:
     WARPSYNC R36 
     SHFL.IDX PT, R33, R4, R5, R33    // this appears to be a dup of the inline shfl above
     IMAD.MOV.U32 R4, RZ, RZ, R34 
     IMAD.MOV.U32 R5, RZ, RZ, 0x0 
     RET.REL.NODEC R4 0x800b8ff00 

Based on the shfl width word (0x101f), I am guessing SASS thinks 32 threads were launched (though the explicit block size of 16 seems difficult to miss) and that the divergence is between the 16 working threads and the 16 threads never launched? (Changing the shfl thread mask from 0xffffffff to 0x0000ffff made no difference.) Appreciate any pointers!

“Current performance is dismal and while unsure the divergence code is the culprit”

I would suggest spending some quality time with the CUDA profiler to establish the root cause(s) of poor performance, before doing a deep dive on divergence.

At a macro level, the performance challenges stem from the overhead of the sliding window and the need to exchange state between threads to accept/reject a signal sample. That dependence limits parallelism, resulting in lots of stalls. The short-circuit evaluation approach (quickly identify invalid samples and shift forward) helps compensate, but it makes SASS emit divergence-handling instructions and the profiler emit warnings about them.

Ironically, when I tried a quick hack of dividing the sample frame in half and giving each half to its own thread set, it locked up (or was running deathly slow). It took a minute to realize that this was a legit case of thread divergence, as there was no synchronization of the short-circuit branches between the two thread sets (in the same warp). The code added by SASS failed to mitigate the divergence, bringing me back to my original query of whether it can be suppressed.
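
For the record, the broken hack amounted to something like this (illustrative sketch, reusing the quickReject() placeholder from the earlier snippet):

    __global__ void GPUWorkerSplit(uint8_t *out, uint32_t outlen,
                                   const uint8_t *in, uint32_t inplen)
    {
        // Two 16-thread sets in one 32-thread warp, each owning half of
        // the sample frame and making its own short-circuit decision.
        int half = threadIdx.x >> 4;                // 0 or 1
        uint8_t window[20];
        // ... load this set's half of the frame ...
        int reject = quickReject(window);

        // Each set broadcasts from its own leader lane (0 or 16), so the
        // two sets can reach DIFFERENT verdicts within the same warp.
        reject = __shfl_sync(0xffffffffu, reject, half * 16);

        if (reject) {
            // When only one set lands here, any full-warp-mask shuffle or
            // vote inside this branch waits on lanes that never arrive --
            // legitimate divergence that the SASS-added stubs cannot fix.
        }
    }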

Am assuming BRA.DIV means “branch if any program counters mismatch”. Every SHFL (or group of SHFLs) has an alternate divergence block included. In the non-divergent case, BRA.DIV appears to be a NOP. SASS also adds BSYNC/BSSY/BMOV barrier instructions around the main loop. The profiler indicates these cause some waits -- nothing material, just distracting.

The compiler backend ptxas handles machine-specific requirements such as divergence and convergence points. This is fully automatic and there are no programmer-accessible knobs to influence this, best I know. NVIDIA has never publicly documented the divergence mechanisms of their various GPU architectures and the related instructions in sufficient detail to enable a solid assessment of the quality of the compiler’s efforts in this regard. I am certainly not able to render such an assessment.

If you are quite convinced that the compiler is doing something sub-optimal here, you could always file a bug.

Finally figured out what is going on. I am running Turing (compute capability 7.5, CUDA 11.0), and there are some architecture-specific aspects. Turing does not have the “sync” variants of shfl, vote, etc. and must emulate them. If SASS decides it needs them, each PTX shfl.sync.type or vote.sync.type is replaced with a BRA.DIV to stub code that calls the emulated function, along with the non-sync instruction (which the BRA.DIV falls through to).

The emulated shfl.sync.idx is interesting: it is just WARPSYNC followed by SHFL. No idea why it does not inline that code and save all the overhead. The 2018 NVIDIA blog “Using CUDA Warp-Level Primitives” explicitly warns that WARPSYNC/SHFL/WARPSYNC is not a valid replacement; no idea why the blog contradicts the emulated instruction.

To my question of whether this code creates overhead: guessing the SASS folks think so, since the emulation is only included if SASS decides the surrounding code is a divergence risk. The latter is based on if-condition analysis. An if (__shfl_sync(0xffffffff, var, lane, 32)) { code } will trigger a divergence risk, whereas if (__any_sync(0x0000ffff, var > 0)) { code } will not. I don’t believe __shfl_sync() can cause divergence when both the mask and width match the warp size, and assume that is a missing optimization. Wrapping my short-circuit if-condition in __any_sync() eliminated the emulated code, as SASS is no longer concerned about divergence and instead uses the “non-sync” versions of shfl, vote, etc.
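
In code form, the change that eliminated the stubs was of this shape (sketch; reject stands in for my short-circuit value):

    // Before: branching directly on __shfl_sync -- ptxas flags the branch
    // as a divergence risk and emits the BRA.DIV emulation stubs.
    if (__shfl_sync(0xffffffff, reject, 0, 32)) {
        // shift window forward
    }

    // After: the same go/no-go decision expressed through __any_sync --
    // ptxas emits the plain non-sync SHFL/VOTE instructions inline.
    // (Any lane rejecting means shift, which matches my decision logic.)
    if (__any_sync(0x0000ffff, reject != 0)) {
        // shift window forward
    }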

I addressed my code performance issues with a partial 4-byte/sample unroll, refactoring to a single pre-filter short-circuit test, synchronizing the short-circuit decision across the rest of the warp (allowing a full warp), and dividing the input sample frame into multiple chunks to allow a larger block size. My overall approach of using the register/thread richness as a large shift register for sliding sample analysis probably needs a rethink, but it has been a good learning experience.
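
For anyone following along, the chunking looks roughly like this (sketch with assumed sizes; the slice/chunk names are mine, not from the real code):

    constexpr int LANES_PER_SLICE  = 16;   // one 320-byte window per slice
    constexpr int SLICES_PER_BLOCK = 8;    // assumed: 128-thread block

    __global__ void GPUWorker(uint8_t *out, uint32_t outlen,
                              const uint8_t *in, uint32_t inplen)
    {
        // Each 16-lane slice owns one chunk of the frame; chunks overlap
        // by one window length so no candidate message straddles a seam.
        int      slice = threadIdx.x / LANES_PER_SLICE;
        uint32_t chunk = inplen / SLICES_PER_BLOCK;
        uint32_t begin = slice * chunk;
        uint32_t end   = min(begin + chunk + 320u, inplen);

        // ... slide the window across in[begin, end) as before, with the
        // shift/short-circuit decision synchronized across the full warp
        // so the two slices sharing a warp always branch together.
    }

    // Launched with a larger block instead of a single 16-thread block:
    // GPUWorker<<<1, LANES_PER_SLICE * SLICES_PER_BLOCK, 0, stream>>>(
    //     outbuf, outlen, inpbuf, inplen);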