Am currently porting an ADS-B decoder (aircraft transponder receiver) from optimized CPU to CUDA. Input is 2.5M I/Q samples/sec CPU converted to 2.5M 8-bit magnitude samples/sec. Unlike my CPU versions, the GPU version leverages the plethora of registers to implement a true 320-byte sliding window across 16 threads (20-bytes of magnitude data corresponding to one byte of receiver data per thread). The code has multiple short-circuit tests to quickly determine when the current window has invalid data and shift forward in as few cycles as possible. My design pattern always warp shuffle broadcasts the short-circuit data to all threads to prevent divergence. SASS appears unaware and inserts code to compensate for divergence that cannot happen. Current performance is dismal and while unsure the divergence code is the culprit, it is certainly not improving things and it complicates code analysis. Adding explicit _syncwarp() calls at strategic locations makes no difference.
Question-- is there any way to tell SASS that divergence cannot happen for a particular branch (or for all) and don’t add reconvergence code?
Question-- do CALL/RET use device memory? Nsight Compute shows the expected global and L2->L1 traffic (256K samples per run) but shows 53MB! of Device Memory access (and local/texture/surface/shared access are all zero). The PTX call/ret description glosses over where the “address of the next instruction” is actually stored.
For whatever reason, a number of my shfl.sync.idx instructions are translated into a weird abomination of branch and call instructions before reaching the actual shfl instruction that appears to duplicate the subsequent shfl instruction. Example:
GPUWorker<<<1, 16, 0, stream>>>(outbuf, outlen, inpbuf, inplen); <useful code here> 00b90d40: BRA.DIV 0x800b91550 // this code path appears to duplicate the next line SHFL.IDX PT, R4, R52, 0x2, 0x101f // all code shown besides this could be removed 00b90d60: <useful code here> ... (subroutine city) 00b91550: IMAD.MOV.U32 R4, RZ, RZ, R52 MOV R34, 0x16b0 IMAD.MOV.U32 R5, RZ, RZ, 0x2 IMAD.MOV.U32 R33, RZ, RZ, 0x101f IMAD.MOV.U32 R36, RZ, RZ, -0x1 CALL.REL.NOINC 0x800b91770 IMAD.MOV.U32 R4, RZ, RZ, R33 BRA 0x800b90d60 0x800b91770 : WARPSYNC R36 SHFL.IDX PT, R33, R4, R5, R33 // this appears a dup of the inline shfl above IMAD.MOV.U32 R4, RZ, RZ, R34 IMAD.MOV.U32 R5, RZ, RZ, 0x0 RET.REL.NODEC R4 0x800b8ff00
Based on the shfl width word (0x101f), guessing SASS thinks 32 threads where launched (though the explicit block size of 16 seems difficult to miss) and the divergence is between the 16 working threads and the 16 never launched threads? (Changing the shfl thread-mask from 0xffffffff to 0x0000ffff made no difference). Appreciate any pointers!