Hello,
I am in the process of porting a data encryption/decryption algorithm to CUDA. For this purpose, I have successfully ported ChaCha20 and Poly1305 to CUDA; both fuzz cleanly against their reference implementations and already perform quite well. Now to the tricky part: I’m encrypting/decrypting variable-length packets, where both the length and whether to encrypt or decrypt (a flag) are data-dependent.
__device__ void ProcessPacket(uint8_t* __restrict__ data) {
    // df, pSize, pHeader and polyState are defined elsewhere (trimmed for brevity)
    if (df) {  // decrypt: authenticate the ciphertext first, then decrypt
        poly1305_update(&polyState, &data[64], pHeader + pSize);
        chacha20_ietf_xor(data, data, pSize, data, 1, &data[16]);
    } else {   // encrypt: encrypt first, then authenticate the ciphertext
        chacha20_ietf_xor(data, data, pSize, data, 1, &data[16]);
        poly1305_update(&polyState, &data[64], pHeader + pSize);
    }
}
The divergence here clearly comes from both the (possible) branching on df (the decrypt flag) and the variable pSize (packet size).
To validate my assumption of a warp-divergence issue, I benchmarked the code with a coherent df flag within one block, which reduced execution time for 100M iterations by nearly 30%. After also running with coherent pSize, I saw an additional whopping 60% performance gain. Lastly, I profiled the code with Nsight Compute, which is very unhappy with it:
This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device.
Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 4.8 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 12 warps per scheduler, this kernel allocates an average of 4.12 active warps per scheduler, but only an average of 0.21 warps were eligible per cycle.
The warp state graph shows nearly 13.6 cycles spent on Stall Wait alone.
The card in question is an RTX 4090.
I’m new to CUDA, so I could really use some help, tips, or tricks on how to improve performance. Right now, the warp divergence bottlenecks the code so badly that a high-end CPU performs on par with it. It’s worth noting that I can’t change the underlying algorithm, so the order of poly1305/chacha20 depending on df is fixed (unless I use a buffer that is later passed to poly1305_update; I’ve sketched that idea below).
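To illustrate that buffer idea: Poly1305 covers the ciphertext in both directions (the input when decrypting, the XOR output when encrypting), so snapshotting the authenticated region into a scratch buffer would let every thread run the same chacha20/poly1305 sequence regardless of df. A minimal sketch under my assumptions; the offsets are copied from the snippet above, while the explicit parameters, the scratch buffer, and the poly1305_state type name are placeholders for whatever the real port uses:

__device__ void ProcessPacketUniform(uint8_t* __restrict__ data,
                                     bool df, uint32_t pSize, uint32_t pHeader,
                                     poly1305_state* polyState,
                                     uint8_t* __restrict__ scratch) {
    // The authenticated bytes are always the ciphertext: the input when
    // decrypting (df set), the XOR output when encrypting. Snapshot
    // whichever applies, then call poly1305_update unconditionally so
    // the expensive calls are uniform across the warp.
    if (df)
        memcpy(scratch, &data[64], pHeader + pSize);  // ciphertext before decrypt
    chacha20_ietf_xor(data, data, pSize, data, 1, &data[16]);
    if (!df)
        memcpy(scratch, &data[64], pHeader + pSize);  // ciphertext after encrypt
    poly1305_update(polyState, scratch, pHeader + pSize);
}

Only the two small copies still diverge on df; the heavy chacha20/poly1305 calls become branch-free.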
I compiled with optimization flags, obviously; I’m currently launching 1024 blocks of 256 threads. I cannot “sort” the data itself in any way that would give me coherent df or pSize within a thread group. The only thing I can think of is taking the work of all blocks+threads together and then grouping it somehow; see the sketch after this paragraph. Naturally, pSize and df are known before processing.
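Concretely, since pSize and df are known up front, that grouping could be an indirection: sort an index array by a (df, pSize) key on the device and have each thread fetch its packet through that order, without moving the packet data itself. A rough sketch of what I mean, using Thrust (all the names here are mine, not from my actual code):

#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>

// Build a sort key: df in the high bits, pSize in the low bits, so packets
// with the same flag and similar sizes land next to each other.
struct MakeKey {
    __host__ __device__
    uint64_t operator()(const thrust::tuple<uint8_t, uint32_t>& t) const {
        return (uint64_t(thrust::get<0>(t)) << 32) | thrust::get<1>(t);
    }
};

// Produce 'order' such that processing packets in that order gives warps a
// coherent df and (mostly) coherent pSize. The packet data itself stays put.
void BuildCoherentOrder(const thrust::device_vector<uint8_t>& df,
                        const thrust::device_vector<uint32_t>& pSize,
                        thrust::device_vector<uint32_t>& order) {
    const int n = static_cast<int>(df.size());
    thrust::device_vector<uint64_t> keys(n);
    thrust::transform(
        thrust::make_zip_iterator(thrust::make_tuple(df.begin(), pSize.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(df.end(), pSize.end())),
        keys.begin(), MakeKey());
    order.resize(n);
    thrust::sequence(order.begin(), order.end());   // 0, 1, ..., n-1
    thrust::sort_by_key(keys.begin(), keys.end(), order.begin());
}

The kernel would then look up its packet as order[globalThreadIdx] instead of using the thread ID directly; whether the sort pays for itself presumably depends on how skewed df and pSize are.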
FWIW, I might be missing some CUDA basics here and the solution might be trivial. I appreciate any pointers in the right direction!