First of all, I’d like to thank the author for Advanced Performance Optimization in CUDA | NVIDIA On-Demand; it’s a really interesting, information-rich talk.
However, I’m confused about one of the examples demonstrating async data copies via Tensor Memory Accelerator (TMA).
Around 14:20 of the video, he mentions that a barrier’s internal transaction_count should be >= 0.
However, around 1:04:55 (slide 127 of the video == slide 73 of the PDF, also screenshotted so it’s clear which slide I mean), I think the st_async calls’ implicit (hardware-completion-fired) tx_count -= 16 decrements, which are applied to bar_next (i.e. the neighboring block’s bar), race with the tx_count += 16384 increment that the neighboring block’s own call to mbarrier_arrive_expect_tx applies to that same bar.
The blocks execute independently and the hardware completions are asynchronous, so I’d expect some of the -= 16 decrements could land before the += 16384 increment, meaning the barrier’s internal transaction count could temporarily dip below 0.
If I’m right, the CCCL PTX example may have the same issue.
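To make the ordering I’m worried about concrete, here is a condensed single-thread-per-block sketch of the pattern (loosely modeled on the CCCL st_async docs example; I’m writing the cuda::ptx calls from memory, so treat the exact signatures as approximate rather than authoritative):

```cpp
// Condensed sketch of the slide-127 pattern; cuda::ptx signatures from memory.
// Assumes a cluster of 2 blocks with 1 thread per block.
#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cuda/ptx>

__global__ void __cluster_dims__(2, 1, 1) kernel()
{
  namespace ptx = cuda::ptx;
  namespace cg  = cooperative_groups;
  cg::cluster_group cluster = cg::this_cluster();

  __shared__ int recv_buf[4];                       // 16 bytes to receive
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;
  init(&bar, 1);
  cluster.sync();  // guarantees every block's bar is initialized, nothing more

  // Map the neighboring block's barrier / buffer into this block's view.
  unsigned other = cluster.block_rank() ^ 1;
  auto* bar_self = cuda::device::barrier_native_handle(bar);
  auto* bar_next = cluster.map_shared_rank(bar_self, other);
  int*  buf_next = cluster.map_shared_rank(&recv_buf[0], other);

  // (A) Tell MY barrier to expect sizeof(recv_buf) incoming bytes
  //     (tx_count += 16 here; the += 16384 on the slide).
  auto token = ptx::mbarrier_arrive_expect_tx(
      ptx::sem_release, ptx::scope_cluster, ptx::space_shared,
      bar_self, sizeof(recv_buf));

  // (B) Store into the NEIGHBOR's buffer. When the store completes, the
  //     hardware applies tx_count -= 16 to the NEIGHBOR's barrier (bar_next),
  //     not to mine. Nothing I can see orders this completion after the
  //     neighbor has executed its own (A), so the neighbor's tx_count could
  //     briefly go negative.
  int vals[4] = {1, 2, 3, 4};
  ptx::st_async(buf_next, vals, bar_next);

  // (C) Wait for my own barrier phase: my expect_tx plus the neighbor's
  //     st_async into MY buffer must both have been counted.
  while (!ptx::mbarrier_try_wait(ptx::sem_acquire, ptx::scope_cluster,
                                 bar_self, token)) {}
}
```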
This question is not critical or time-sensitive for me, but it seemed weird enough to be worth sharing.