First of all, I’d like to thank the author for Advanced Performance Optimization in CUDA | NVIDIA On-Demand; it’s a really interesting, information-rich talk.
However, I’m confused about one of the examples demonstrating async data copies via Tensor Memory Accelerator (TMA).
Around 14:20 of the video, he mentions that a barrier’s internal transaction_count should be >= 0.
However, around 1:04:55 (slide 127 of the video == slide 73 of the PDF, also screenshotted so it’s clear which slide I mean), I think the st_async calls’ implicit (hardware-completion-fired) tx_count -= 16 decrements, which are applied to bar_next (i.e. the neighboring block’s bar), race with the tx_count += 16384 increment that the neighboring block’s own call to mbarrier_arrive_expect_tx applies to that same bar.
The blocks execute independently and the hardware completions are asynchronous, so I’d expect some of the -= 16 decrements could land before the += 16384 increment, meaning the barrier’s internal transaction count could temporarily dip below 0.
If I’m right, the CCCL PTX example may have the same issue.
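To make the ordering I’m worried about concrete, here is a condensed single-thread-per-block sketch of the pattern (loosely modeled on the CCCL st_async docs example; I’m writing the cuda::ptx calls from memory, so treat the exact signatures as approximate rather than authoritative):

```cpp
// Condensed sketch of the slide-127 pattern; cuda::ptx signatures from memory.
// Assumes a cluster of 2 blocks with 1 thread per block.
#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cuda/ptx>

__global__ void __cluster_dims__(2, 1, 1) kernel()
{
  namespace ptx = cuda::ptx;
  namespace cg  = cooperative_groups;
  cg::cluster_group cluster = cg::this_cluster();

  __shared__ int recv_buf[4];                       // 16 bytes to receive
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;
  init(&bar, 1);
  cluster.sync();  // guarantees every block's bar is initialized, nothing more

  // Map the neighboring block's barrier / buffer into this block's view.
  unsigned other = cluster.block_rank() ^ 1;
  auto* bar_self = cuda::device::barrier_native_handle(bar);
  auto* bar_next = cluster.map_shared_rank(bar_self, other);
  int*  buf_next = cluster.map_shared_rank(&recv_buf[0], other);

  // (A) Tell MY barrier to expect sizeof(recv_buf) incoming bytes
  //     (tx_count += 16 here; the += 16384 on the slide).
  auto token = ptx::mbarrier_arrive_expect_tx(
      ptx::sem_release, ptx::scope_cluster, ptx::space_shared,
      bar_self, sizeof(recv_buf));

  // (B) Store into the NEIGHBOR's buffer. When the store completes, the
  //     hardware applies tx_count -= 16 to the NEIGHBOR's barrier (bar_next),
  //     not to mine. Nothing I can see orders this completion after the
  //     neighbor has executed its own (A), so the neighbor's tx_count could
  //     briefly go negative.
  int vals[4] = {1, 2, 3, 4};
  ptx::st_async(buf_next, vals, bar_next);

  // (C) Wait for my own barrier phase: my expect_tx plus the neighbor's
  //     st_async into MY buffer must both have been counted.
  while (!ptx::mbarrier_try_wait(ptx::sem_acquire, ptx::scope_cluster,
                                 bar_self, token)) {}
}
```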
This question is not critical or time-sensitive for me, but it seemed weird enough to be worth sharing.