It has been my understanding for the longest time that:
a. Thread blocks do NOT necessarily get executed in blockIdx.x order.
b. A thread block, once dispatched, will run to completion without preemption.
I have two recent observations that made me revisit these principles.
i. " The shape of a grid (1-D or 2-D) influences the order in which thread blocks are picked. For 1-D grids, thread blocks are picked in increasing order of thread block ID. For 2-D grids, thread blocks are picked in a pattern resembling a space-filling Hilbert curve aimed perhaps at preserving 2-D locality." from the paper titled How the Fermi Thread Block Scheduler Works (Illustrated).
ii. This implementation of inclusive scan. It is based on Merrill & Garland’s paper, but instead of depending on a global atomic counter for block identification and ordering as the paper prescribes, it uses the blockIdx.x built-in. Yet the code runs without errors, and the memcheck, racecheck, synccheck, and initcheck tools of compute-sanitizer issue no warnings or errors.
In light of these, I am suspecting that I might have missed something and would appreciate your input.
CUDA specifies no block dispatch order. The Merrill/Garland paper makes reference to this in section 4.4:
Virtual processors. Many programming models such as CUDA employ an abstraction of virtual processors that are dynamically scheduled on the hardware’s physical processors. Each processor is given a numeric rank by which it can index its corresponding input partition. However, the runtime that schedules virtual processors may run them in arbitrary order and without preemption.
(emphasis added)
That entire section is worth reading for the topic here.
Depending on unspecified behavior might happen to work, but it is not a reliable programming method.
Thank you for your reply. That has been my understanding all along. The empirical evidence, but more importantly the article on Fermi’s scheduler from U. of Rochester, made me question it.
If something is unspecified, that does not mean the hardware must behave in an entirely unpredictable way. Given sufficient a priori design knowledge, or else reverse engineering or analysis, it’s possible that someone could identify deterministic behavior in a particular implementation of the CUDA programming model. (I haven’t read the UofR paper and don’t know what to make of it. However, it would be surprising to me if an external entity could provide exhaustive enough coverage to prove such a thing even for a single instance/implementation/architecture.)
But without a specification or guarantee, that knowledge, to my way of thinking, is not particularly valuable and does not constitute “useful” information that can be relied on from the programmer’s point of view. Creating a demonstrator that “seems to work” doesn’t change my view.
Those are just my own opinions. Anyone is of course welcome to program in any manner they see fit.
If you always run on the same architecture (your own system, or an embedded system with fixed HW+SW), and a performance degradation is not critical, then it may be fine to rely on an observed behavior, e.g. for batch processing.
But you should not rely on it for correctness, or where minimum performance is critical (e.g. safety / real-time).
(But I am just a user; orient yourself on Robert’s warning.)
Fun fact: The decoupled look-back algorithm, which is used in CCCL, implicitly makes assumptions about block ordering, with the authors saying “This is not strictly safe within the CUDA programming model, but we take advantage of the fact that our current HW behaves this way.”
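To make the implicit assumption concrete, here is a heavily simplified sketch (not the CCCL source; names like `tile_status`, `tile_value`, and `look_back` are hypothetical, and memory fences/cache hints are omitted) of the look-back phase. Each block spins on its predecessors' status flags, so it assumes every lower-numbered tile either has already run or can make forward progress concurrently:

```cuda
#define NUM_TILES 1024

// 0 = not yet published, 1 = partial aggregate available, 2 = inclusive prefix available
__device__ int   tile_status[NUM_TILES];
__device__ float tile_value[NUM_TILES];

// Compute the exclusive prefix for `tile` by scanning backwards over
// predecessors. If the scheduler could run this block to completion while a
// predecessor had never been dispatched (and no SM were free for it), the
// inner loop would spin forever -- the implicit ordering/progress assumption.
__device__ float look_back(int tile)
{
    float exclusive = 0.0f;
    for (int pred = tile - 1; pred >= 0; --pred) {
        int status;
        do {
            status = atomicAdd(&tile_status[pred], 0);  // atomic read of the flag
        } while (status == 0);                          // spin until predecessor publishes
        exclusive += tile_value[pred];
        if (status == 2)
            break;  // predecessor already folded in all earlier tiles
    }
    return exclusive;
}
```

The "current HW behaves this way" remark refers to the fact that on shipping GPUs, blocks are dispatched roughly in ID order, so a spinning block's predecessors are always either resident or already finished.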
Another fun fact:
I changed the code I was referring to so that it uses a global atomic for block ordering and identification instead of blockIdx.x, and observed a slight performance improvement!
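For reference, the atomic-counter technique from the Merrill & Garland paper looks roughly like this (a minimal sketch, not the actual scan code; `g_counter` and the kernel body are illustrative). Each block draws its rank from a global counter at launch, so ranks are assigned in actual dispatch order rather than by blockIdx.x:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int g_counter = 0;  // global block-ID counter (zero-initialized)

__global__ void ranked_kernel(unsigned int* order, unsigned int n)
{
    __shared__ unsigned int block_id;
    if (threadIdx.x == 0)
        block_id = atomicAdd(&g_counter, 1);  // rank = order of actual dispatch
    __syncthreads();

    // All threads in the block now index their input partition with
    // block_id instead of blockIdx.x.
    if (threadIdx.x == 0 && block_id < n)
        order[block_id] = blockIdx.x;  // record which hardware block got which rank
}

int main()
{
    const unsigned int n = 256;
    unsigned int* d_order;
    cudaMalloc(&d_order, n * sizeof(unsigned int));
    ranked_kernel<<<n, 128>>>(d_order, n);
    cudaDeviceSynchronize();

    unsigned int h_order[n];
    cudaMemcpy(h_order, d_order, sizeof(h_order), cudaMemcpyDeviceToHost);
    printf("rank 0 was taken by blockIdx.x = %u\n", h_order[0]);
    cudaFree(d_order);
    return 0;
}
```

With this scheme a block that observes rank k knows that k blocks were dispatched before it, which is the guarantee the look-back chain actually needs; blockIdx.x only provides that guarantee if the hardware happens to dispatch in ID order.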