I program in C++ using CUDA 12.2 and Windows.
I know that the programming guide says, “…inter-kernel communication is undefined…”,
but I also know that kernels communicate data to other kernels through global memory all the time, with proper synchronization.
I have been puzzling over this:
Buffer is a large array in global memory.
Flag is an integer in global memory, initially 0.
kn are kernels.
In stream 1: k1[writes to Buffer] k2[sets Flag]
In stream 2: k3[sees set Flag, so it tail-launches k4] k4[reads from Buffer]
k1 does not use any thread fence or atomic operations.
k2 uses an atomic operation to set Flag.
k3 uses volatile reads to read or poll Flag.
k4 reads are not volatile or atomic reads.
The grids for k1 and k4 may have many blocks.
The grids for k2 and k3 have one block.
The host may launch k3 periodically, and k3 may simply exit if it does not see Flag set.
The host never launches k4.
I am interested in whether k4, if it is launched, will read only valid data from Buffer.
I.e., does CUDA guarantee that k4 reads the data written by k1 in the following sequence:
k1 writes a batch of data to Buffer
k1 ends
k2 starts
k2 sets Flag to 1
k3 observes that Flag is 1, so it tail-launches k4
k3 ends
k4 starts
k4 reads the batch of data from Buffer
This seems like a useful pattern, but I cannot find anything in the Programming Guide that suggests it may be used reliably.
I saw the following, but I lack the mathematical mind necessary to apply it:
memory consistency model
I wrote a program that implements the above sequence. It tests the sequence endlessly,
using tail launches seeded with a single host launch into each stream. Buffer is managed like a
FIFO, with code included to prevent overflow by k1/k2.
The program runs on my GTX 1070 Ti without detecting any errors (except when I deliberately have k1 introduce them). This is just FYI. I realize that this success may be limited to my hardware and the program I wrote.