Can kernels in one stream signal availability of data to kernels in a different stream without using events

I program in C++ using CUDA 12.2 and Windows.

I know that the Programming Guide says, “…inter-kernel communication is undefined…”,
but I also know that kernels communicate data to other kernels through global memory all the time, with proper synchronization.

I have been puzzling over this:

Buffer is a large array in global memory.
Flag is an integer in global memory, initially 0.
k1 through k4 (kn) are kernels.

In stream 1: k1[writes to Buffer] k2[sets Flag]
In stream 2: k3[sees set Flag, so it tail-launches k4] k4[reads from Buffer]

  • k1 does not use any thread fences or atomic operations.
  • k2 uses an atomic operation to set Flag.
  • k3 uses volatile reads to read or poll Flag.
  • k4’s reads are neither volatile nor atomic.
  • The grids for k1 and k4 may have many blocks.
  • The grids for k2 and k3 have one block.
  • The host may launch k3 periodically, and k3 may simply exit if it does not see Flag set.
  • The host never launches k4.

I am interested in whether k4, if it is launched, will read only valid data from Buffer.

I.e., does CUDA guarantee that k4 reads the data written by k1 in the following sequence:
k1 writes a batch of data to Buffer
k1 ends
k2 starts
k2 sets Flag to 1
k3 observes that Flag is 1, so it tail-launches k4
k3 ends
k4 starts
k4 reads the batch of data from Buffer
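
For concreteness, here is a minimal sketch of the structure I have in mind. This is not my actual test program; the names, sizes, and launch configurations are illustrative, and it assumes dynamic parallelism (compiled with -rdc=true) so that k3 can tail-launch k4 via the CUDA 12 cudaStreamTailLaunch stream:

    // Sketch only: Buffer, Flag, and N are placeholder names/sizes.
    constexpr int N = 1 << 20;

    __device__ int   Flag;        // zero-initialized, i.e., initially 0
    __device__ float Buffer[N];

    // k1: many blocks, plain (non-volatile, non-atomic) writes to Buffer
    __global__ void k1()
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) Buffer[i] = static_cast<float>(i);
    }

    // k2: one block, sets Flag with an atomic operation
    __global__ void k2()
    {
        if (threadIdx.x == 0) atomicExch(&Flag, 1);
    }

    __global__ void k4();         // forward declaration for the tail launch

    // k3: one block, reads Flag through a volatile pointer; if it sees Flag
    // set, it tail-launches k4 (k4 starts only after k3's grid completes)
    __global__ void k3()
    {
        if (threadIdx.x == 0) {
            volatile int *f = &Flag;
            if (*f == 1) {
                k4<<<N / 256, 256, 0, cudaStreamTailLaunch>>>();
            }
        }
    }

    // k4: many blocks, plain reads from Buffer
    __global__ void k4()
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) {
            float v = Buffer[i];  // is this guaranteed to be k1's data?
            (void)v;
        }
    }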

This seems like a useful pattern, but I cannot find anything in the Programming Guide that suggests it may be used reliably.

I saw the following, but I lack the mathematical background necessary to apply it:
memory consistency model

I wrote a program that implements the above sequence. It tests the sequence endlessly,
using tail launches seeded with a single host launch into each stream. Buffer is managed like a
FIFO, with code included to prevent overflow by k1/k2.
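
Roughly, the host side of a single pass looks like the following simplified sketch. It is hypothetical: the FIFO management, overflow prevention, and the endless tail-launch re-seeding are omitted, and the kernel names and N refer to the sketch above.

    #include <cuda_runtime.h>

    int main()
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Stream 1: k1 fills Buffer, then k2 sets Flag.
        k1<<<N / 256, 256, 0, s1>>>();
        k2<<<1, 32, 0, s1>>>();

        // Stream 2: k3 polls Flag; if it sees Flag set, it tail-launches k4,
        // otherwise it simply exits.
        k3<<<1, 32, 0, s2>>>();

        cudaDeviceSynchronize();
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        return 0;
    }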

The program runs on my GTX 1070 Ti without detecting any errors (except when I deliberately have k1 introduce them). This is just FYI. I realize that this success may be limited to my hardware and the program I wrote.


It is a commonly used paradigm: a kernel is launched, writes data to a global buffer, and then exits. Later, another kernel is launched and reads the data from that buffer.

I’m not aware of any hazards in that pattern.
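
For reference, a minimal sketch of that paradigm, with illustrative names; here both kernels are launched into the same stream, so the consumer cannot begin until the producer has finished:

    #include <cuda_runtime.h>

    __global__ void producer(float *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = 2.0f * i;
    }

    __global__ void consumer(const float *buf, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = buf[i] + 1.0f;   // reads what producer wrote
    }

    int main()
    {
        const int n = 1 << 20;
        float *buf, *out;
        cudaMalloc(&buf, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));

        producer<<<(n + 255) / 256, 256>>>(buf, n);
        consumer<<<(n + 255) / 256, 256>>>(buf, out, n);  // ordered after producer

        cudaDeviceSynchronize();
        cudaFree(buf);
        cudaFree(out);
        return 0;
    }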

Thanks for replying.

“I’m not aware of any hazards in that pattern.”

Ok. I may be overthinking the issue.

I was worried that k4 might see stale data from L1 rather than the data written by k1, just as k3 might see stale data were it to read Buffer.

Every kernel exit updates L2?
Every kernel start gives the kernel a fresh look at L2?

?

L2 is a device-wide proxy for the device portion of the global space. I’m not aware of any device code activity that touches device global space that doesn’t go through the L2.

The general topic of communication and visibility of data between various concurrent activities in CUDA is complex and involved, so I am intentionally steering toward simple statements that are easier for me to support, as opposed to a wide-ranging tutorial. If you search on common keywords such as “volatile” you will find many such discussions here on this sub-forum.

I cannot think of issues with “stale” cache content in a system with an MMU where the caches are coherent with global memory. In other words, back to back kernel launches on a GPU work no differently in regards to this aspect than context switches on the CPU of the host system: explicit cache invalidation is not needed.

The situation is different for incoherent caches, such as the texture caches in older GPUs (not sure how that works in recent hardware). Those had to be invalidated as part of a kernel launch, and the CUDA runtime took care of that.


?

I was unclear. I meant, in rough words:

  • When each kernel ends, any L1 the kernel wrote to is flushed to L2?
  • As part of each kernel start, the L1 it will use is invalidated, so the initial read of a global location goes all the way to L2 (and beyond, if necessary)?

I realize this might happen on a per-block rather than a per-kernel basis.

The L1 is typically described as a “write-through” cache, not a “write-back” cache. There is nothing to flush. We can do a simple thought experiment: If this were not the case, then even a simple cudaMemcpy after a kernel finishes could get “stale” data.
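
A minimal sketch of that thought experiment (hypothetical names): the copy returns exactly what the kernel wrote, with no explicit cache maintenance anywhere in the program.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void fill(int *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = i;
    }

    int main()
    {
        const int n = 1 << 20;
        int *d_buf;
        cudaMalloc(&d_buf, n * sizeof(int));

        fill<<<(n + 255) / 256, 256>>>(d_buf, n);

        // The copy is ordered after the kernel on the default stream and
        // sees the kernel's writes; no cache flush is issued by the program.
        int *h_buf = new int[n];
        cudaMemcpy(h_buf, d_buf, n * sizeof(int), cudaMemcpyDeviceToHost);
        printf("h_buf[12345] = %d\n", h_buf[12345]);   // 12345, not stale data

        delete[] h_buf;
        cudaFree(d_buf);
        return 0;
    }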

Rather than approaching it from this perspective, my expectation is that when a kernel starts, and reads data from the global space, it will get a proper view of the global space.

Thanks, I did not know that.
