Can kernels in one stream signal availability of data to kernels in a different stream without using events

I program in C++ using CUDA 12.2 and Windows.

I know that the Programming Guide says, “…inter-kernel communication is undefined…”,
but I also know that kernels communicate data to other kernels through global memory all the time, with proper synchronization.

I have been puzzling over this:

Buffer is a large array in global memory.
Flag is an integer in global memory, initially 0.
k1 through k4 (kn) are kernels.

In stream 1: k1[writes to Buffer] k2[sets Flag]
In stream 2: k3[sees set Flag, so it tail-launches k4] k4[reads from Buffer]

  • k1 does not use any thread fences or atomic operations.
  • k2 uses an atomic operation to set Flag.
  • k3 uses volatile reads to read or poll Flag.
  • k4’s reads are neither volatile nor atomic.
  • The grids for k1 and k4 may have many blocks.
  • The grids for k2 and k3 have one block.
  • The host may launch k3 periodically, and k3 may simply exit if it does not see Flag set.
  • The host never launches k4.

I am interested in whether k4, if it is launched, will read only valid data from Buffer.

I.e., does CUDA guarantee that k4 reads the data written by k1 in the following sequence:
k1 writes a batch of data to Buffer
k1 ends
k2 starts
k2 sets Flag to 1
k3 observes that Flag is 1, so it tail-launches k4
k3 ends
k4 starts
k4 reads the batch of data from Buffer
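
For concreteness, here is a minimal sketch of the structure I have in mind. This is not my actual test program; the names, sizes, and launch configurations are illustrative, and it assumes dynamic parallelism (compiled with -rdc=true) so that k3 can tail-launch k4 via the CUDA 12 cudaStreamTailLaunch stream:

    // Sketch only: Buffer, Flag, and N are placeholder names/sizes.
    constexpr int N = 1 << 20;

    __device__ int   Flag;        // zero-initialized, i.e., initially 0
    __device__ float Buffer[N];

    // k1: many blocks, plain (non-volatile, non-atomic) writes to Buffer
    __global__ void k1()
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) Buffer[i] = static_cast<float>(i);
    }

    // k2: one block, sets Flag with an atomic operation
    __global__ void k2()
    {
        if (threadIdx.x == 0) atomicExch(&Flag, 1);
    }

    __global__ void k4();         // forward declaration for the tail launch

    // k3: one block, reads Flag through a volatile pointer; if it sees Flag
    // set, it tail-launches k4 (k4 starts only after k3's grid completes)
    __global__ void k3()
    {
        if (threadIdx.x == 0) {
            volatile int *f = &Flag;
            if (*f == 1) {
                k4<<<N / 256, 256, 0, cudaStreamTailLaunch>>>();
            }
        }
    }

    // k4: many blocks, plain reads from Buffer
    __global__ void k4()
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) {
            float v = Buffer[i];  // is this guaranteed to be k1's data?
            (void)v;
        }
    }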

This seems like a useful pattern, but I cannot find anything in the Programming Guide that suggests it may be used reliably.

I saw the following, but I lack the mathematical background necessary to apply it:
memory consistency model

I wrote a program that implements the above sequence. It tests the sequence endlessly,
using tail launches seeded with a single host launch into each stream. Buffer is managed like a
FIFO, with code included to prevent overflow by k1/k2.
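
Roughly, the host side of a single pass looks like the following simplified sketch. It is hypothetical: the FIFO management, overflow prevention, and the endless tail-launch re-seeding are omitted, and the kernel names and N refer to the sketch above.

    #include <cuda_runtime.h>

    int main()
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Stream 1: k1 fills Buffer, then k2 sets Flag.
        k1<<<N / 256, 256, 0, s1>>>();
        k2<<<1, 32, 0, s1>>>();

        // Stream 2: k3 polls Flag; if it sees Flag set, it tail-launches k4,
        // otherwise it simply exits.
        k3<<<1, 32, 0, s2>>>();

        cudaDeviceSynchronize();
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        return 0;
    }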

The program runs on my GTX 1070 Ti without detecting any errors (except when I deliberately have k1 introduce them). This is just FYI. I realize that this success may be limited to my hardware and the program I wrote.


It is a commonly used paradigm: a kernel is launched, writes data to a global buffer, and then exits. Later, another kernel is launched and reads the data from that buffer.

I’m not aware of any hazards in that pattern.
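
For reference, a minimal sketch of that paradigm, with illustrative names; here both kernels are launched into the same stream, so the consumer cannot begin until the producer has finished:

    #include <cuda_runtime.h>

    __global__ void producer(float *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = 2.0f * i;
    }

    __global__ void consumer(const float *buf, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = buf[i] + 1.0f;   // reads what producer wrote
    }

    int main()
    {
        const int n = 1 << 20;
        float *buf, *out;
        cudaMalloc(&buf, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));

        producer<<<(n + 255) / 256, 256>>>(buf, n);
        consumer<<<(n + 255) / 256, 256>>>(buf, out, n);  // ordered after producer

        cudaDeviceSynchronize();
        cudaFree(buf);
        cudaFree(out);
        return 0;
    }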

Thanks for replying.

“I’m not aware of any hazards in that pattern.”

Ok. I may be overthinking the issue.

I was worried that k4 might see stale data from L1 rather than the data written by k1, just as k3 might see stale data were it to read Buffer.

Every kernel exit updates L2?
Every kernel start gives the kernel a fresh look at L2?

?

L2 is a device-wide proxy for the device portion of the global space. I’m not aware of any device code activity that touches device global space that doesn’t go through the L2.

The general topic of communication and visibility of data between various concurrent activities in CUDA is complex and involved, so I am intentionally steering toward simple statements that are easier for me to support, as opposed to a wide-ranging tutorial. If you search on common keywords such as “volatile” you will find many such discussions here on this sub-forum.

I cannot think of issues with “stale” cache content in a system with an MMU where the caches are coherent with global memory. In other words, back to back kernel launches on a GPU work no differently in regards to this aspect than context switches on the CPU of the host system: explicit cache invalidation is not needed.

The situation is different for incoherent caches, such as the texture caches in older GPUs (not sure how that works in recent hardware). Those had to be invalidated as part of a kernel launch, and the CUDA runtime took care of that.


?

I was unclear. I meant, in rough words:

  • When each kernel ends, any L1 the kernel wrote to is flushed to L2?
  • As part of each kernel start, the L1 it will use is invalidated, so the initial read of a global location goes all the way to L2 (and beyond, if necessary)?

I realize this might happen on a per-block rather than a per-kernel basis.

The L1 is typically described as a “write-through” cache, not a “write-back” cache. There is nothing to flush. We can do a simple thought experiment: If this were not the case, then even a simple cudaMemcpy after a kernel finishes could get “stale” data.
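
A minimal sketch of that thought experiment (hypothetical names): the copy returns exactly what the kernel wrote, with no explicit cache maintenance anywhere in the program.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void fill(int *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = i;
    }

    int main()
    {
        const int n = 1 << 20;
        int *d_buf;
        cudaMalloc(&d_buf, n * sizeof(int));

        fill<<<(n + 255) / 256, 256>>>(d_buf, n);

        // The copy is ordered after the kernel on the default stream and
        // sees the kernel's writes; no cache flush is issued by the program.
        int *h_buf = new int[n];
        cudaMemcpy(h_buf, d_buf, n * sizeof(int), cudaMemcpyDeviceToHost);
        printf("h_buf[12345] = %d\n", h_buf[12345]);   // 12345, not stale data

        delete[] h_buf;
        cudaFree(d_buf);
        return 0;
    }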

Rather than approaching it from this perspective, my expectation is that when a kernel starts, and reads data from the global space, it will get a proper view of the global space.

Thanks, I did not know that.
