Can kernels in one stream signal availability of data to kernels in a different stream without using events?

The L1 is typically described as a “write-through” cache, not a “write-back” cache, so there is nothing to flush. We can do a simple thought experiment: if the L1 did hold modified data that had to be flushed explicitly, then even a simple cudaMemcpy after a kernel finishes could get “stale” data.
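The thought experiment can be written down as a small test. This is only a sketch under my own assumptions: the kernel name `fill_indices` and the helper `roundtrip_ok` are made up for illustration, and everything runs on the default stream. The point is that nothing between the kernel launch and the cudaMemcpy flushes a cache, and the host should still see every value the kernel wrote.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its own global index.
__global__ void fill_indices(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Returns true if the host copy contains exactly what the kernel wrote.
bool roundtrip_ok(int n) {
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fill_indices<<<blocks, threads>>>(d_out, n);

    // No explicit flush here. The blocking cudaMemcpy is ordered after the
    // kernel on the default stream and then copies the results back.
    std::vector<int> h_out(n, -1);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != i) return false;
    return true;
}

int main() {
    bool ok = roundtrip_ok(1 << 20);
    printf("%s\n", ok ? "host sees kernel writes" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

If a flush were required after the kernel, code like this would be broken everywhere, and it is about the most common pattern in CUDA.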

Rather than approaching it from the perspective of flushing caches, my expectation is that when a kernel starts and reads data from the global space, it will get a proper view of the global space.
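Here is a minimal sketch of that expectation, again with made-up names (`producer`, `consumer`, `count_mismatches`). I am assuming the ordering between the two streams is established from the host with cudaStreamSynchronize, so no events are involved. Note what this does not show: it does not cover signaling between two kernels that are running at the same time. It only shows a kernel that starts after the producer has finished and reads what the producer wrote.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical producer: writes a known pattern to global memory.
__global__ void producer(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2 * i;
}

// Hypothetical consumer: counts elements that do not match the pattern.
__global__ void consumer(const int *data, int *mismatches, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] != 2 * i) atomicAdd(mismatches, 1);
}

// Returns the number of stale/incorrect elements seen by the consumer,
// or -1 if any CUDA call reported an error.
int count_mismatches(int n) {
    int *d_data = nullptr, *d_bad = nullptr;
    cudaStream_t a, b;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_bad, sizeof(int));
    cudaMemset(d_bad, 0, sizeof(int));
    cudaStreamCreate(&a);
    cudaStreamCreate(&b);
    cudaDeviceSynchronize();  // make sure setup is complete

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    producer<<<blocks, threads, 0, a>>>(d_data, n);
    cudaStreamSynchronize(a);  // host-side ordering, no cudaEvent used
    consumer<<<blocks, threads, 0, b>>>(d_data, d_bad, n);
    cudaStreamSynchronize(b);

    int bad = -1;
    cudaMemcpy(&bad, d_bad, sizeof(int), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaGetLastError();

    cudaStreamDestroy(a);
    cudaStreamDestroy(b);
    cudaFree(d_data);
    cudaFree(d_bad);
    return (err == cudaSuccess) ? bad : -1;
}

int main() {
    int bad = count_mismatches(1 << 20);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

I would expect the mismatch count to be zero: the consumer is launched only after the producer has completed, so its reads from global memory should see the producer's writes even though the two kernels were issued into different streams.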