This topic is a partial duplicate of “Persistant L2 API restrict in stream, what if in other stream?”, but I’m trying to flesh things out a bit more. An answer to this topic should satisfy the other guy too, I hope.
The L2 persistence documentation is ambiguous to me.
Here are my competing mental models. Both seem consistent with the programming guide.
Model 0 (unlikely, but I can’t be sure it’s wrong):
cudaStreamSetAttribute(streamA, ...sets persistence for region R)
somehow establishes that accesses to R by kernels launched in streamA obey the persistence settings, but accesses to R from kernels launched in other streams do not. I don’t see how this is possible: when a thread from some kernel accesses an address, I don’t see how L2 can know which user-facing stream launched that kernel (especially since user-facing streams are a programming-model abstraction).
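(For reference, here’s roughly what I’m abbreviating as “...sets persistence for region R” throughout, based on my reading of the access policy window API; devPtr and numBytes are placeholders for my actual region:)
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = devPtr;    // start of region R (placeholder)
attr.accessPolicyWindow.num_bytes = numBytes;  // size of region R (placeholder)
attr.accessPolicyWindow.hitRatio  = 1.0f;      // treat every access in the window as a candidate for persisting
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(streamA, cudaStreamAttributeAccessPolicyWindow, &attr);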
Model 1 (more likely, but again I’m not sure):
The “stream” argument to cudaStreamSetAttribute just determines when, in the usual asynchronous stream order, the persistence settings for a region get communicated to the L2 hardware; it doesn’t restrict which kernels those settings apply to. For example, if I call
cudaStreamSetAttribute(streamA, ...sets persistence for region R);
kernelA<<<grid, block, 0, streamA>>>(...);
kernelB<<<grid, block, 0, streamB>>>(...);
I’m guaranteed kernelA’s accesses to R will obey the persistence settings, because it’s in streamA.
If kernelB also accesses R, those accesses might obey the same settings if kernelB happens to run after, or around the same time as, kernelA (i.e., after the settings are communicated to L2 “in streamA”).
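A sketch of what I think Model 1 implies (same placeholder style as above): if I actually wanted a guarantee that kernelB’s accesses to R obey the settings, I’d have to order streamB after the point where streamA communicates them to L2, e.g. with an event:
cudaEvent_t settingsApplied;
cudaEventCreate(&settingsApplied);
cudaStreamSetAttribute(streamA, cudaStreamAttributeAccessPolicyWindow, &attr);  // ...sets persistence for region R
cudaEventRecord(settingsApplied, streamA);
kernelA<<<grid, block, 0, streamA>>>(...);
cudaStreamWaitEvent(streamB, settingsApplied, 0);  // streamB now runs after the L2 settings are in place
kernelB<<<grid, block, 0, streamB>>>(...);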
In this model, the following
cudaStreamSetAttribute(streamA, ...sets "persistenceA" for region R);
cudaStreamSetAttribute(streamB, ...sets "persistenceB" for region R);
kernelA<<<grid, block, 0, streamA>>>(...);
kernelB<<<grid, block, 0, streamB>>>(...);
incurs a “race condition” on the persistence status of R, and the behavior of each kernel’s accesses to R depends on when it runs relative to when persistenceA and persistenceB get communicated to L2. The L2 effects of the SetAttribute call and the kernel launch in streamA may slide past their counterparts in streamB, and the overall sequence may “happen” on the device in any order allowed by the (independent) orderings of streamA and streamB.
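Under Model 1, the only way I can see to make that deterministic is to serialize the two halves, something like (sketch, same placeholders):
cudaStreamSetAttribute(streamA, cudaStreamAttributeAccessPolicyWindow, &attrA);  // "persistenceA" for R
kernelA<<<grid, block, 0, streamA>>>(...);
cudaDeviceSynchronize();  // streamA's settings and kernelA are fully done before streamB's half starts
cudaStreamSetAttribute(streamB, cudaStreamAttributeAccessPolicyWindow, &attrB);  // "persistenceB" for R
kernelB<<<grid, block, 0, streamB>>>(...);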
A further consequence of this model: as seen by kernels in any stream, there’s only ever one active persistence behavior for a particular address range. For example, if I call
cudaStreamSetAttribute(streamA, ...sets "persistenceA" for region RA);
cudaStreamSetAttribute(streamB, ...sets "persistenceB" for region RB that overlaps with RA);
kernelA<<<grid, block, 0, streamA>>>(...);
kernelB<<<grid, block, 0, streamB>>>(...);
the behavior for the overlapping region gets overwritten by whichever SetAttribute call happens last on the device, and accesses to the overlapping region by either kernel obey whichever behavior was set most recently. It’s not true that kernelA’s accesses (by virtue of being in streamA) always obey persistenceA, and kernelB’s accesses (by virtue of being in streamB) always obey persistenceB.
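If Model 1 is right, I assume the clean way to hand the overlapping range from one behavior to the other is to explicitly tear down the old setting first. From what I can tell these are the reset mechanisms the guide offers (sketch, reusing the attr struct from above):
cudaStreamSynchronize(streamA);  // kernelA is done with RA
attr.accessPolicyWindow.hitProp = cudaAccessPropertyNormal;  // stop marking RA's window as persisting
cudaStreamSetAttribute(streamA, cudaStreamAttributeAccessPolicyWindow, &attr);
cudaCtxResetPersistingL2Cache();  // evict lines already marked persisting in L2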
Which of these models (if either) is correct? Model 1 makes more sense to me.
(An implicit assumption of Model 1 is that all streams transparently share the L2 cache. In other words, if some cache line was pulled into L2 by a kernel in streamA, a later access to the same line from a kernel in streamB can hit it, and won’t be assigned a separate, redundant line in L2. I’m almost certain this is true, but please tell me if it’s not!)
Another point that’s unclear from the docs under either model: is there a limit on the number of address regions for which you can simultaneously set persistence properties?
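Related data point: the device properties expose a couple of size limits, though not, as far as I can tell, a count of simultaneously active regions:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("persistingL2CacheMaxSize  = %d bytes\n", prop.persistingL2CacheMaxSize);   // max L2 set-aside for persisting accesses
printf("accessPolicyMaxWindowSize = %d bytes\n", prop.accessPolicyMaxWindowSize);  // max num_bytes for a single access policy window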