L2 persistence clarifications

This topic partially duplicates Persistant L2 API restrict in stream, what if in other stream?, but I’m trying to flesh things out a bit more. An answer to this topic should cover that poster’s question as well, I hope.

The L2 persistence documentation is ambiguous to me.

Here are my competing mental models. Both seem consistent with the programming guide.

Model 0 (unlikely, but I can’t be sure it’s wrong):
cudaStreamSetAttribute(streamA, ...sets persistence for region R) somehow establishes that accesses to R by kernels launched into streamA obey the persistence settings, while accesses to R by kernels in other streams do not. I don’t see how this is possible. If a thread from some kernel accesses some address, I don’t see how L2 can know which user-facing stream launched that kernel (especially since user-facing streams are a programming-model abstraction).

Model 1 (more likely, but again I’m not sure):
The “stream” argument to cudaStreamSetAttribute is just a hint that establishes the usual (asynchronous) stream ordering with which the persistence setting for a region is communicated to the L2 hardware. For example, if I call

cudaStreamSetAttribute(streamA, ...sets persistence for region R);
kernelA<<<grid, block, 0, streamA>>>(...);
kernelB<<<grid, block, 0, streamB>>>(...);

I’m guaranteed that kernelA’s accesses to R will obey the persistence settings, because it’s in streamA.
If kernelB also accesses R, those accesses might obey the same settings if kernelB happens to run after, or around the same time as, kernelA (i.e., after the settings have been communicated to L2 “in streamA”).
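(For concreteness, here is roughly what I mean by “...sets persistence for region R”; a minimal sketch, where d_R, numBytesR, and carveoutBytes are placeholder names I’m making up for this post:)

// Minimal sketch of the call I'm abbreviating as "...sets persistence for region R".
// d_R, numBytesR, and carveoutBytes are placeholders.
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carveoutBytes);   // reserve part of L2 for persisting accesses

cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = d_R;                             // start of region R (device pointer)
attr.accessPolicyWindow.num_bytes = numBytesR;                       // size of region R in bytes
attr.accessPolicyWindow.hitRatio  = 1.0f;                            // fraction of accesses in the window given hitProp
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(streamA, cudaStreamAttributeAccessPolicyWindow, &attr);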
In this model, the following

cudaStreamSetAttribute(streamA, ...sets "persistenceA" for region R);
cudaStreamSetAttribute(streamB, ...sets "persistenceB" for region R);
kernelA<<<grid, block, 0, streamA>>>(...);
kernelB<<<grid, block, 0, streamB>>>(...);

incurs a “race condition” on the persistence status of R, and the behavior of accesses to R by each kernel depends on when they run relative to when persistenceA and persistenceB get communicated to L2. The L2 effects of the SetAttribute call and the kernel execution in streamA may slide past their counterparts in streamB, and the overall sequence may “happen” on the device in any order allowed by the (independent) orderings of streamA and streamB.
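(If that’s right, I assume the “race” can be avoided by explicitly ordering the two phases; a sketch of what I mean, keeping the same schematic notation as above:)

cudaStreamSetAttribute(streamA, ...sets "persistenceA" for region R);
kernelA<<<grid, block, 0, streamA>>>(...);
cudaStreamSynchronize(streamA);   // streamA's setting and kernelA's accesses to R are done before streamB's begin

cudaStreamSetAttribute(streamB, ...sets "persistenceB" for region R);
kernelB<<<grid, block, 0, streamB>>>(...);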
A further consequence of this model is: as seen by kernels in any stream, there’s only ever one active persistence behavior for a particular address range. For example, if I call

cudaStreamSetAttribute(streamA, ...sets "persistenceA" for region RA);
cudaStreamSetAttribute(streamB, ...sets "persistenceB" for region RB that overlaps with RA);
kernelA<<<grid, block, 0, streamA>>>(...);
kernelB<<<grid, block, 0, streamB>>>(...);

the behavior for the overlapping region gets overwritten by whichever SetAttribute call happens last on the device, and accesses to the overlapping region by either kernel obey whichever behavior was set most recently. It’s not true that kernelA’s accesses (by virtue of being in streamA) always obey persistenceA, and kernelB’s accesses (by virtue of being in streamB) always obey persistenceB.
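(Related: my reading of the programming guide is that the way to get back to a clean slate is to disable a stream’s access policy window and reset persisting lines, roughly like this; another sketch, and part of what I’d like confirmed:)

// Sketch: disable streamA's access policy window and demote any persisting lines.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.num_bytes = 0;                               // a zero-sized window disables the policy for this stream
cudaStreamSetAttribute(streamA, cudaStreamAttributeAccessPolicyWindow, &attr);
cudaCtxResetPersistingL2Cache();                                     // reset all persisting L2 cache lines to normal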

Which of these models (if either) is correct? Model 1 makes more sense to me.
(An implicit assumption of Model 1 is that all streams transparently share the L2 cache. In other words, if some cache line was pulled into L2 by a kernel in streamA, a later access to the same line from a kernel in streamB can hit, and won’t be assigned a separate redundant line in L2. But I’m almost certain this is true. Please tell me if it’s not!)

Another point that’s unclear from the docs for either model: Is there a limit on the number of address regions for which you can simultaneously set persistence properties?
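(The only related limits I can find to query programmatically are sizes, not a count of regions; a sketch of the queries I mean, in case they are relevant:)

// Sketch: the limits I know how to query (none of them is a "number of regions").
int dev = 0, l2Bytes = 0, maxPersistBytes = 0, maxWindowBytes = 0;
cudaDeviceGetAttribute(&l2Bytes,         cudaDevAttrL2CacheSize,               dev);  // total L2 size
cudaDeviceGetAttribute(&maxPersistBytes, cudaDevAttrMaxPersistingL2CacheSize,  dev);  // max persisting carveout size
cudaDeviceGetAttribute(&maxWindowBytes,  cudaDevAttrMaxAccessPolicyWindowSize, dev);  // max num_bytes for one access policy window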

bump…can anyone help with this?

Bumping one more time because this isn’t just an idle question; it may be relevant to some work we’re doing.
If this isn’t the appropriate forum, please let me know where I should ask.

There are at least two places with documentation on L2 persistence: the programming guide and the best practices guide.

I haven’t studied your post in great detail, but after going through it more than once, I distill it down to a couple of things:

  1. Probably best to take the documentation at face value
  2. Wondering about what happens to kernels launched into streams that have no persistence specification.

I’m quite certain that if a cache line is pulled into the L2 (and not evicted), a subsequent request for data in that cache line will hit in the L2.

Beyond that, the behavior of overlapping regions, or simply of multiple streams sharing the same L2 persistence carveout, is not fully specified, and I’m not sure it ever will be. The cited documentation refers to this case; it does not give a precise answer, but it mentions “sharing” the carveout. Given that the mechanism seems to have a probabilistic aspect, I’m not sure a detailed access-by-access specification will ever be given. (For example, for a hit ratio less than 1.0, no formula is given, that I can see, to determine a priori whether a given address will be cached, i.e. will persist, or not.)
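For what it’s worth, the “sharing” suggestion in the docs amounts to something like the following (a rough sketch, not a specification; the 0.5 fractions and the names d_RA, d_RB, windowBytes are arbitrary placeholders): each stream marks only part of its window as persisting, so the combined persisting footprint fits in the carveout, and which particular lines actually persist is left to the hardware.

// Rough sketch of two concurrent streams sharing the persisting carveout.
cudaStreamAttrValue attrA = {};
attrA.accessPolicyWindow.base_ptr  = d_RA;        // streamA's region (placeholder)
attrA.accessPolicyWindow.num_bytes = windowBytes;
attrA.accessPolicyWindow.hitRatio  = 0.5f;        // roughly half the accesses in the window treated as persisting
attrA.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attrA.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(streamA, cudaStreamAttributeAccessPolicyWindow, &attrA);

cudaStreamAttrValue attrB = attrA;
attrB.accessPolicyWindow.base_ptr = d_RB;         // streamB's region (placeholder), same hitRatio
cudaStreamSetAttribute(streamB, cudaStreamAttributeAccessPolicyWindow, &attrB);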

If you launch a kernel into a stream that doesn’t have a persistence “spec”, then there is no reason to assume that data requested by that kernel will make it into the persistence region (i.e. the carveout). It may, but I don’t see a detailed description of this. If it matters to you, you should probably provide a persistence spec for that stream.

You’re welcome to file a bug to request any documentation clarifications you would like to see.