L2 persistence clarifications

This topic partially duplicates Persistant L2 API restrict in stream, what if in other stream?, but I’m trying to flesh things out a bit more. An answer to this topic should cover that poster’s question as well, I hope.

The L2 persistence documentation is ambiguous to me.

Here are my competing mental models. Both seem consistent with the programming guide.

Model 0 (unlikely, but I can’t be sure it’s wrong):
cudaStreamSetAttribute(streamA, ...sets persistence for region R) somehow establishes that accesses to R by kernels launched into streamA obey the persistence settings, while accesses to R by kernels in other streams do not. I don’t see how this is possible. If a thread from some kernel accesses some address, I don’t see how L2 can know which user-facing stream launched that kernel (especially since user-facing streams are a programming-model abstraction).

Model 1 (more likely, but again I’m not sure):
The “stream” argument to cudaStreamSetAttribute is just a hint that establishes the usual (asynchronous) stream ordering with which the persistence setting for a region is communicated to the L2 hardware. For example, if I call

cudaStreamSetAttribute(streamA, ...sets persistence for region R);
kernelA<<<grid, block, 0, streamA>>>(...);
kernelB<<<grid, block, 0, streamB>>>(...);

I’m guaranteed that kernelA’s accesses to R will obey the persistence settings, because it’s in streamA.
If kernelB also accesses R, those accesses might obey the same settings if kernelB happens to run after, or around the same time as, kernelA (i.e., after the settings have been communicated to L2 “in streamA”).
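(For concreteness, here is roughly what I mean by “...sets persistence for region R”; a minimal sketch, where d_R, numBytesR, and carveoutBytes are placeholder names I’m making up for this post:)

// Minimal sketch of the call I'm abbreviating as "...sets persistence for region R".
// d_R, numBytesR, and carveoutBytes are placeholders.
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carveoutBytes);   // reserve part of L2 for persisting accesses

cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = d_R;                             // start of region R (device pointer)
attr.accessPolicyWindow.num_bytes = numBytesR;                       // size of region R in bytes
attr.accessPolicyWindow.hitRatio  = 1.0f;                            // fraction of accesses in the window given hitProp
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(streamA, cudaStreamAttributeAccessPolicyWindow, &attr);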
In this model, the following

cudaStreamSetAttribute(streamA, ...sets "persistenceA" for region R);
cudaStreamSetAttribute(streamB, ...sets "persistenceB" for region R);
kernelA<<<grid, block, 0, streamA>>>(...);
kernelB<<<grid, block, 0, streamB>>>(...);

incurs a “race condition” on the persistence status of R, and the behavior of accesses to R by each kernel depends on when they run relative to when persistenceA and persistenceB get communicated to L2. The L2 effects of the SetAttribute call and the kernel execution in streamA may slide past their counterparts in streamB, and the overall sequence may “happen” on the device in any order allowed by the (independent) orderings of streamA and streamB.
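(If that’s right, I assume the “race” can be avoided by explicitly ordering the two phases; a sketch of what I mean, keeping the same schematic notation as above:)

cudaStreamSetAttribute(streamA, ...sets "persistenceA" for region R);
kernelA<<<grid, block, 0, streamA>>>(...);
cudaStreamSynchronize(streamA);   // streamA's setting and kernelA's accesses to R are done before streamB's begin

cudaStreamSetAttribute(streamB, ...sets "persistenceB" for region R);
kernelB<<<grid, block, 0, streamB>>>(...);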
A further consequence of this model is: as seen by kernels in any stream, there’s only ever one active persistence behavior for a particular address range. For example, if I call

cudaStreamSetAttribute(streamA, ...sets "persistenceA" for region RA);
cudaStreamSetAttribute(streamB, ...sets "persistenceB" for region RB that overlaps with RA);
kernelA<<<grid, block, 0, streamA>>>(...);
kernelB<<<grid, block, 0, streamB>>>(...);

the behavior for the overlapping region gets overwritten by whichever SetAttribute call happens last on the device, and accesses to the overlapping region by either kernel obey whichever behavior was set most recently. It’s not true that kernelA’s accesses (by virtue of being in streamA) always obey persistenceA, and kernelB’s accesses (by virtue of being in streamB) always obey persistenceB.
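(Related: my reading of the programming guide is that the way to get back to a clean slate is to disable a stream’s access policy window and reset persisting lines, roughly like this; another sketch, and part of what I’d like confirmed:)

// Sketch: disable streamA's access policy window and demote any persisting lines.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.num_bytes = 0;                               // a zero-sized window disables the policy for this stream
cudaStreamSetAttribute(streamA, cudaStreamAttributeAccessPolicyWindow, &attr);
cudaCtxResetPersistingL2Cache();                                     // reset all persisting L2 cache lines to normal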

Which of these models (if either) is correct? Model 1 makes more sense to me.
(An implicit assumption of Model 1 is that all streams transparently share the L2 cache. In other words, if some cache line was pulled into L2 by a kernel in streamA, a later access to the same line from a kernel in streamB can hit, and won’t be assigned a separate redundant line in L2. But I’m almost certain this is true. Please tell me if it’s not!)

Another point that’s unclear from the docs for either model: Is there a limit on the number of address regions for which you can simultaneously set persistence properties?
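(The only related limits I can find to query programmatically are sizes, not a count of regions; a sketch of the queries I mean, in case they are relevant:)

// Sketch: the limits I know how to query (none of them is a "number of regions").
int dev = 0, l2Bytes = 0, maxPersistBytes = 0, maxWindowBytes = 0;
cudaDeviceGetAttribute(&l2Bytes,         cudaDevAttrL2CacheSize,               dev);  // total L2 size
cudaDeviceGetAttribute(&maxPersistBytes, cudaDevAttrMaxPersistingL2CacheSize,  dev);  // max persisting carveout size
cudaDeviceGetAttribute(&maxWindowBytes,  cudaDevAttrMaxAccessPolicyWindowSize, dev);  // max num_bytes for one access policy window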

bump…can anyone help with this?

Bumping one more time because this isn’t just an idle question; it may be relevant to some work we’re doing.
If this isn’t the appropriate forum, please let me know where I should ask.

There are at least two places with documentation on L2 persistence: the programming guide and the best practices guide.

I haven’t studied your post in great detail, but after going through it more than once, I distill it down to a couple of things:

  1. Probably best to take the documentation at face value
  2. Wondering about what happens to kernels launched into streams that have no persistence specification.

I’m quite certain that if a cache line is pulled into the L2 (and not evicted), a subsequent request for data in that cache line will hit in the L2.

Beyond that, the behavior of overlapping regions, or simply of multiple streams sharing the same L2 persistence carveout, is not fully specified, and I’m not sure it ever will be. The cited documentation refers to this case; it does not give a precise answer, but it mentions “sharing” the carveout. Given that the mechanism seems to have a probabilistic aspect, I’m not sure a detailed access-by-access specification will ever be given. (For example, for a hit ratio less than 1.0, no formula is given, that I can see, to determine a priori whether a given address will be cached, i.e. will persist, or not.)
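For what it’s worth, the “sharing” suggestion in the docs amounts to something like the following (a rough sketch, not a specification; the 0.5 fractions and the names d_RA, d_RB, windowBytes are arbitrary placeholders): each stream marks only part of its window as persisting, so the combined persisting footprint fits in the carveout, and which particular lines actually persist is left to the hardware.

// Rough sketch of two concurrent streams sharing the persisting carveout.
cudaStreamAttrValue attrA = {};
attrA.accessPolicyWindow.base_ptr  = d_RA;        // streamA's region (placeholder)
attrA.accessPolicyWindow.num_bytes = windowBytes;
attrA.accessPolicyWindow.hitRatio  = 0.5f;        // roughly half the accesses in the window treated as persisting
attrA.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attrA.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(streamA, cudaStreamAttributeAccessPolicyWindow, &attrA);

cudaStreamAttrValue attrB = attrA;
attrB.accessPolicyWindow.base_ptr = d_RB;         // streamB's region (placeholder), same hitRatio
cudaStreamSetAttribute(streamB, cudaStreamAttributeAccessPolicyWindow, &attrB);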

If you launch a kernel into a stream that doesn’t have a persistence “spec”, then there is no reason to assume that data requested by that kernel will make it into the persistence region (i.e. the carveout). It may, but I don’t see a detailed description of this. If it matters to you, you should probably provide a persistence spec for that stream.

You’re welcome to file a bug to request any documentation clarifications you would like to see.