On the Ampere architecture, each SM has its own L1 cache, while L2 is shared by all SMs.
If I write to L2 using a volatile store, does that guarantee coherence by invalidating the L1 copies as well?
More specifically: an A100 has 108 SMs. Suppose all of them happen to have the same buffer cached in their L1, and I schedule two kernels, kernelA and kernelB, on the same stream, where:
kernelA: a single thread block (on one SM) is launched and writes to this buffer through a volatile pointer, which bypasses L1. Let’s assume every byte is written as the unsigned char value 255.
kernelB: many thread blocks, spread across many SMs, read the buffer and check that every byte is indeed 255.
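To make the setup concrete, here is a minimal sketch of what I have in mind. The helper name `countMismatches`, the launch dimensions, and the warm-up launch are my own illustrative assumptions, not taken from any real code base. In particular, the warm-up read is only an attempt to get the buffer into L1 on many SMs; I am not sure it actually does so.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// kernelA: a single thread block writes 255 to every byte through a
// volatile pointer (the "write to L2 directly" case from the question).
__global__ void kernelA(volatile unsigned char* buf, size_t n) {
    for (size_t i = threadIdx.x; i < n; i += blockDim.x) {
        buf[i] = 255;
    }
}

// kernelB: many thread blocks read the buffer with plain loads and
// count every byte that is not 255.
__global__ void kernelB(const unsigned char* buf, size_t n,
                        unsigned int* mismatches) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    size_t start = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    for (size_t i = start; i < n; i += stride) {
        if (buf[i] != 255) atomicAdd(mismatches, 1u);
    }
}

// Hypothetical host helper: returns the number of mismatching bytes seen
// by kernelB, or -1 if a CUDA call fails.
long long countMismatches(size_t n) {
    unsigned char* buf = nullptr;
    unsigned int* dMismatches = nullptr;
    unsigned int hMismatches = 0;

    if (cudaMalloc(&buf, n) != cudaSuccess) return -1;
    if (cudaMalloc(&dMismatches, sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(buf);
        return -1;
    }
    cudaMemset(buf, 0, n);

    // Warm-up read on many SMs (assumption: this may or may not leave the
    // buffer cached in L1). Its mismatch count is discarded afterwards.
    kernelB<<<1024, 256>>>(buf, n, dMismatches);
    cudaMemsetAsync(dMismatches, 0, sizeof(unsigned int), 0);

    // Everything is issued to the default stream, so kernelB starts only
    // after kernelA has finished.
    kernelA<<<1, 256>>>(buf, n);
    kernelB<<<1024, 256>>>(buf, n, dMismatches);

    // The blocking copy also waits for the kernels above to complete.
    cudaError_t err = cudaMemcpy(&hMismatches, dMismatches,
                                 sizeof(hMismatches), cudaMemcpyDeviceToHost);
    cudaFree(buf);
    cudaFree(dMismatches);
    return err == cudaSuccess ? (long long)hMismatches : -1;
}
```

Calling `countMismatches(1 << 20)` from `main` and printing the result would show whether kernelB ever observes a stale byte. A result of 0 on one run would of course not prove that the behavior is guaranteed, which is why I am asking.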
My questions:
- Writing to L2 directly: After kernelA, will ALL the affected L1 cache lines in ALL SMs be invalidated to ensure coherence, so that every SM in kernelB reads from L2 instead of its stale L1 copy and sees the expected values?
- Writing to L1: Suppose that in kernelA the SM writes through a plain (non-volatile) pointer dereference, which as I understand it only guarantees that the write completes in L1. In kernelB we would still assume the written values are visible. Is it guaranteed that, between kernel launches, the L1 data is flushed all the way to L2 and that all other SMs’ L1 caches are invalidated as well?