L2 cache misses

Hello,

I was watching this webinar and I didn’t understand why, with a stride of 64 bytes, we get double the number of misses. I also didn’t understand the idea that the L2 fetches data from memory at a granularity of 2 sectors. If that is the case, shouldn’t we get 16 misses for a 32-byte stride and 32 misses for a 64-byte stride? (starting at about 33 min)

Lastly, I tried this on my Jetson Xavier and saw that beyond a stride of 32 the Sector Misses to System count stays constant.
Please let me know if you need further information.

Thank you for your support.

Whenever you reference video presentations, please mention the time mark at which the referenced content occurs. In this case it is at about 33 minutes.

Generally speaking sectored caches try to minimize tag storage while still allowing fine granularity of cache operations by storing status information (valid, dirty, etc.) per sector. This particular cache uses a cache line length of 128 bytes consisting of 4 sectors of 32 bytes each. However – and this strikes me as somewhat unusual – fetches from device memory occur at the granularity of two sectors. Why have four sectors then? Presumably writes actually happen with sector granularity.
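
For concreteness, here is a small sketch of the address arithmetic implied by that geometry (the constants are simply the numbers stated above, not something taken from documentation):

```cpp
// Map addresses to cache lines, sectors, and 64-byte fetch groups,
// assuming 128-byte lines, 32-byte sectors, and fills of 2 sectors.
#include <cstdio>

int main() {
    const unsigned lineBytes   = 128;  // cache line size
    const unsigned sectorBytes = 32;   // sector size
    const unsigned fetchBytes  = 64;   // fill granularity (2 sectors)

    for (unsigned addr = 0; addr < 256; addr += 64) {   // 64-byte stride
        unsigned line   = addr / lineBytes;
        unsigned sector = (addr % lineBytes) / sectorBytes;
        unsigned group  = (addr % lineBytes) / fetchBytes;
        printf("addr %3u -> line %u, sector %u, 64-byte fetch group %u\n",
               addr, line, sector, group);
    }
    return 0;
}
```

Each access in that loop falls into a different fetch group, which leads to the miss counting described below.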

So with an access stride of 64 bytes, each required L2 fetch involves the status data of two consecutive sectors: one for the sector the requested data actually resides in, and one for the neighboring sector it is paired with for L2 fetch purposes (one might think of this as a “phantom” miss caused by “overfetch”, although one could question whether that is helpful for one’s mental model). Thus two misses are counted, one for each of the sectors involved. This keeps the profiling hardware and software, which track events by sector, simple.

For an access stride of 32 bytes there is a miss in every sector, all of which are “real” misses.
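
For anyone who wants to reproduce the experiment, below is a minimal sketch of a strided-read kernel that could be profiled with the L2 sector-miss counters; the kernel name, buffer size, and launch configuration are illustrative choices of mine, not anything from the webinar:

```cpp
// Strided-read microbenchmark: consecutive threads read addresses that are
// strideBytes apart, so the stride controls how many sectors per line are
// actually touched.
#include <cuda_runtime.h>

__global__ void stridedRead(const char* __restrict__ in, char* out,
                            size_t strideBytes, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i * strideBytes];  // strided load, coalesced store
    }
}

int main() {
    const size_t strideBytes = 64;   // try 32 vs. 64 and compare the counters
    const size_t n = 1 << 20;        // number of strided accesses
    char *in = nullptr, *out = nullptr;

    cudaMalloc((void**)&in,  n * strideBytes);
    cudaMalloc((void**)&out, n);
    cudaMemset(in, 0, n * strideBytes);

    stridedRead<<<(unsigned)((n + 255) / 256), 256>>>(in, out, strideBytes, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Profiling this with strideBytes = 32 and then 64 and comparing the reported sector misses should show the doubling described above.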


Thank you very much for your reply.
The idea of a phantom miss was not clear to me from the video. The 64-byte fetch granularity was also mentioned in the video, but I couldn’t find this information anywhere else.

I am not familiar with NVIDIA’s embedded products. Have you checked whether NVIDIA has a relevant architectural whitepaper and whether this information is in there? Jetson Xavier belongs to the Volta architecture family, I think?

Historically NVIDIA has been secretive about the microarchitectural details of their GPUs. While this is annoying to folks with a deep technical interest and a hindrance to ninja-level CUDA optimizers, the people in charge probably feel (speculation!) that this stance best furthers NVIDIA’s business interests, which, based on observation, is likely true.

Some initially non-public information is reverse engineered by third parties; other information “leaks out” when it is necessary for NVIDIA to discuss certain of their tools, such as the CUDA profiler. For example, the availability of SASS disassembly (debugger, cuobjdump) led NVIDIA to publish the architecture-specific instruction sets, but this was done in the most rudimentary fashion, omitting all details. A likely place to find nuggets of information revealed either way is in presentations given at GPU Developer Conferences.
