L2 cache misses

Hello,

I was watching this webinar and I didn’t understand why, with a stride of 64 bytes, we get double the number of misses. I also didn’t understand the idea that the L2 fetches data from memory at a granularity of 2 sectors. If that is the case, shouldn’t we get 16 misses for a 32-byte stride and 32 misses for a 64-byte stride? (starting at about 33 min)

Lastly, I tried this on my Jetson Xavier and saw that beyond a stride of 32 the Sector Misses to System count stays constant.
Please let me know if you need further information.

Thank you for your support.

Whenever you reference video presentations, please mention the time mark at which the referenced content occurs. In this case it is at about 33 minutes.

Generally speaking sectored caches try to minimize tag storage while still allowing fine granularity of cache operations by storing status information (valid, dirty, etc.) per sector. This particular cache uses a cache line length of 128 bytes consisting of 4 sectors of 32 bytes each. However – and this strikes me as somewhat unusual – fetches from device memory occur at the granularity of two sectors. Why have four sectors then? Presumably writes actually happen with sector granularity.
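
For concreteness, here is a small sketch of the address arithmetic implied by that geometry (the constants are simply the numbers stated above, not something taken from documentation):

```cpp
// Map addresses to cache lines, sectors, and 64-byte fetch groups,
// assuming 128-byte lines, 32-byte sectors, and fills of 2 sectors.
#include <cstdio>

int main() {
    const unsigned lineBytes   = 128;  // cache line size
    const unsigned sectorBytes = 32;   // sector size
    const unsigned fetchBytes  = 64;   // fill granularity (2 sectors)

    for (unsigned addr = 0; addr < 256; addr += 64) {   // 64-byte stride
        unsigned line   = addr / lineBytes;
        unsigned sector = (addr % lineBytes) / sectorBytes;
        unsigned group  = (addr % lineBytes) / fetchBytes;
        printf("addr %3u -> line %u, sector %u, 64-byte fetch group %u\n",
               addr, line, sector, group);
    }
    return 0;
}
```

Each access in that loop falls into a different fetch group, which leads to the miss counting described below.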

So with an access stride of 64 bytes, each required L2 fetch involves the status data of two consecutive sectors: one for the sector the requested data actually resides in, and one for the neighboring sector it is paired with for L2 fetch purposes (one might think of this as a “phantom” miss caused by “overfetch”, although one could question whether that is helpful for one’s mental model). Thus two misses are counted, one for each of the sectors involved. This keeps the profiling hardware and software, which track events by sector, simple.

For an access stride of 32 bytes there is a miss in every sector, all of which are “real” misses.
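
For anyone who wants to reproduce the experiment, below is a minimal sketch of a strided-read kernel that could be profiled with the L2 sector-miss counters; the kernel name, buffer size, and launch configuration are illustrative choices of mine, not anything from the webinar:

```cpp
// Strided-read microbenchmark: consecutive threads read addresses that are
// strideBytes apart, so the stride controls how many sectors per line are
// actually touched.
#include <cuda_runtime.h>

__global__ void stridedRead(const char* __restrict__ in, char* out,
                            size_t strideBytes, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i * strideBytes];  // strided load, coalesced store
    }
}

int main() {
    const size_t strideBytes = 64;   // try 32 vs. 64 and compare the counters
    const size_t n = 1 << 20;        // number of strided accesses
    char *in = nullptr, *out = nullptr;

    cudaMalloc((void**)&in,  n * strideBytes);
    cudaMalloc((void**)&out, n);
    cudaMemset(in, 0, n * strideBytes);

    stridedRead<<<(unsigned)((n + 255) / 256), 256>>>(in, out, strideBytes, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Profiling this with strideBytes = 32 and then 64 and comparing the reported sector misses should show the doubling described above.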


Thank you very much for your reply.
The idea of a phantom miss was not clear to me from the video. The 64-byte fetch granularity was also mentioned in the video, but I couldn’t find this information anywhere else.

I am not familiar with NVIDIA’s embedded products. Have you checked whether NVIDIA has a relevant architectural whitepaper and whether this information is in there? Jetson Xavier belongs to the Volta architecture family, I think?

Historically NVIDIA has been secretive about the microarchitectural details of their GPUs. While this is annoying to folks with a deep technical interest and a hindrance to ninja-level CUDA optimizers, the people in charge probably feel (speculation!) that this stance best furthers NVIDIA’s business interests, which, based on observation, is likely true.

Some initially non-public information is reverse engineered by third parties; other information “leaks out” when it is necessary for NVIDIA to discuss certain of their tools, such as the CUDA profiler. For example, the availability of SASS disassembly (debugger, cuobjdump) led NVIDIA to publish the architecture-specific instruction sets, but this was done in the most rudimentary fashion, omitting all details. A likely place to find nuggets of information revealed either way is in presentations given at GPU Developer Conferences.
