L2 hit rate always at 100%

Hello,

I am profiling a simple application that stores a constant value to a matrix. I am using it on a Jetson AGX Xavier and with the NCU 2022.2.1.0 version. The memory access is coalesced (using the absolute index of the thread and not the threadidx only). Since the L2 cache on the Xavier has a size of 512 KB, I chose to have bigger size of array(1MB,10MB…). But I am having always a L2 cache hit of 100% using ncu.

__global__ void
memoryKernelSingleSM(volatile unsigned int* d_matrix, int *d_result) {

	int ind = blockIdx.x * blockDim.x + threadIdx.x; 
	volatile unsigned int r_sum; 
	r_sum = 0;

	#if RW==0
	r_sum = d_matrix[ind*ELEMENTS_PER_STRIDE];
	#elif RW==1
	d_matrix[ind] = 7;
	#endif	

}

This example is with 1MB:

with 2MB

Why I am having this despite having a matrix which size is bigger than the cache size.

N.B: the number of threads is enough to access each element of the matrix (4B per element) and the SASS is displaying the store instruction.

Thank you for your support

In L2, the cache policy is set so that all stores are considered hits. Only loads will cause misses. From the chart, it looks like only stores are occurring. Does that answer your question?

1 Like

Yes thank you but what is the L2 cache policy? is it a write-through policy with no allocation on write miss?

You may want to check this either

https://forums.developer.nvidia.com/t/l2-cache-in-a100-provides-179-hit-rate

The cache policy for writes in L2 is write-back by default.

1 Like

In that case, why a write-back policy gives a 100% cache hit? Since in a write-back policy, the cache line is written back to the memory only when the line is evicted. So in that case, we shouldn’t have any data written back to memory if we have 100% cache hit.

For stores, the metrics are defined as always counting a hit because (I’m generalizing here) from the instruction’s perspective, it writes to L2 and returns. Whether that causes an eviction or not, the store is unaware, and it does not impact the store instruction. For that reason, it was chosen to count the stores this way. It will always count the stores as a 100% hit rate, but that does not imply that no data is written back to memory.

1 Like

Great thank you.

Is there any documentation that illustrate how the metrics are defined or calculated?

One last thing: The %peak represents what actually? I am asking because for example sometimes I get 7GB/s as throughput (from the SoC memory to L2 cache) but the %peak is equal to 60% which does not represent 60% of 136.5GB/s which is the maximum bandwidth of memory bus.

The best place to look for metric definitions is the Kernel Profiling Guide although it may not have all the information you’re looking for.

With respect to the throughput, can you share what specific metric you’re looking at to get the 7GB/s and 60% of peak, and also where the 136.5GB/s number is coming from? In general, the percentage of peak is defined as utilized percentage of the hardware’s peak performance so your assumption seems correct, but we would need to dig deeper.

Hello,

Here is an example:

As we can see for a 6.48 GB/s, I am getting a 60% peak.

For the maximum bandwidth, I am refering to this manual (page 11, table 3):

Thank you.