I would like to understand how Nsight Compute (2019.5.0) reports L2 cache counters for unified virtual memory, i.e. memory allocated via cudaMallocManaged. In my program, the array is allocated and initialized on the CPU, and it migrates to the GPU on first touch. With cudaMalloc, a small allocation can plausibly give a 100% L2 hit rate. With cudaMallocManaged, however, I would expect L2 cache misses to be reported, since the data is not in GPU memory until it is used. Yet for my program, Nsight Compute reports a 99.71% L2 cache hit rate.
I am allocating a large virtual address range (4 GB) but accessing only a small chunk of it (2 MB), and only once. There are 19 loads and 20 stores. Of these, I expect 18 of the loads to miss, which would give an L2 cache hit rate of ~54%.
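Spelled out, the ~54% estimate is computed as follows. This assumes that all 20 stores and the one remaining load count as hits, which is my assumption about how the counters are aggregated and not something taken from the Nsight Compute documentation:

$$
\text{L2 hit rate} = \frac{(19 + 20) - 18}{19 + 20} = \frac{21}{39} \approx 0.538 \approx 54\%
$$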
I don't understand why the reported hit rate (99.71%) is so much higher than my estimate.
Thanks in advance.
Driver version: 410.104
Runtime: 10.0 [I am hoping this version is not an issue]
GPU: Pascal, 1080Ti
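
For reference, here is a minimal sketch of the kind of setup described above. It is not the actual program: the kernel name `touch_chunk`, the helper `count_mismatches`, the `int` element type, and the launch configuration are all placeholders chosen for illustration, so the load/store counts it produces in the profiler will not match the 19/20 quoted above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: each thread loads one element and stores it back once.
__global__ void touch_chunk(int *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] + 1;
    }
}

// Host-side check: counts elements that are not equal to index + 1.
size_t count_mismatches(const int *data, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] != (int)i + 1) ++bad;
    }
    return bad;
}

int main()
{
    const size_t total_bytes = 4ULL << 30;  // 4 GiB managed allocation
    const size_t chunk_bytes = 2ULL << 20;  // only 2 MiB is ever accessed
    const size_t n = chunk_bytes / sizeof(int);

    int *data = nullptr;
    CHECK(cudaMallocManaged(&data, total_bytes));

    // Initialize on the CPU, so the touched pages start out in host memory.
    for (size_t i = 0; i < n; ++i) data[i] = (int)i;

    // Single pass over the chunk on the GPU; pages migrate on first touch.
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    touch_chunk<<<blocks, threads>>>(data, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    printf("mismatches: %zu\n", count_mismatches(data, n));
    CHECK(cudaFree(data));
    return 0;
}
```

Built with `nvcc` and profiled with Nsight Compute, a program of this shape is what I would expect to show L2 misses for the single pass over the 2 MB chunk.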