Understanding the functioning of nvprof and the .cv data load option


I wanted to disable both the L1 and L2 caches for data read accesses. The PTX guide (PTX ISA :: CUDA Toolkit Documentation) says that the .cv option does the following:

"Cache as volatile (consider cached system memory lines stale, fetch again).

The ld.cv load cached volatile operation applied to a global System Memory address invalidates (discards) a matching L2 line and re-fetches the line on each new load, to allow the thread program to poll a SysMem location written by the CPU. A ld.cv to a frame buffer DRAM address is the same as ld.cs, evict-first."

To achieve this effect, I compile my program with the -Xptxas -dlcm=cv option, using CUDA 6.5 on a K40c. However, when I profile the code with nvprof, the L1 Global Hit Rate is 0, but the metric l2_l1_read_hit_rate (L2 Hit Rate (L1 Reads)) is reported as 100.00%. Why should this happen? If the accesses are going through to device memory, shouldn’t the L2 hit rate for L1 read accesses be 0 as well?

nvprof also reports a nonzero number of L2 read requests from L1. Can someone please clarify this? I came across similar threads on this forum, but this particular question doesn’t seem to have come up.


AFAIK Tesla K40c would not use L1 for global loads by default anyway:


As indicated in the link above, however, local loads can flow through the L1 (e.g. in the event of register spilling, or large scale usage of local data, or local data which is indexed but for which the compiler cannot determine the necessary indexing at compile time, etc.)
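
To make the "indexed local data" case concrete, here is a minimal sketch (my own illustration, not code from this thread). The names dynamicIndexKernel and runDynamicIndexDemo are hypothetical. Whether the compiler actually places scratch in local memory depends on the compiler version and optimization level, so check the ptxas -v output rather than assuming it.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: `scratch` is a per-thread array indexed with a value
// that is only known at run time, so the compiler MAY place it in local
// memory instead of registers (compiler-dependent, not guaranteed).
__global__ void dynamicIndexKernel(const int *idx, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int scratch[32];
    for (int i = 0; i < 32; ++i)
        scratch[i] = i * tid;

    // Run-time index: cannot be resolved at compile time.
    out[tid] = scratch[idx[tid] & 31];
}

// Hypothetical host helper: launches the kernel and checks the results.
// Returns true only if every CUDA call succeeded and the output matches.
bool runDynamicIndexDemo()
{
    const int n = 256;
    int hIdx[n], hOut[n];
    for (int i = 0; i < n; ++i) hIdx[i] = (i * 7) % 32;

    int *dIdx = NULL, *dOut = NULL;
    if (cudaMalloc(&dIdx, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) {
        cudaFree(dIdx);
        return false;
    }
    cudaMemcpy(dIdx, hIdx, n * sizeof(int), cudaMemcpyHostToDevice);

    dynamicIndexKernel<<<(n + 127) / 128, 128>>>(dIdx, dOut, n);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hOut, dOut, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIdx);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (hOut[i] != hIdx[i] * i) return false;
    return true;
}
```

Calling runDynamicIndexDemo() from main and profiling with the l1_cache_local_hit_rate metric should show whether any local traffic is generated on a given toolchain.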

AFAIK the -dlcm=xx option affects global loads, not local loads:


When I review the available metrics:


I see the following metrics; I'm not sure if these are the ones you are specifying:

l1_cache_global_hit_rate:     Hit rate in L1 cache for global loads
l1_cache_local_hit_rate:      Hit rate in L1 cache for local loads and stores
l2_l1_read_hit_rate:          Hit rate at L2 cache for all read requests from L1 cache

I’m unable to locate a metric in the docs called “L1 Read Hit Ratio”.

So you might clarify if you are in fact using l1_cache_global_hit_rate or some other metric. Identifying the exact nvprof command line you are using might help.

If you are using l1_cache_global_hit_rate, then the immediate question/possibility is that the l1 cache global hit rate is zero (as expected) but there may be local activity flowing through the L1 and L2 cache, which would show up in the l2_l1_read_hit_rate metric, but not the l1_cache_global_hit_rate metric.

@txbob : You are right, the metric name is “L1 Global Hit Rate”. I have changed it in the original post as well. Also, the L1 would not be accessed for read-only accesses by default, instead, the read-only cache (texture cache in Fermi) would be used in the Kepler architecture.

In my case, l1_cache_local_hit_rate is 0 as well. If there were local activity from L1 to L2, with no global accesses being made, then I would expect a non-zero hit rate on the local data in L1, especially since there is barely any local data in my application.

Also, I still want to know for sure if the .cv option really bypasses the L2 cache or not.

I stand by my original statement. The L1 cache would not be used for global traffic, by default, on any Kepler cc3.0 or cc3.5 device. This is covered in the documentation link I provided. The statement is not restricted to the read-only case.

@txbob: I agree, even for global writes, only the L2 cache would be accessed. Now, the question is, if the .cv option is passed to ptxas, why should the l2_l1_read_hit_rate (L2 Hit Rate (L1 Reads)) be 100%? As I mentioned earlier, there is no local data in my application (ptxas -v shows no spills).

It is not possible to disable L1 or L2 on the NVIDIA GPUs.

All global memory accesses (with the exception of LDG through the texture unit) go through the L1 cache to the L2 cache. The cache operator is a per-instruction modifier that tells the L1 and L2 the requested cache policy.

All accesses to GPU device memory (on-board DRAM) go through L2, so there is never a reason to support an uncached L2 access. However, accesses that go to system memory have to support this modifier so that you can have a coherent view of system memory with the CPU, GPU, or other clients.

If you change your global memory to system memory and launch the same kernel, you should see a hit rate of 0%. If you do not see this, then we can file a bug.

@Greg : Thanks! Pardon my unawareness, but what is the system memory that you are referring to? Do you mean the CPU memory? If so, with the .cv modifier, all read accesses must discard the L2 value (if present) and fetch the data from system memory, right? If I were to edit the PTX code and force a .cv modifier on a load of data resident in global memory, what behaviour would that lead to?

System memory is the term used in the profiling tools for the CPU DRAM. The function cudaHostAlloc can be used to allocate pinned system memory that is accessible from the GPU.
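
A minimal sketch of such a test, assuming a 64-bit build on a device that supports mapped pinned memory. The names sumKernel and sumFromSystemMemory are hypothetical, and the kernel is deliberately trivial. It only shows how a kernel can read from system memory; it does not by itself prove anything about hit rates.

```cuda
#include <cuda_runtime.h>

// Sums a buffer; `data` may point to device memory or mapped system memory.
__global__ void sumKernel(const float *data, float *result, int n)
{
    // A single-thread loop keeps the sketch simple; it is not meant to be fast.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i) acc += data[i];
        *result = acc;
    }
}

// Hypothetical helper: fills n floats of pinned, mapped system memory with
// 1.0f and sums them on the GPU. Returns the sum, or -1.0f on a CUDA error.
float sumFromSystemMemory(int n)
{
    // May be required on older setups to allow mapped allocations. It fails
    // harmlessly if a context already exists, so the error is cleared.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    float *hData = NULL, *dData = NULL, *dResult = NULL;
    if (cudaHostAlloc((void **)&hData, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess)
        return -1.0f;
    for (int i = 0; i < n; ++i) hData[i] = 1.0f;

    float sum = -1.0f;
    // Device-side alias of the pinned host buffer.
    if (cudaHostGetDevicePointer((void **)&dData, hData, 0) == cudaSuccess &&
        cudaMalloc(&dResult, sizeof(float)) == cudaSuccess) {
        sumKernel<<<1, 32>>>(dData, dResult, n);
        // The blocking copy also waits for the kernel to finish.
        if (cudaMemcpy(&sum, dResult, sizeof(float),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            sum = -1.0f;
    }

    cudaFree(dResult);
    cudaFreeHost(hData);
    return sum;
}
```

Building this with -Xptxas -dlcm=cv and profiling with nvprof --metrics l2_l1_read_hit_rate would be one way to compare against the device-memory case.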

Global memory refers to a virtual address space whose physical address can be in GPU device memory or system memory. The L2 has a coherent view of device memory, and there is no way to access GPU device memory except through the L2, so the .cv cache modifier is meaningless for device memory. However, if the global memory address is mapped to system memory and the address is resident in L2, then the line is flushed if dirty, invalidated, and the value is read from system memory over the PCIe bus.
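
For forcing .cv on one specific load instead of the whole compilation unit, inline PTX is one option. The sketch below is an illustration under assumptions: loadCv, copyWithCv and runCopyWithCvDemo are hypothetical names, a 64-bit build is assumed (hence the "l" constraint), and the pointer must refer to global memory.

```cuda
#include <cuda_runtime.h>

// Loads one float with the .cv cache operator through inline PTX.
// Assumes a 64-bit build and that `p` points to global memory.
__device__ __forceinline__ float loadCv(const float *p)
{
    float v;
    asm volatile("ld.global.cv.f32 %0, [%1];" : "=f"(v) : "l"(p) : "memory");
    return v;
}

// Copies `in` to `out`; only the read uses the .cv operator.
__global__ void copyWithCv(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = loadCv(in + tid);
}

// Hypothetical host helper: runs the copy on device memory and verifies it.
bool runCopyWithCvDemo()
{
    const int n = 256;
    float hIn[n], hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = 0.5f * i;

    float *dIn = NULL, *dOut = NULL;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, n * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    copyWithCv<<<(n + 127) / 128, 128>>>(dIn, dOut, n);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hOut, dOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (hOut[i] != hIn[i]) return false;
    return true;
}
```

With the buffers in device memory, as here, the explanation above implies the load still goes through L2 as usual; the modifier would only matter if dIn were mapped to system memory.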

@Greg : Thanks! That settles most of my issues. Just one last bit: if the CPU changes the value at a device memory address on the GPU while a kernel is in flight (I am assuming no UVM use here), would the L2 still keep its coherent view of device memory?