Difference between ncu metrics

I am learning the ncu command, but I’m confused about two metrics.

They are dram__bytes_read.sum.per_second and l1tex__t_bytes_pipe_lsu_mem_global_op_ld.

I don’t know what the essential difference between them is, because I think DRAM and global memory are the same memory.

In my test kernel, the dram__bytes_read.sum.per_second metric always reports lower bandwidth, and l1tex__t_bytes_pipe_lsu_mem_global_op_ld always reports higher bandwidth.

These metrics refer to different points in the memory hierarchy. As used here, DRAM and “global” are not the same thing.

“Global” refers to the green box labelled “global” in the upper-left corner of that diagram. It represents requests to the logical global memory space that originate from kernel code (warp instructions such as LD).

DRAM refers to the physical entity on the right hand side of the diagram, labelled “device memory”.

As an example, you could have a high level of global traffic, most of which hits in the L2 cache, resulting in relatively little DRAM traffic.
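To make that example concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not a description of how the GPU or ncu actually counts bytes. The function name `simulate`, the 32-byte line size, the single LRU cache standing in for both L1 and L2, and the cache capacity are all assumptions chosen for illustration.

```python
from collections import OrderedDict

LINE_BYTES = 32  # assumed cache line (sector) size for this toy model


def simulate(addresses, access_bytes=4, cache_lines=1024):
    """Return (requested_bytes, dram_bytes) for a sequence of byte addresses.

    requested_bytes stands in for traffic seen on the "global" side.
    dram_bytes stands in for traffic that misses the cache and reaches DRAM.
    """
    cache = OrderedDict()  # line index -> None, ordered from oldest to newest
    requested = 0
    dram = 0
    for addr in addresses:
        requested += access_bytes
        line = addr // LINE_BYTES
        if line in cache:
            cache.move_to_end(line)        # hit: refresh recency, no DRAM traffic
        else:
            dram += LINE_BYTES             # miss: fetch one whole line from DRAM
            cache[line] = None
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return requested, dram


# A 4 KiB buffer of 4-byte words, read 100 times over.
buf = [i * 4 for i in range(1024)]
requested, dram = simulate(buf * 100)
print(requested, dram)  # 409600 4096
```

In this model the buffer fits in the cache, so only the first pass reaches DRAM and the "global" byte count ends up 100 times larger than the DRAM byte count. Reading the buffer just once (`simulate(buf)`) gives equal counts, which corresponds to a streaming kernel with no reuse.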

A “decoder ring” that identifies a few of these memory metrics and the portion of the memory hierarchy they belong to is contained in this introductory ncu blog.

Also, there is a separate forum for questions specific to Nsight Compute, and a “deeper-dive” 3-part blog series focused on Nsight Compute starting here.

