How do I get the number of shared memory bytes loaded/stored in a kernel?

I want to use Nsight Compute to measure the number of shared memory loads/stores.

These could be implicit shared memory loads from wgmma instructions or explicit loads/stores through instructions such as lds/sts.

Right now, I see the following relevant metrics:
smsp__sass_inst_executed_op_shared_ld.sum
smsp__sass_inst_executed_op_shared_st.sum
smsp__inst_executed_op_shared_stsm.sum

These are instruction-count metrics. I am not sure which other metrics I should look at, or how to convert these instruction counts into the number of bytes loaded/stored.

Is there a shared memory equivalent to dram__bytes_read.sum/dram__bytes_written.sum that could work? I feel like there isn’t one. If not, what’s my best bet?
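For the explicit loads/stores only, one rough way to turn an instruction count into bytes is sketched below. It assumes the counters above count warp-level instructions, that every instruction uses the same access width, and that a fixed number of threads is active per instruction. None of that is confirmed here, so the result is an upper-bound estimate at best, and it says nothing about the implicit shared memory reads done by wgmma. The function name and the counter value are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def estimate_shared_bytes(warp_inst_count, bytes_per_thread, active_threads=WARP_SIZE):
    """Estimate bytes moved by explicit shared memory instructions.

    warp_inst_count:  value of a warp-level instruction counter
    bytes_per_thread: access width, ASSUMED uniform (4 for 32-bit, 16 for 128-bit)
    active_threads:   ASSUMED number of non-predicated threads per instruction
    """
    if not 0 <= active_threads <= WARP_SIZE:
        raise ValueError("active_threads must be between 0 and WARP_SIZE")
    return warp_inst_count * active_threads * bytes_per_thread


# Made-up counter value: 1,000,000 warp-level loads, 16 bytes per thread.
print(estimate_shared_bytes(1_000_000, 16))
```

With full warps the estimate is simply count x 32 x width, so the accuracy depends entirely on how well the width and active-thread assumptions match the kernel.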

I would recommend that you collect with --set full or the equivalent UI option and check the memory chart and tables on the Details page. You should be able to find all relevant metrics there by inspecting the tooltips of the respective cells or links. Note that you may need to switch the focused section body to switch from the memory chart to the memory tables in the Memory Workload Analysis section.
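For reference, the command-line collection described above can be sketched as a small helper that builds the `ncu` invocation. `--set full`, `--metrics` and `-o` are standard Nsight Compute CLI options; the helper name, the `./matmul` binary and the report name are placeholders, not anything from this thread.

```python
import shlex


def ncu_command(app, metrics=None, report=None):
    """Build an ncu command line as an argument list.

    app:     the application and its arguments, e.g. ["./matmul"] (placeholder)
    metrics: optional list of metric names; if omitted, collect the full set
    report:  optional output report name passed to -o
    """
    cmd = ["ncu"]
    if metrics:
        cmd += ["--metrics", ",".join(metrics)]
    else:
        cmd += ["--set", "full"]
    if report:
        cmd += ["-o", report]
    return cmd + list(app)


# Collect the full set into a report that can be opened in the UI.
print(shlex.join(ncu_command(["./matmul"], report="matmul_full")))

# Collect only the shared memory instruction counters named in the question.
print(shlex.join(ncu_command(
    ["./matmul"],
    metrics=[
        "smsp__sass_inst_executed_op_shared_ld.sum",
        "smsp__sass_inst_executed_op_shared_st.sum",
    ],
)))
```

The resulting list can be passed to `subprocess.run` on a machine where `ncu` is installed.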

I did look through the counters collected with --set full, but couldn’t find anything relevant in the memory tables in the Memory Workload Analysis section.
An example is here:

This is a matmul kernel with wgmma instructions and in the shared memory section, I do not see any way to get the bytes loaded. I only see instructions/requests.

On the other hand, I was able to find these counters manually and I can see that my gmma instructions have loaded 1.65 TB from shared memory:


I have a bunch of such counters that I found through digging, but I still don’t know how many I am missing or what the best way is to combine them into a total number of loads/stores from shared memory.
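To make the bookkeeping concrete, here is a minimal sketch of how I am combining them. The source labels and byte values are placeholders, not real metric names or measurements, and the sketch assumes each counter covers a disjoint source of traffic, which is exactly the part I cannot verify.

```python
from collections import defaultdict


def total_shared_bytes(counters):
    """Sum per-source byte counters into load and store totals.

    counters: iterable of (source, direction, nbytes) tuples, where source is a
              label such as "wgmma" or "lds" (placeholders) and direction is
              "load" or "store".
    Rejects a repeated (source, direction) pair so the same traffic is not
    counted twice by accident.
    """
    totals = defaultdict(int)
    seen = set()
    for source, direction, nbytes in counters:
        if direction not in ("load", "store"):
            raise ValueError(f"unknown direction: {direction}")
        if (source, direction) in seen:
            raise ValueError(f"duplicate counter for {source} {direction}")
        seen.add((source, direction))
        totals[direction] += nbytes
    return dict(totals)


# Placeholder numbers purely for illustration.
example = [
    ("wgmma", "load", 1_000),
    ("lds", "load", 200),
    ("sts", "store", 300),
]
print(total_shared_bytes(example))
```

Any source that is missing from the list silently lowers the total, so the sum is only as complete as the set of counters fed into it.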