I want to use Nsight Compute to measure the number of shared memory loads and stores.
These could be implicit shared memory loads from wgmma instructions or explicit loads/stores through LDS/STS instructions.
Right now, I see the following relevant metrics:
smsp__sass_inst_executed_op_shared_ld.sum
smsp__sass_inst_executed_op_shared_st.sum
smsp__inst_executed_op_shared_stsm.sum
These are instruction count metrics. I am not sure what other metrics I should look at, or how I could convert these instruction counts into the number of bytes loaded/stored.
Is there a shared memory equivalent to dram__bytes_read.sum/dram__bytes_written.sum that could work? I feel like there isn’t one. If not, what’s my best bet?
I would recommend that you collect with --set full or the equivalent UI option and check the memory chart and tables on the Details page. You should be able to find all relevant metrics there by inspecting the tooltips of the respective cells or links. Note that you may need to switch the focused section body to switch from the memory chart to the memory tables in the Memory Workload Analysis section.
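For reference, here is a minimal sketch of how such a collection could be scripted. It only assembles an `ncu` command line using `--set full`, `--metrics`, and `-o`; the metric names are the ones quoted in this thread, while the application path `./my_matmul` and the report name are placeholders I made up for illustration.

```python
import shlex

# Metric names quoted in this thread (with the .sum rollup the posters used).
SHARED_METRICS = [
    "smsp__sass_inst_executed_op_shared_ld.sum",
    "smsp__sass_inst_executed_op_shared_st.sum",
    "sm__sass_data_bytes_mem_shared.sum",
    "sm__sass_data_bytes_mem_shared_op_gmma.sum",
]


def build_ncu_command(app_argv, report_name="shared_mem_report", metrics=None):
    """Return an argv list for an Nsight Compute CLI run.

    app_argv    -- the application and its arguments (placeholder in this sketch)
    report_name -- hypothetical output report name passed to -o
    metrics     -- optional extra metrics collected on top of the full set
    """
    cmd = ["ncu", "--set", "full"]
    if metrics:
        # ncu expects a single comma-separated list after --metrics.
        cmd += ["--metrics", ",".join(metrics)]
    cmd += ["-o", report_name]
    cmd += list(app_argv)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without a GPU.
    print(shlex.join(build_ncu_command(["./my_matmul"], metrics=SHARED_METRICS)))
```

The printed command can be pasted into a shell, or handed to `subprocess.run` if you want to drive the profiler from a script.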
I did look through the counters collected with --set full, but couldn't find anything relevant in the memory tables in the Memory Workload Analysis section.
An example is here:
This is a matmul kernel with wgmma instructions and in the shared memory section, I do not see any way to get the bytes loaded. I only see instructions/requests.
On the other hand, I was able to find these counters manually and I can see that my gmma instructions have loaded 1.65 TB from shared memory:
I have a bunch of such counters that I found through digging, but I still don’t know how many I am missing and what is the best way to integrate them to find the total number of loads/stores from shared memory.
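To make the "integrate them" part concrete, here is a rough sketch of the kind of bookkeeping I have in mind: group a CSV export of the counters by kernel and sum each one. The column names (`Kernel Name`, `Metric Name`, `Metric Value`) are my assumption about the export layout and may need adjusting, and the sample values are made up, not measured.

```python
import csv
import io
from collections import defaultdict


def sum_metrics_by_kernel(csv_text, wanted):
    """Sum selected metrics per kernel from CSV text.

    ASSUMPTION: one row per (kernel launch, metric) with the columns
    "Kernel Name", "Metric Name" and "Metric Value". Adjust to your export.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["Metric Name"]
        if name in wanted:
            # Values may carry thousands separators such as "1,000".
            value = float(row["Metric Value"].replace(",", ""))
            totals[row["Kernel Name"]][name] += value
    return {kernel: dict(metrics) for kernel, metrics in totals.items()}


# Made-up sample data, only to show the expected shape of the input.
SAMPLE = '''"Kernel Name","Metric Name","Metric Unit","Metric Value"
"matmul","sm__sass_data_bytes_mem_shared.sum","byte","1,000"
"matmul","sm__sass_data_bytes_mem_shared_op_gmma.sum","byte","4000"
"matmul","sm__sass_data_bytes_mem_shared.sum","byte","500"
"other","dram__bytes_read.sum","byte","7"
'''

if __name__ == "__main__":
    wanted = {
        "sm__sass_data_bytes_mem_shared.sum",
        "sm__sass_data_bytes_mem_shared_op_gmma.sum",
    }
    print(sum_metrics_by_kernel(SAMPLE, wanted))
```

This only aggregates whatever counters are listed in `wanted`; it does not tell you whether the list is complete, which is the open part of my question.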
Dear OP, I'm experiencing the same issue. May I kindly ask if you've managed to find a solution? Thank you very much!
Thank you! The gmma bytes metric solved my issue. I was using sm__sass_data_bytes_mem_shared.sum and assumed it included gmma bytes (bytes loaded from TMA for GEMMs), but it reported very low numbers. sm__sass_data_bytes_mem_shared_op_gmma.sum reports the number of bytes I was expecting to see.
I think the documentation is wrong, because it says:
sm__sass_data_bytes_mem_shared (Counter, byte): # of shared memory bytes required for LDS, LD, STS, ST, ATOMS, ATOM, LDSM, LDGSTS, STSM, HGMMA, QGMMA, IGMMA, BGMMA
sm__sass_data_bytes_mem_shared_op_gmma (Counter, byte): # of shared memory bytes required for HGMMA, IGMMA, QGMMA, BGMMA
but I can clearly see that for some kernels mem_shared is around 200 MB while mem_shared_op_gmma is 15 GB, and for other kernels mem_shared is around 5 GB while mem_shared_op_gmma is 0.
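The check I am doing by eye can be written down as a small sketch. If mem_shared really included the gmma bytes, it could never be smaller than mem_shared_op_gmma. The second helper adds the two counters, which is only valid under my assumption that they do not overlap; I have not seen that confirmed anywhere. The function names are my own and the numbers are the rounded ones from above.

```python
MB = 10**6
GB = 10**9


def gmma_could_be_included(mem_shared_bytes, gmma_bytes):
    """Return False when the gmma counter exceeds the 'total' counter.

    A subset can never be larger than the whole, so False means the
    documented "mem_shared includes *GMMA" reading cannot hold for that kernel.
    True means the data is merely consistent with inclusion, not proof of it.
    """
    return gmma_bytes <= mem_shared_bytes


def estimated_total_shared_bytes(mem_shared_bytes, gmma_bytes):
    """Estimate total shared memory traffic in bytes.

    ASSUMPTION (not confirmed by the documentation): the two counters are
    disjoint, so adding them does not double count.
    """
    return mem_shared_bytes + gmma_bytes


if __name__ == "__main__":
    # Kernel A: wgmma-heavy matmul, rounded numbers from this post.
    print(gmma_could_be_included(200 * MB, 15 * GB))        # False
    print(estimated_total_shared_bytes(200 * MB, 15 * GB))  # 15200000000
    # Kernel B: no gmma traffic at all.
    print(gmma_could_be_included(5 * GB, 0))                # True
```

If the counters turn out to overlap on some architecture or tool version, the sum would overestimate, so treat it as an upper bound until someone from NVIDIA confirms.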