Hi,
Is there a metric like sm__sass_data_bytes_mem_shared that also includes TMA bytes?
I’d like to find the performance counter that includes bytes loaded into the SM from shared memory through TMA. I have been using sm__sass_data_bytes_mem_shared, which seems to give the values I expect for FlashAttention (e.g. sm__sass_l1tex_m_xbar2l1tex_read_bytes_mem_global_op_ldgsts_cache_bypass shows 1.14 GB loaded into shared memory from L2, and sm__sass_data_bytes_mem_shared shows 5.67 GB loaded into the SM).
However, when I look at that metric for GEMMs that use TMA, it reports only 121.29 MB of shared memory bytes, even though l1tex__m_xbar2l1tex_read_sectors_mem_global_op_tma_ld shows 5.28 GB loaded from L2 into shared memory via TMA. I would expect closer to 15 GB of data to be loaded into the SM.
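Since the TMA metric above counts sectors rather than bytes, here is a minimal sketch of the conversion behind a figure like 5.28 GB. It assumes 32-byte sectors and decimal gigabytes (1 GB = 1e9 bytes). The helper name and the sector count are hypothetical, with the count chosen only so the result matches the number quoted above; it is not a measured value.

```python
# Assumption: one L1TEX/L2 sector is 32 bytes.
SECTOR_BYTES = 32


def sectors_to_gb(sectors: int, sector_bytes: int = SECTOR_BYTES) -> float:
    """Convert a sector count to decimal gigabytes (1 GB = 1e9 bytes)."""
    return sectors * sector_bytes / 1e9


# Hypothetical sector count, picked only to reproduce the 5.28 GB figure.
tma_sectors = 165_000_000
print(f"{sectors_to_gb(tma_sectors):.2f} GB")
```

If the tool reports binary units (GiB) instead, the divisor would be 2**30 rather than 1e9.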
Thanks!!