TMA bytes from shared memory

Hi,

Is there a variant of the sm__sass_data_bytes_mem_shared metric that includes TMA bytes?

I’d like to find the performance counter that includes bytes loaded into the SM from shared memory through TMA. For FlashAttention, I have been using sm__sass_data_bytes_mem_shared, which seems to give the values I expect: sm__sass_l1tex_m_xbar2l1tex_read_bytes_mem_global_op_ldgsts_cache_bypass shows 1.14 GB loaded into shared memory from L2, and sm__sass_data_bytes_mem_shared shows 5.67 GB loaded into the SM.

However, when I look at that metric for GEMMs that use TMA, it reports only 121.29 MB of shared memory bytes, even though l1tex__m_xbar2l1tex_read_sectors_mem_global_op_tma_ld shows 5.28 GB loaded from L2 into shared memory via TMA. I am expecting something closer to 15 GB of data to be loaded into the SM.

Thanks!!

The gmma bytes metric solved my issue. I was using sm__sass_data_bytes_mem_shared.sum and assumed it included gmma bytes (bytes loaded via TMA for GEMMs), but it reported very low numbers. sm__sass_data_bytes_mem_shared_op_gmma.sum gives the number of bytes I was expecting to see (the 15 GB).
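
For anyone who wants to see both counters side by side, here is a minimal Python sketch. It assumes you have already collected the two metrics (for example with ncu's --metrics option) and copied the values into a dict. The helper names shared_bytes_breakdown and to_gb are hypothetical, the input values are the approximate figures from this thread, and the sketch assumes the two counters do not overlap and that GB means 1e9 bytes, neither of which is confirmed here.

```python
# Metric names are taken from the thread; the values below are the
# approximate figures reported there, used only as illustrative input.
SHARED_SASS = "sm__sass_data_bytes_mem_shared.sum"
SHARED_GMMA = "sm__sass_data_bytes_mem_shared_op_gmma.sum"


def shared_bytes_breakdown(metrics):
    """Return (sass_bytes, gmma_bytes, combined_bytes) from a metric dict.

    Assumption: the two counters do not overlap, so adding them is
    meaningful. Missing metrics are treated as 0 bytes.
    """
    sass = float(metrics.get(SHARED_SASS, 0.0))
    gmma = float(metrics.get(SHARED_GMMA, 0.0))
    return sass, gmma, sass + gmma


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (assumes 1 GB = 1e9 bytes)."""
    return num_bytes / 1e9


if __name__ == "__main__":
    example = {SHARED_SASS: 121.29e6, SHARED_GMMA: 15e9}
    sass, gmma, total = shared_bytes_breakdown(example)
    print(f"non-gmma shared bytes: {to_gb(sass):.2f} GB")
    print(f"gmma shared bytes:     {to_gb(gmma):.2f} GB")
    print(f"combined (assumed):    {to_gb(total):.2f} GB")
```

With the GEMM numbers from this thread, almost all of the shared memory traffic shows up under the gmma counter, which is why the plain metric looked so low.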
