Apologies, I am new to GPU profiling.
I am using Nsight Compute to profile matrix multiplications on an NVIDIA A100 (Ampere) GPU.
I want to know if it is possible to see the kernel mapping inside the GPU memory hierarchy. Specifically, I want to get the actual nested loops of the GEMM kernel as it runs on the hardware, showing how data is mapped to the different memory levels.
Is that possible using Nsight Compute? If not, is there any tool that can do that?
Can you clarify what you mean by the mapping of data to different memory levels? A piece of data can move through the memory levels: for example, it could reside in L1 for a while, then be evicted to L2, and finally evicted all the way to device memory.
I’m not sure I understand what you mean by “the kernel mapping inside the GPU memory hierarchy”.
Thanks a lot for your response, and sorry for the late reply!
I don’t think my question was clear either. Here is what I was asking:
I would like to get the tiling details at the L2, shared memory (SMEM), and register file (RF) levels. I would also like to know the loop ordering, which determines which GEMM matrix/dimension (m, n, k) resides longest in L2, SMEM, and RF, and which dimensions are spatially unrolled across the memory hierarchy.
Can Nsight Compute provide such an analysis?
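To make it concrete, here is the kind of loop nest I mean. This is only a host-side sketch with made-up tile sizes (the constants and the level labels in the comments are my own assumptions, not output from any tool); real kernels such as CUTLASS GEMMs follow a similar blocked structure, with each loop level corresponding to a tile held at one memory level:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical tile sizes, for illustration only.
constexpr int TM2 = 8, TN2 = 8;  // threadblock tile of C, data hot in L2
constexpr int TK  = 4;           // k-slice of A/B staged through SMEM
constexpr int TM1 = 2, TN1 = 2;  // per-thread fragment held in the RF

// C (MxN) += A (MxK) * B (KxN), all row-major.
void tiled_gemm(const float* A, const float* B, float* C,
                int M, int N, int K) {
  for (int m2 = 0; m2 < M; m2 += TM2)        // block tile rows  (L2 level)
    for (int n2 = 0; n2 < N; n2 += TN2)      // block tile cols  (L2 level)
      for (int k2 = 0; k2 < K; k2 += TK)     // k staged via SMEM
        for (int m1 = m2; m1 < std::min(m2 + TM2, M); m1 += TM1)  // thread tile
          for (int n1 = n2; n1 < std::min(n2 + TN2, N); n1 += TN1)
            for (int m0 = m1; m0 < std::min(m1 + TM1, M); ++m0)   // RF fragment
              for (int n0 = n1; n0 < std::min(n1 + TN1, N); ++n0)
                for (int k0 = k2; k0 < std::min(k2 + TK, K); ++k0)
                  C[m0 * N + n0] += A[m0 * K + k0] * B[k0 * N + n0];
}
```

The loop ordering here is one plausible choice; what I want to recover is the actual ordering and tile sizes the real kernel uses.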
Nsight Compute doesn’t have information about where specific pieces of data are mapped/tiled or how they are laid out in memory; it only observes the performance effects of that mapping. There are some additional details about bank conflicts and related behavior for shared memory, and I recommend this GTC talk to learn about that in detail: How to Understand and Optimize Shared Memory Accesses using Nsight Compute | NVIDIA On-Demand
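For intuition on the bank-conflict part: shared memory on current NVIDIA GPUs is divided into 32 banks, each 4 bytes wide, and a warp’s access is conflicted when multiple threads touch different words in the same bank. The helper below is just a host-side sketch of that arithmetic (the function names are mine, not any API), showing why a 32-wide float tile conflicts on column reads and a 33-wide padded tile does not:

```cpp
#include <cassert>
#include <set>

constexpr int kBanks = 32;  // 32 banks, 4-byte words (per the CUDA programming guide)

// Bank index of element (row, col) in a row-major float tile of width `stride`.
int bank_of(int row, int col, int stride) {
  return (row * stride + col) % kBanks;  // one float occupies one 4-byte word
}

// Number of distinct banks touched when the 32 threads of a warp each read
// one row of the same column `col`. 1 means a 32-way conflict; 32 means
// fully conflict-free.
int banks_touched_by_column_read(int col, int stride) {
  std::set<int> banks;
  for (int t = 0; t < 32; ++t) banks.insert(bank_of(t, col, stride));
  return static_cast<int>(banks.size());
}
```

With `stride = 32` every thread lands in the same bank; padding the tile to `stride = 33` spreads the column across all 32 banks, which is the classic padding trick the GTC talk discusses.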
Thanks for the answer and suggestions!
Is there any other tool that can give that information?
I don’t know of a tool that can give that information.