How to measure if the progress is limited by bandwidth?

Hi, I’m using an A800 to do Tensor Core MMA computation, and it doesn’t perform well in some cases. It seems to be limited by bandwidth. I’m doing an m16n8k8 MMA, and in each computation I use 4 warps to load a dense matrix with 128 feature dimensions.
So, how can I figure out whether it is limited by bandwidth?
Any help would be much appreciated~

Profilers are great tools for finding the bottlenecks in code. Highly recommended. NVIDIA even provides multiple flavors: Nsight Systems for a timeline view of the whole application, and Nsight Compute for detailed per-kernel metrics, including memory throughput.

Thank you so much~ And is there any way to obtain the theoretical bandwidth needed by the program?

First, calculate the speed of your Tensor Cores. You gave the matrix size, but not the data type. m16n8k8 is available for FP64, FP16, BF16, and TF32. For FP16 it can additionally make a performance difference whether you accumulate in FP16 or FP32.

The speed also depends on the GPU model and the clock frequency.
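As a starting point, you can count the FLOPs in one warp-level mma and scale by clock and issue rate. A minimal sketch in Python — the clock frequency and the per-cycle issue rate below are placeholder assumptions, not A800 specs; substitute the values for your GPU:

```python
def mma_flops(m, n, k):
    # one multiply + one add per output element per k-step
    return 2 * m * n * k

flops_per_mma = mma_flops(16, 8, 8)  # 2048 FLOPs per warp-level m16n8k8 mma

# Hypothetical rates -- replace with your GPU's datasheet values
clock_hz = 1.41e9        # assumed boost clock of ~1.41 GHz
mmas_per_cycle = 1       # assumed: one mma retired per cycle per scheduler

flops_per_sec = flops_per_mma * mmas_per_cycle * clock_hz
print(f"{flops_per_mma} FLOPs per mma, "
      f"{flops_per_sec / 1e12:.2f} TFLOP/s per issue unit (assumed rates)")
```

Multiply by the number of issue units on the chip to get an aggregate estimate, or simply take the peak Tensor Core TFLOP/s number from the datasheet.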

Then calculate the input and output data size for each operation and consider whether it is reduced by caching, e.g. if you reuse one of the matrix operands across multiple operations.
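Putting the two together gives a simple roofline check: compare the kernel's arithmetic intensity (FLOPs per byte moved) against the GPU's machine balance (peak FLOP/s divided by peak bandwidth). A sketch for the FP16 case, assuming no operand reuse; the peak numbers are assumptions for an A800-class GPU (~312 TFLOP/s FP16 Tensor Core peak, ~2.0 TB/s HBM), so substitute your own datasheet values:

```python
m, n, k = 16, 8, 8
bytes_per_elem = 2                       # FP16

flops = 2 * m * n * k                    # 2048 FLOPs per mma
# worst case: A and B loaded, C read and D written, no reuse at all
bytes_moved = bytes_per_elem * (m * k + k * n + 2 * m * n)

intensity = flops / bytes_moved          # FLOPs per byte

peak_flops = 312e12                      # assumed FP16 Tensor Core peak
peak_bw = 2.0e12                         # assumed HBM bandwidth in bytes/s
machine_balance = peak_flops / peak_bw   # FLOPs/byte needed to be compute-bound

print(f"arithmetic intensity: {intensity:.2f} FLOPs/byte")
print(f"machine balance:      {machine_balance:.1f} FLOPs/byte")
```

If the intensity is far below the machine balance (as it is here for an isolated m16n8k8 mma), the operation is bandwidth-bound unless operand reuse in registers, shared memory, or the L2 cache raises the effective intensity.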