How to measure if the progress is limited by bandwidth?

Hi, I’m using an A800 to do Tensor Core MMA computation, and it doesn’t perform well in some cases. It seems to be limited by bandwidth. I’m doing an m16n8k8 MMA, and in each computation I use 4 warps to load a dense matrix with 128 feature dimensions.
So, how can I figure out whether it is limited by bandwidth?
Any help would be much appreciated~

Profilers are great tools for finding the bottlenecks in code. Highly recommended. NVIDIA even provides multiple flavors: Nsight Systems for a timeline view of the whole application, and Nsight Compute for detailed per-kernel metrics, including memory throughput.

Thank you so much~ And is there any way to obtain the theoretical bandwidth needed by the program?

First, calculate the speed of your Tensor Cores. You gave the matrix size, but not the data type. m16n8k8 is available for FP64, FP16, BF16, and TF32. For FP16 it can additionally make a performance difference whether you accumulate in FP16 or FP32.

The speed also depends on the GPU model and the clock frequency.
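As a starting point, you can count the FLOPs in one warp-level mma and scale by clock and issue rate. A minimal sketch in Python — the clock frequency and the per-cycle issue rate below are placeholder assumptions, not A800 specs; substitute the values for your GPU:

```python
def mma_flops(m, n, k):
    # one multiply + one add per output element per k-step
    return 2 * m * n * k

flops_per_mma = mma_flops(16, 8, 8)  # 2048 FLOPs per warp-level m16n8k8 mma

# Hypothetical rates -- replace with your GPU's datasheet values
clock_hz = 1.41e9        # assumed boost clock of ~1.41 GHz
mmas_per_cycle = 1       # assumed: one mma retired per cycle per scheduler

flops_per_sec = flops_per_mma * mmas_per_cycle * clock_hz
print(f"{flops_per_mma} FLOPs per mma, "
      f"{flops_per_sec / 1e12:.2f} TFLOP/s per issue unit (assumed rates)")
```

Multiply by the number of issue units on the chip to get an aggregate estimate, or simply take the peak Tensor Core TFLOP/s number from the datasheet.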

Then calculate the input and output data size for each operation and consider whether it is reduced by caching, e.g. if you reuse one of the matrix operands across multiple operations.
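Putting the two together gives a simple roofline check: compare the kernel's arithmetic intensity (FLOPs per byte moved) against the GPU's machine balance (peak FLOP/s divided by peak bandwidth). A sketch for the FP16 case, assuming no operand reuse; the peak numbers are assumptions for an A800-class GPU (~312 TFLOP/s FP16 Tensor Core peak, ~2.0 TB/s HBM), so substitute your own datasheet values:

```python
m, n, k = 16, 8, 8
bytes_per_elem = 2                       # FP16

flops = 2 * m * n * k                    # 2048 FLOPs per mma
# worst case: A and B loaded, C read and D written, no reuse at all
bytes_moved = bytes_per_elem * (m * k + k * n + 2 * m * n)

intensity = flops / bytes_moved          # FLOPs per byte

peak_flops = 312e12                      # assumed FP16 Tensor Core peak
peak_bw = 2.0e12                         # assumed HBM bandwidth in bytes/s
machine_balance = peak_flops / peak_bw   # FLOPs/byte needed to be compute-bound

print(f"arithmetic intensity: {intensity:.2f} FLOPs/byte")
print(f"machine balance:      {machine_balance:.1f} FLOPs/byte")
```

If the intensity is far below the machine balance (as it is here for an isolated m16n8k8 mma), the operation is bandwidth-bound unless operand reuse in registers, shared memory, or the L2 cache raises the effective intensity.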