Why is matrix multiplication quite slow and why does all hardware seem to be only half-used?

Hi, thanks for the ecosystem! I tried the very simple matrix multiplication below:

import torch
mat_size = 65536
a = torch.rand((mat_size, mat_size), device='cuda', dtype=torch.bfloat16)
torch.cuda.cudart().cudaProfilerStart()
for i in range(4):
    a = a @ a  # repeatedly square the matrix: each iteration is one 65536 x 65536 matmul
print(a.sum())  # printing the CUDA tensor forces a synchronization
torch.cuda.cudart().cudaProfilerStop()

And profiled it with:

nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o /host_home/research/synced_temp/nsys_profile_output -f true --capture-range=cudaProfilerApi -x true --gpu-metrics-devices=all python myfile.py

And got this Nsight Systems result:

It seems no component is fully used. For example:

  • Tensor core: 50%
  • SM warps: only 17% (others are unallocated warps)
  • DRAM bandwidth: 9.9% (read) + 2.1% (write)

So I wonder: what is blocking the code from executing faster? Thanks for any suggestions!

My personal guess is that it is some kind of overhead, but I wonder whether that can really be as large as 50%…
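
For reference, here is a minimal event-based timing sketch to cross-check the per-iteration time outside of any profiler (same size and dtype as above; the CUDA-event timing and the throughput print are illustrative additions, not part of the original script):

import torch

mat_size = 65536
a = torch.rand((mat_size, mat_size), device='cuda', dtype=torch.bfloat16)

a = a @ a                     # warm-up (cuBLAS init / heuristics)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
a = a @ a                     # one timed 65536 x 65536 matmul
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000.0   # elapsed_time() returns milliseconds
fma = mat_size ** 3                          # one FMA per multiply-accumulate
print(f"{seconds:.2f} s -> {fma / seconds / 1e12:.1f} Tera-FMA/s "
      f"= {2 * fma / seconds / 1e12:.1f} TFLOPS")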

Not sure about this case, but in the past there were situations on some GPUs where the tensor cores were shown as exactly 50% used even though they were running at their maximum.

The reason was either

  • a wrong maximum in the GPU definitions of the Nsight tool
  • different occupancy levels for different parts of the Tensor Core pipeline (e.g. some parts were occupied 50% and others 100%; Nsight just shows the 50%, but the speed was limited by the 100% part)
  • (perhaps some product differentiation and artificial slowdown?)

Could you compare the resulting speed of the computation with your GPU's specified maximum matrix multiplication speed for the data type used?

BF16 speed in FMA/SM/cycle for each architecture (to the best of my knowledge, no guarantee):

  • 7.5 Desktop/8.6 Desktop/8.9 Desktop: 256
  • 7.5 Workstation/8.9 Workstation: 512
  • 8.0/8.6 Workstation/8.7: 1024
  • 9.0: 2048
  • 10.0: 4096

(BF16 support on 7.5 is quite unofficial)

Multiplying two 65536 x 65536 square matrices needs 65536*65536*65536 = 2^48 FMA operations, if that is what your code does.
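
As a quick arithmetic check of that count (plain Python, no GPU involved; purely illustrative):

n = 65536                          # 2 ** 16
fma_per_matmul = n ** 3            # n^2 output elements, each needing n multiply-adds
print(fma_per_matmul)              # 281474976710656
print(fma_per_matmul == 2 ** 48)   # True
# Counting the multiply and the add separately would give 2 * n**3 FLOPs instead.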

[edit: corrected the Ada generation values]

@Curefab Thank you for the quick reply!

From the screenshot, each iteration takes 3.75 s. Since each iteration is 2^48 = 256T operations, the actual TFLOPS is 256/3.75 ≈ 68.

This is a 4090D, thus the theoretical TFLOPS is a bit under 330.

Also, in GitHub - mli/transformers-benchmarks: real Transformer TeraFLOPS on various GPUs, people benchmarked the 4090 and got 172 TFLOPS.

Hmm, so (1) their benchmark says actual = theoretical x 0.5, and (2) my code here shows actual = theoretical x 0.25.

Wait a minute, I will debug this.
I see: that benchmark counts an n x n matrix multiplication as 2*n**3 FLOPs, while your suggestion is 1*n**3. So I wonder which one is correct?

I think I made a mistake with the 8.9 Desktop and 8.9 Workstation speeds; they should be half of the numbers I originally stated - 256 for Desktop (the list above has been corrected).

See the 4080 in

TFLOPs numbers are double the FMA numbers, as an FMA is a multiplication plus an addition, which counts as 2 FLOPs.

The 4090 D has 114 SMs with a boost clock of 2520 MHz (if it is actually running at boost clock): that is 114 * 256 * 2520 million ≈ 73.5 Tera-FMA/s, which fits the 68 Tera-FMA/s you are seeing.

And the resulting 147 TFLOPs also fit approximately with the 172 TFLOPs from that benchmark.
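
The same back-of-the-envelope calculation as a tiny script (SM count, FMA/SM/cycle value and boost clock taken from the figures above):

sms = 114                      # RTX 4090 D
fma_per_sm_per_cycle = 256     # BF16 on compute capability 8.9 Desktop (list above)
boost_clock_hz = 2520e6        # 2520 MHz

tera_fma_per_s = sms * fma_per_sm_per_cycle * boost_clock_hz / 1e12
print(f"{tera_fma_per_s:.1f} Tera-FMA/s")          # ~73.5
print(f"{2 * tera_fma_per_s:.0f} TFLOPs (BF16)")   # ~147 (1 FMA = 2 FLOPs)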

@Curefab Thank you for the insights! Then it is interesting to see GeForce 40 series - Wikipedia saying: “tensor compute: 330”. How can that be possible :/

The Tensor Compute on this page is stated in TFLOPs for the FP16 type with FP16 accumulation. The (theoretical) Tera-FMA/s are exactly a quarter of that value. For the remaining small difference: the RTX 4090 D has 114 SMs instead of the 128 of the RTX 4090.

@Curefab Thank you! I am still a bit confused - why is it a quarter (instead of half)?

From the explanations above, I can understand 1 FMA = 2 FLOPs, because one fused multiply-add consists of 2 floating-point operations.

EDIT: And there are some interesting numbers here: GitHub - mli/transformers-benchmarks: real Transformer TeraFLOPS on various GPUs

  • A100: theory = 312, actual mat mul = 230
  • V100: theory = 125, actual mat mul = 95

And they do not seem to be exactly halves.

Have a look here (Tensor Cores table) - the speed depends on the data type.
With 8.9 Desktop (the 4090 D), the BF16 speed is half the speed of FP16 with FP16 accumulation (which is what the Wikipedia table you linked uses).

So half because of data type, half because of FMA vs. FLOP => a quarter.

(I contributed to those numbers in Wikipedia from the specification datasheets)
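
Putting those factors together for the Wikipedia figure (a sketch; the 330 TFLOPs is the RTX 4090 FP16-with-FP16-accumulation number quoted above):

tensor_compute_4090 = 330    # TFLOPs, FP16 with FP16 accumulation (Wikipedia)
bf16_factor = 0.5            # BF16 runs at half that rate on 8.9 Desktop
fma_factor = 0.5             # 1 FMA = 2 FLOPs
sm_ratio = 114 / 128         # RTX 4090 D vs RTX 4090

print(f"{tensor_compute_4090 * bf16_factor * fma_factor * sm_ratio:.1f} Tera-FMA/s")
# ~73.5, which lines up with the ~68 Tera-FMA/s measured above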

Ah I see, thank you so much!

@Curefab Hi, sorry but I cannot reproduce the “double speed” thing… When testing “fp16 x fp16 = fp16” matrix multiplication, it is as slow as “bf16 x bf16 = bf16” matrix multiplication. I would appreciate it if you could point out where I am going wrong.

Experiments:

Experiment: two matrices of type bfloat16 or float16 are multiplied; they run at the same speed (instead of float16 being twice as fast).

More experiments to ensure I really do “float16 x float16 with a float16 result”:

Nsight:
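
A minimal sketch of the kind of comparison described above (matrix size, iteration count and the CUDA-event timing are my own choices here, not the exact script that was run):

import torch

def time_matmul(dtype, n=8192, iters=10):
    # square matrices of the given dtype; the result dtype matches the input dtype
    a = torch.rand((n, n), device='cuda', dtype=dtype)
    b = torch.rand((n, n), device='cuda', dtype=dtype)
    for _ in range(3):                       # warm-up
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        c = a @ b
    end.record()
    torch.cuda.synchronize()
    assert c.dtype == dtype                  # confirm the output really is fp16 / bf16
    return start.elapsed_time(end) / iters   # milliseconds per matmul

print("bf16:", time_matmul(torch.bfloat16), "ms")
print("fp16:", time_matmul(torch.float16), "ms")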

You could use 1024x1024 matrix sizes so that the L2 caches are better used, and look at Nsight Compute (instead of Nsight Systems) to see what the reason for the slowdown is (e.g. memory bandwidth or compute).
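
For example, a small script like the one below could be profiled under Nsight Compute (matrix size, loop count and the ncu flags in the comment are assumptions on my part):

import torch

n = 1024
a = torch.rand((n, n), device='cuda', dtype=torch.float16)
b = torch.rand((n, n), device='cuda', dtype=torch.float16)

for _ in range(10):          # a handful of launches is enough for Nsight Compute
    c = a @ b
torch.cuda.synchronize()
print(c.sum())

# Possible invocation (flag choice is an assumption):
#   ncu --set full -o mm_1024 python mm_1024.py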

Thanks, I will have a look later.