Why is matrix multiplication quite slow and why does all hardware seem to be only half-used?

Hi, thanks for the ecosystem! I tried the very simple matrix multiplication below:

import torch
mat_size = 65536
a = torch.rand((mat_size, mat_size), device='cuda', dtype=torch.bfloat16)
torch.cuda.cudart().cudaProfilerStart()
for i in range(4):
    a = a @ a  # repeatedly square the matrix: each iteration is one 65536 x 65536 matmul
print(a.sum())  # printing the CUDA tensor forces a synchronization
torch.cuda.cudart().cudaProfilerStop()

And profiled it with:

nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o /host_home/research/synced_temp/nsys_profile_output -f true --capture-range=cudaProfilerApi -x true --gpu-metrics-devices=all python myfile.py

And got this Nsight Systems result:

It seems no component is fully used. For example:

  • Tensor core: 50%
  • SM warps: only 17% (others are unallocated warps)
  • DRAM bandwidth: 9.9% (read) + 2.1% (write)

So I wonder: what is blocking the code from executing faster? Thanks for any suggestions!

My personal guess is that it is some kind of overhead, but I wonder whether that can really be as large as 50%…
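
For reference, here is a minimal event-based timing sketch to cross-check the per-iteration time outside of any profiler (same size and dtype as above; the CUDA-event timing and the throughput print are illustrative additions, not part of the original script):

import torch

mat_size = 65536
a = torch.rand((mat_size, mat_size), device='cuda', dtype=torch.bfloat16)

a = a @ a                     # warm-up (cuBLAS init / heuristics)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
a = a @ a                     # one timed 65536 x 65536 matmul
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000.0   # elapsed_time() returns milliseconds
fma = mat_size ** 3                          # one FMA per multiply-accumulate
print(f"{seconds:.2f} s -> {fma / seconds / 1e12:.1f} Tera-FMA/s "
      f"= {2 * fma / seconds / 1e12:.1f} TFLOPS")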

Not sure about this case, but in the past there were situations on some GPUs where the tensor cores were shown as exactly 50% used even though they were running at their maximum.

The reason was either

  • a wrong maximum in the GPU definitions of the Nsight tool
  • different occupancy levels for different parts of the Tensor Core pipeline (e.g. some parts were occupied 50% and others 100%; Nsight just shows the 50%, but the speed was limited by the 100% part)
  • (perhaps some product differentiation and artificial slowdown?)

Could you compare the resulting speed of the computation with your GPU's specified maximum matrix multiplication speed for the data type used?

BF16 speed in FMA/SM/cycle for each architecture (to the best of my knowledge, no guarantee):

  • 7.5 Desktop/8.6 Desktop/8.9 Desktop: 256
  • 7.5 Workstation/8.9 Workstation: 512
  • 8.0/8.6 Workstation/8.7: 1024
  • 9.0: 2048
  • 10.0: 4096

(BF16 support on 7.5 is quite unofficial)

Multiplying two 65536 x 65536 square matrices needs 65536*65536*65536 = 2^48 FMA operations, if that is what your code does.
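
As a quick arithmetic check of that count (plain Python, no GPU involved; purely illustrative):

n = 65536                          # 2 ** 16
fma_per_matmul = n ** 3            # n^2 output elements, each needing n multiply-adds
print(fma_per_matmul)              # 281474976710656
print(fma_per_matmul == 2 ** 48)   # True
# Counting the multiply and the add separately would give 2 * n**3 FLOPs instead.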

[edit: corrected the Ada generation values]

@Curefab Thank you for the quick reply!

From the screenshot, each iteration takes 3.75 s. Since each iteration is 2^48 = 256T operations, the actual TFLOPS is 256/3.75 ≈ 68.

This is a 4090D, thus the theoretical TFLOPS is a bit under 330.

Also, in GitHub - mli/transformers-benchmarks: real Transformer TeraFLOPS on various GPUs, people benchmarked the 4090 and got 172 TFLOPS.

Hmm, so (1) their benchmark says actual = theoretical x 0.5, and (2) my code here shows actual = theoretical x 0.25.

Wait a minute, I will debug this.
I see: that benchmark counts an n x n matrix multiplication as 2*n**3 FLOPs, while your suggestion is 1*n**3. So I wonder which one is correct?

I think I made a mistake with the 8.9 Desktop and 8.9 Workstation speeds; they should be half of the numbers I originally stated - 256 for Desktop (the list above has been corrected).

See the 4080 in

TFLOPs numbers are double the FMA numbers, as an FMA is a multiplication plus an addition, which counts as 2 FLOPs.

The 4090 D has 114 SMs with a boost clock of 2520 MHz (if it is actually running at boost clock): that is 114 * 256 * 2520 million ≈ 73.5 Tera-FMA/s, which fits the 68 Tera-FMA/s you are seeing.

And the resulting 147 TFLOPs also fit approximately with the 172 TFLOPs from that benchmark.
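
The same back-of-the-envelope calculation as a tiny script (SM count, FMA/SM/cycle value and boost clock taken from the figures above):

sms = 114                      # RTX 4090 D
fma_per_sm_per_cycle = 256     # BF16 on compute capability 8.9 Desktop (list above)
boost_clock_hz = 2520e6        # 2520 MHz

tera_fma_per_s = sms * fma_per_sm_per_cycle * boost_clock_hz / 1e12
print(f"{tera_fma_per_s:.1f} Tera-FMA/s")          # ~73.5
print(f"{2 * tera_fma_per_s:.0f} TFLOPs (BF16)")   # ~147 (1 FMA = 2 FLOPs)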

@Curefab Thank you for the insights! Then it is interesting to see GeForce 40 series - Wikipedia saying: “tensor compute: 330”. How can that be possible :/

The Tensor Compute on this page is stated in TFLOPs for the FP16 type with FP16 accumulation. The (theoretical) Tera-FMA/s are exactly a quarter of that value. For the remaining small difference: the RTX 4090 D has 114 SMs instead of the 128 of the RTX 4090.

@Curefab Thank you! I am still a bit confused - why is it a quarter (instead of half)?

From the explanations above, I can understand 1 FMA = 2 FLOPs, because one fused multiply-add consists of 2 floating-point operations.

EDIT: And there are some interesting numbers here: GitHub - mli/transformers-benchmarks: real Transformer TeraFLOPS on various GPUs

  • A100: theory = 312, actual mat mul = 230
  • V100: theory = 125, actual mat mul = 95

And they do not seem to be exactly halves.

Have a look here (Tensor Cores table) - the speed depends on the data type.
With 8.9 Desktop (the 4090 D), the BF16 speed is half the speed of FP16 with FP16 accumulation (which is what the Wikipedia table you linked uses).

So half because of data type, half because of FMA vs. FLOP => a quarter.

(I contributed to those numbers in Wikipedia from the specification datasheets)
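
Putting those factors together for the Wikipedia figure (a sketch; the 330 TFLOPs is the RTX 4090 FP16-with-FP16-accumulation number quoted above):

tensor_compute_4090 = 330    # TFLOPs, FP16 with FP16 accumulation (Wikipedia)
bf16_factor = 0.5            # BF16 runs at half that rate on 8.9 Desktop
fma_factor = 0.5             # 1 FMA = 2 FLOPs
sm_ratio = 114 / 128         # RTX 4090 D vs RTX 4090

print(f"{tensor_compute_4090 * bf16_factor * fma_factor * sm_ratio:.1f} Tera-FMA/s")
# ~73.5, which lines up with the ~68 Tera-FMA/s measured above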

Ah I see, thank you so much!

@Curefab Hi, sorry but I cannot reproduce the “double speed” thing… When testing “fp16 x fp16 = fp16” matrix multiplication, it is as slow as “bf16 x bf16 = bf16” matrix multiplication. I would appreciate it if you could point out where I am going wrong.

Experiments:

Experiment: two matrices of type bfloat16 or float16 are multiplied; they run at the same speed (instead of float16 being twice as fast).

More experiments to ensure I really do “float16 x float16 with a float16 result”:

Nsight:
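
A minimal sketch of the kind of comparison described above (matrix size, iteration count and the CUDA-event timing are my own choices here, not the exact script that was run):

import torch

def time_matmul(dtype, n=8192, iters=10):
    # square matrices of the given dtype; the result dtype matches the input dtype
    a = torch.rand((n, n), device='cuda', dtype=dtype)
    b = torch.rand((n, n), device='cuda', dtype=dtype)
    for _ in range(3):                       # warm-up
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        c = a @ b
    end.record()
    torch.cuda.synchronize()
    assert c.dtype == dtype                  # confirm the output really is fp16 / bf16
    return start.elapsed_time(end) / iters   # milliseconds per matmul

print("bf16:", time_matmul(torch.bfloat16), "ms")
print("fp16:", time_matmul(torch.float16), "ms")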

You could use 1024x1024 matrix sizes so that the L2 caches are better used, and look at Nsight Compute (instead of Nsight Systems) to see what the reason for the slowdown is (e.g. memory bandwidth or compute).
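
For example, a small script like the one below could be profiled under Nsight Compute (matrix size, loop count and the ncu flags in the comment are assumptions on my part):

import torch

n = 1024
a = torch.rand((n, n), device='cuda', dtype=torch.float16)
b = torch.rand((n, n), device='cuda', dtype=torch.float16)

for _ in range(10):          # a handful of launches is enough for Nsight Compute
    c = a @ b
torch.cuda.synchronize()
print(c.sum())

# Possible invocation (flag choice is an assumption):
#   ncu --set full -o mm_1024 python mm_1024.py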

Thanks, I will have a look later.