Large difference between dcgmproftester and specs

I was using dcgmproftester11 on GCP to measure FLOPS on an A100 40GB SXM GPU.

  • instance type: a2-highgpu-1
  • cuda version in VM: 11.2
  • cuda version in dcgm container: 11.4
  • dcgm image: nvcr.io/nvidia/cloud-native/dcgm:2.2.9-ubuntu20.04
  • region: europe-west-4
| Precision         | Specification value | Result obtained on GCP | Difference |
| ----------------- | ------------------- | ---------------------- | ---------- |
| FP16              | 78 TFLOPS           | 24 TFLOPS              | 69.23%     |
| FP32              | 19.5 TFLOPS         | 13 TFLOPS              | 33.33%     |
| FP64              | 9.7 TFLOPS          | 6-7 TFLOPS             | 14.49%     |
| FP16 Tensor Cores | 312 TFLOPS          | 155-160 TFLOPS         | 48.71%     |

spec values from: https://www.nvidia.com/en-us/data-center/a100/

The example in the DCGM docs (the Feature Overview page of the NVIDIA DCGM documentation) shows how to generate load on the A100's Tensor Cores for 30 seconds, and states that the A100 can achieve close to 253 TFLOPS of FP16 performance using the Tensor Cores. But the specification value for FP16 Tensor Cores is 312 TFLOPS, which is almost a 19% difference.
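For reference, the load in that example is generated with dcgmproftester itself; on a CUDA 11 setup the invocation I used is along these lines (field ID 1004 targets the Tensor Core activity metric, `-d` is the duration in seconds; exact flags may differ slightly from the docs example):

```
# Generate FP16 Tensor Core load for 30 seconds (dcgmproftester11 from the DCGM container)
dcgmproftester11 --no-dcgm-validation -t 1004 -d 30
```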

I understand that I will not reach the spec numbers, and that part of this might be on GCP's end, but the differences still seem too large, especially for FP16.

Is dcgmproftester suitable for measuring FP16 FLOPS? What might cause such big differences?

Based on the info you provided, I believe the issue could be virtualization overhead on GCP compute instances, which can be substantial. Either that, or dcgmproftester is simply not a good peak-throughput benchmark. You can try implementing your own benchmark: large matrix multiplications should saturate the compute units and give more representative numbers.
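As a rough sketch of what such a benchmark could look like (not a validated tool; the matrix size, iteration count, and the choice of CUBLAS_COMPUTE_16F are assumptions you may want to tune), a cuBLAS FP16 GEMM loop like the following should get close to the Tensor Core peak:

```cpp
// Minimal FP16 Tensor Core GEMM throughput sketch (assumptions: N=8192, 50 iterations).
// Build: nvcc -O3 -arch=sm_80 gemm_bench.cu -lcublas -o gemm_bench
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 8192;       // square matrices, large enough to saturate the SMs
    const int iters = 50;     // timed iterations
    const double flops_per_gemm = 2.0 * N * N * (double)N;

    // Matrices are left uninitialized: only throughput matters here, not the result.
    __half *A, *B, *C;
    cudaMalloc(&A, (size_t)N * N * sizeof(__half));
    cudaMalloc(&B, (size_t)N * N * sizeof(__half));
    cudaMalloc(&C, (size_t)N * N * sizeof(__half));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    // Warm-up so clocks ramp up and cuBLAS settles on a kernel.
    for (int i = 0; i < 5; ++i) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                     &alpha, A, CUDA_R_16F, N, B, CUDA_R_16F, N,
                     &beta,  C, CUDA_R_16F, N,
                     CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
    }
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                     &alpha, A, CUDA_R_16F, N, B, CUDA_R_16F, N,
                     &beta,  C, CUDA_R_16F, N,
                     CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = flops_per_gemm * iters / (ms / 1e3) / 1e12;
    printf("FP16 GEMM (N=%d): %.1f TFLOPS\n", N, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Dividing 2·N³·iters by the measured time gives the sustained TFLOPS; comparing that number against the 312 TFLOPS spec would tell you whether the gap comes from dcgmproftester or from the instance itself.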