I used dcgmproftester11 on GCP to measure FLOPS on an A100 40GB SXM GPU.
- Instance type: a2-highgpu-1g
- CUDA version in VM: 11.2
- CUDA version in DCGM container: 11.4
- DCGM image:
- Region: europe-west4

| Precision | Specification value | Result obtained on GCP | Difference |
|---|---|---|---|
| FP16 | 78 TFLOPS | 24 TFLOPS | 69.23% |
| FP32 | 19.5 TFLOPS | 13 TFLOPS | 33.33% |
| FP64 | 9.7 TFLOPS | 6-7 TFLOPS | 27.84-38.14% |
| FP16 Tensor Cores | 312 TFLOPS | 155-160 TFLOPS | 48.72-50.32% |

Spec values from: https://www.nvidia.com/en-us/data-center/a100/
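For anyone who wants to check the Difference column, here is a minimal sketch of how I computed it: the shortfall of the measured value relative to the spec value, as a percentage. The function name `shortfall_pct` is only illustrative. Where a measurement was a range, both ends are evaluated.

```python
def shortfall_pct(spec_tflops, measured_tflops):
    """Percentage by which the measured value falls short of the spec value."""
    return (spec_tflops - measured_tflops) / spec_tflops * 100.0


# (spec, lowest measurement, highest measurement) in TFLOPS, from the table above
results = {
    "FP16": (78.0, 24.0, 24.0),
    "FP32": (19.5, 13.0, 13.0),
    "FP64": (9.7, 6.0, 7.0),
    "FP16 Tensor Cores": (312.0, 155.0, 160.0),
}

for name, (spec, low, high) in results.items():
    smallest_gap = shortfall_pct(spec, high)  # best case: highest measurement
    largest_gap = shortfall_pct(spec, low)    # worst case: lowest measurement
    if low == high:
        print(f"{name}: {smallest_gap:.2f}% below spec")
    else:
        print(f"{name}: {smallest_gap:.2f}% to {largest_gap:.2f}% below spec")
```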
The example in the DCGM documentation (see the Feature Overview section) shows how to generate load on the Tensor Cores of an A100 for 30 seconds, and reports that the A100 achieves close to 253 TFLOPS of FP16 performance using the Tensor Cores. The specification value for FP16 Tensor Cores is 312 TFLOPS, though, so even the documented result is almost 19% below spec.
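To make the runs easier to reproduce, here is a rough sketch of the kind of invocation involved. The flags (`--no-dcgm-validation`, `-t`, `-d`) and the profiling field IDs are what I understand from the DCGM documentation, so treat them as assumptions and check them against your DCGM version. The helper `build_command` is hypothetical and only assembles the command line; it does not run anything.

```python
# DCGM profiling field IDs, assumed from the DCGM field identifier list.
FIELD_IDS = {
    "tensor": 1004,  # DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
    "fp64": 1006,    # DCGM_FI_PROF_PIPE_FP64_ACTIVE
    "fp32": 1007,    # DCGM_FI_PROF_PIPE_FP32_ACTIVE
    "fp16": 1008,    # DCGM_FI_PROF_PIPE_FP16_ACTIVE
}


def build_command(pipe, seconds=30, binary="dcgmproftester11"):
    """Assemble a dcgmproftester command line for one pipe (hypothetical helper)."""
    return [
        binary,
        "--no-dcgm-validation",
        "-t", str(FIELD_IDS[pipe]),  # which metric/pipe to generate load for
        "-d", str(seconds),          # test duration in seconds
    ]


if __name__ == "__main__":
    for pipe in FIELD_IDS:
        print(" ".join(build_command(pipe)))
```

Each printed line can be pasted into a shell inside the DCGM container, or the list can be passed to `subprocess.run` on a machine where the binary is installed.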
I understand that I will not reach the spec values, and that part of the gap might come from GCP's end, but the differences still seem too large, especially for FP16.
Is dcgmproftester suitable for measuring FP16 FLOPS? What might cause such large differences?