Large difference between dcgmproftester and specs

diana.gaponcic · December 14, 2022, 6:19pm

I used dcgmproftester11 on GCP to measure FLOPS on an A100 40GB SXM GPU.

instance type: a2-highgpu-1g
CUDA version in VM: 11.2
CUDA version in DCGM container: 11.4
DCGM image: nvcr.io/nvidia/cloud-native/dcgm:2.2.9-ubuntu20.04
region: europe-west4

| Precision | Specification value | Result obtained on GCP | Difference (% below spec) |
|---|---|---|---|
| fp16 | 78 TFLOPS | 24 TFLOPS | 69.23% |
| fp32 | 19.5 TFLOPS | 13 TFLOPS | 33.33% |
| fp64 | 9.7 TFLOPS | 6-7 TFLOPS | 14.49% |
| fp16 Tensor Cores | 312 TFLOPS | 155-160 TFLOPS | 48.71% |

spec values from: https://www.nvidia.com/en-us/data-center/a100/
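For context on where the spec values come from: they are theoretical peaks, i.e. 2 FLOPs per fused multiply-add, times the per-SM FMA rate, times the SM count, times the boost clock. The sketch below reproduces them in Python. The 108 SMs and 1410 MHz boost clock are NVIDIA's published A100 figures; the per-SM FMA rates are assumptions chosen so the arithmetic matches the spec values, not numbers taken from this thread. Since the peak scales linearly with clock, a GPU running below boost clock under sustained load would lose a proportional share of throughput, which may account for part of a gap.

```python
# Published A100 figures (assumed here, not stated in the thread).
SM_COUNT = 108
BOOST_CLOCK_HZ = 1.41e9

# Assumed fused multiply-adds per clock per SM, chosen to match the spec values.
FMA_PER_CLOCK_PER_SM = {
    "fp64": 32,
    "fp32": 64,
    "fp16": 256,
    "fp16 Tensor Cores": 1024,
}


def peak_tflops(fma_per_clock_per_sm, sm_count=SM_COUNT, clock_hz=BOOST_CLOCK_HZ):
    """Theoretical peak in TFLOPS; each FMA counts as 2 floating-point operations."""
    return 2 * fma_per_clock_per_sm * sm_count * clock_hz / 1e12


PEAKS = {name: peak_tflops(rate) for name, rate in FMA_PER_CLOCK_PER_SM.items()}

for name, value in PEAKS.items():
    print(f"{name}: {value:.1f} TFLOPS")
```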

The example in the DCGM docs (see Feature Overview, NVIDIA DCGM Documentation) shows how to generate load on the A100 Tensor Cores for 30 seconds. In that example the A100 reaches close to 253 TFLOPS of FP16 performance using the Tensor Cores. The specification value for FP16 Tensor Cores is 312 TFLOPS, so even the documented example falls almost 19% short of spec.
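The percentages here appear to be computed as (spec - measured) / spec. A minimal Python sketch of that arithmetic follows; taking the upper end of each measured range is an assumption, made because it matches the Tensor Core row of the table.

```python
def shortfall_percent(spec_tflops, measured_tflops):
    """Return how far a measurement falls below the spec, as a percentage of the spec."""
    return (spec_tflops - measured_tflops) / spec_tflops * 100


# (spec, measured) in TFLOPS; measured uses the upper end of each reported range.
ROWS = {
    "fp16": (78, 24),
    "fp32": (19.5, 13),
    "fp64": (9.7, 7),
    "fp16 Tensor Cores": (312, 160),
    "fp16 Tensor Cores (DCGM docs example)": (312, 253),
}

for name, (spec, measured) in ROWS.items():
    print(f"{name}: {shortfall_percent(spec, measured):.2f}%")
```

Under this formula the fp64 row comes out at roughly 28% (about 38% if 6 TFLOPS is used), not the 14.49% listed in the table, so that entry may have been computed from different inputs.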

I understand that I won't reach the spec values, and that part of the gap might be on GCP’s end, but the differences still seem too large, especially for FP16.

Is dcgmproftester suitable for FP16 FLOPs counting? What might cause such big differences?

not.your.unfriendly.neigh · December 26, 2022, 3:58pm

Based on the info you provided, I believe the issue could be due to virtualization in GCP compute instances, which can introduce substantial overhead. Either that, or dcgmproftester is not a good benchmarking tool. You can try implementing your own benchmark: large matrix multiplications should saturate the compute units and give you a good throughput measurement.
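A rough sketch of what such a benchmark could look like, in Python. An (m x k) by (k x n) multiplication costs 2*m*n*k floating-point operations, so achieved TFLOPS is that count times the number of iterations, divided by elapsed time. The names `measure_tflops` and `make_cpu_matmul` are hypothetical helpers for illustration, and the pure-Python multiply is only a stand-in so the harness runs anywhere. On a GPU you would pass a callable that launches a cuBLAS-backed half-precision matmul and blocks until the device has finished; otherwise the timer only measures kernel launch.

```python
import time


def matmul_flops(m, n, k):
    """FLOPs in one (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k


def measure_tflops(run_matmul, m, n, k, iters=10, warmup=2):
    """Time a zero-argument callable that performs one matmul; return achieved TFLOPS.

    The callable must not return before the work is complete.
    """
    for _ in range(warmup):  # let clocks and caches settle before timing
        run_matmul()
    start = time.perf_counter()
    for _ in range(iters):
        run_matmul()
    elapsed = time.perf_counter() - start
    return matmul_flops(m, n, k) * iters / elapsed / 1e12


def make_cpu_matmul(m, n, k):
    """Placeholder workload: a naive pure-Python matmul (assumption, not a GPU kernel)."""
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]

    def run():
        b_cols = list(zip(*b))
        return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

    return run


M = N = K = 32
ACHIEVED = measure_tflops(make_cpu_matmul(M, N, K), M, N, K)
print(f"achieved: {ACHIEVED:.6f} TFLOPS")
```

For a meaningful GPU number the matrices need to be large enough to keep every SM busy (thousands of rows and columns rather than 32), and the run should last long enough for clocks to stabilize.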
