I want to test a GPU’s FLOPS, including both the CUDA cores and the Tensor Cores. Do the cuBLAS or CUTLASS libraries provide an easy-to-use tool to measure FLOPS quickly? I noticed that there is a CUTLASS profiler, but I don’t know whether it is accurate. Any experience is much appreciated!
The CUTLASS profiler is pretty accurate. You can also check out NVBench for timing, but you’ll need to do the GFLOP/s math yourself.
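For a GEMM, the operation count is 2·M·N·K (one multiply plus one add per multiply-accumulate), so converting a measured runtime into GFLOP/s is a one-liner. A sketch, using placeholder problem sizes and a placeholder runtime that you would replace with your own measurements:

```shell
# FLOPs for an M x N x K GEMM: 2*M*N*K (multiply + add per MAC).
# m, n, k, and t_ms below are example values; substitute your own.
awk -v m=4096 -v n=4096 -v k=4096 -v t_ms=1.80919 'BEGIN {
  gflops = 2.0 * m * n * k / (t_ms * 1e-3) / 1e9
  printf "%.1f GFLOP/s\n", gflops
}'
```

Note this counts only the main-loop math; profilers may also count the epilogue (e.g. beta·C), which adds a small 2·M·N term.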
My GPU is an L4. Its whitepaper says the Tensor Core FP16 peak performance is about 121 TFLOPS, but using the CUTLASS profiler I have not seen that performance.
# cmake .. -DCUTLASS_NVCC_ARCHS='89' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
# make cutlass_profiler -j16
# ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=4096 --n=4096 --k=4096
The best performance is as follows:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_s16816gemm_f16_128x256_32x3_nt_align8
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=universal --m=4096 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --D=f32:column \
--alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--op_class=tensorop --accum=f32 --cta_m=128 --cta_n=256 --cta_k=32 --cluster_m=1 --cluster_n=1 --cluster_k=1 \
--stages=3 --warps_m=2 --warps_n=4 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024 \
Bytes: 134217728 bytes
FLOPs: 137472507904 flops
FLOPs/Byte: 1024
Runtime: 1.80919 ms
Memory: 69.0916 GiB/s
Math: 75985.5 GFLOP/s
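Putting the measured number next to the whitepaper figure (121 TFLOPS = 121000 GFLOP/s, per the L4 whitepaper; the measured value is from the profiler output above):

```shell
# Fraction of the whitepaper FP16 Tensor Core peak actually achieved.
awk -v measured=75985.5 -v peak=121000 'BEGIN {
  printf "%.1f%% of peak\n", 100.0 * measured / peak
}'
```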
So my question is: is it possible to reach the GPU’s Tensor Core peak performance using the CUTLASS profiler? Or is this the L4 Tensor Core’s best achievable performance, and we can never see the theoretical peak in real application use?
You won’t be able to achieve that performance on L4. There are a few reasons for this.
- No GPU delivers peak theoretical throughput.
- The L4 has a low power limit (~70W) that constrains how closely it can approach the theoretical peak. All GPUs exhibit this to some degree (they tend to enter a power-limiting state when performing large/continuous/repeated GEMM-type operations). You can confirm this by running
nvidia-smi -a
while the test is running and looking at the “Clocks Throttle Reasons” section. There are many reports like this; here is one for the T4. Here is another recent report on the A10.
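A convenient way to watch for power throttling during the run is to query just the relevant fields instead of scanning the full `nvidia-smi -a` dump. A sketch (the field names are standard `--query-gpu` properties listed by `nvidia-smi --help-query-gpu`; add `-l 1` to poll every second):

```shell
# Report the software power-cap throttle flag, power draw, and SM clock.
# Falls back to a message on machines without an NVIDIA driver.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi \
    --query-gpu=clocks_throttle_reasons.sw_power_cap,power.draw,clocks.sm \
    --format=csv
else
  echo "nvidia-smi not found (no NVIDIA driver on this machine)"
fi
```

If `clocks_throttle_reasons.sw_power_cap` reads `Active` while the GEMM is running, the clocks (and therefore the achievable FLOP/s) are being capped by the power limit.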