Is there any official benchmark tool to test a GPU's FLOPS?

spring_wind · October 24, 2023, 2:31am

My GPU is an L4. Its whitepaper says the Tensor Core FP16 peak performance is about 121 TFLOPS, but when I run the CUTLASS profiler I do not see anything close to that.

# cmake .. -DCUTLASS_NVCC_ARCHS='89' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
# make cutlass_profiler -j16
# ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=4096 --n=4096 --k=4096

The best result is as follows:

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_s16816gemm_f16_128x256_32x3_nt_align8

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=4096 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --D=f32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=256 --cta_k=32 --cluster_m=1 --cluster_n=1 --cluster_k=1  \
                  --stages=3 --warps_m=2 --warps_n=4 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024  \


           Bytes: 134217728  bytes
           FLOPs: 137472507904  flops
           FLOPs/Byte: 1024

         Runtime: 1.80919  ms
          Memory: 69.0916 GiB/s

            Math: 75985.5 GFLOP/s
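For reference, the reported numbers can be cross-checked by hand. The sketch below is only an illustration: the FLOP and byte formulas are my assumption, chosen because they reproduce the totals printed above (2*m*n*k plus a 2*m*n epilogue term, and A, B in f16 plus D in f32 with C not read since beta=0). With those, the measured rate works out to roughly 63% of the 121 TFLOPS figure.

```python
# Values copied from the profiler output above.
M = N = K = 4096
RUNTIME_MS = 1.80919
PEAK_TFLOPS = 121.0  # whitepaper figure quoted in the question


def gemm_flops(m, n, k):
    # Assumed formula: 2*m*n*k for the multiply-accumulates plus
    # 2*m*n for the alpha/beta epilogue. It matches the reported total.
    return 2 * m * n * k + 2 * m * n


def gemm_bytes(m, n, k, ab_size=2, d_size=4):
    # Assumed formula: A and B are f16 (2 bytes), D is f32 (4 bytes),
    # and C is not read because beta=0. It matches the reported total.
    return ab_size * (m * k + k * n) + d_size * m * n


flops = gemm_flops(M, N, K)
nbytes = gemm_bytes(M, N, K)
seconds = RUNTIME_MS * 1e-3

gflops = flops / seconds / 1e9          # decimal giga
gib_per_s = nbytes / seconds / 2**30    # binary gibi, as the profiler prints
fraction_of_peak = gflops / (PEAK_TFLOPS * 1e3)

print(f"FLOPs:  {flops}")
print(f"Bytes:  {nbytes}")
print(f"Math:   {gflops:.1f} GFLOP/s")
print(f"Memory: {gib_per_s:.4f} GiB/s")
print(f"Fraction of quoted peak: {fraction_of_peak:.1%}")
```

The small difference from the printed Math value comes from the runtime being rounded to six significant digits in the report.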

So my question is: is it possible to reach the GPU's peak Tensor Core performance with the CUTLASS profiler? Or is this the best the L4's Tensor Cores can actually deliver, meaning we will never see the peak figure in a real application?
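One way to frame the gap is to ask what clock the measured rate corresponds to. The sketch below is not from the whitepaper text quoted here; the SM count (58), boost clock (2040 MHz) and dense FP16 Tensor Core rate (1024 FLOPs per clock per SM) are assumptions taken from public L4 and Ada specifications, picked because together they reproduce the roughly 121 TFLOPS figure. If they hold, a peak number like this presumes the boost clock is sustained for the whole run.

```python
# ASSUMED hardware figures (not stated in this post); verify against
# the L4 datasheet before relying on them.
SM_COUNT = 58                              # assumed number of SMs on L4
BOOST_CLOCK_HZ = 2.04e9                    # assumed boost clock (2040 MHz)
FP16_TENSOR_FLOPS_PER_CLK_PER_SM = 1024    # assumed dense FP16 rate per SM

MEASURED_GFLOPS = 75985.5                  # Math value from the profiler run


def peak_tflops(sm_count, flops_per_clk_per_sm, clock_hz):
    # Theoretical peak = SMs * FLOPs per clock per SM * clock frequency.
    return sm_count * flops_per_clk_per_sm * clock_hz / 1e12


def implied_clock_ghz(measured_gflops, sm_count, flops_per_clk_per_sm):
    # Clock at which a perfectly efficient kernel would hit the measured rate.
    return measured_gflops * 1e9 / (sm_count * flops_per_clk_per_sm) / 1e9


peak = peak_tflops(SM_COUNT, FP16_TENSOR_FLOPS_PER_CLK_PER_SM, BOOST_CLOCK_HZ)
clock = implied_clock_ghz(MEASURED_GFLOPS, SM_COUNT,
                          FP16_TENSOR_FLOPS_PER_CLK_PER_SM)

print(f"Theoretical peak at boost clock: {peak:.2f} TFLOPS")
print(f"Clock implied by measured rate:  {clock:.2f} GHz")
```

Under these assumptions, comparing the implied clock with the SM clock actually observed during the run (for example with nvidia-smi) would indicate how much of the gap is kernel inefficiency and how much is the GPU running below its boost clock.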
