Is there any official benchmark tool to test a GPU's FLOPS?

I want to test a GPU’s FLOPS for both the CUDA cores and the Tensor Cores. Do libraries such as cuBLAS or CUTLASS provide an easy-to-use tool for measuring FLOPS quickly? I noticed there is a CUTLASS profiler, but I don’t know whether it is accurate. Any experience is much appreciated!


CUTLASS Profiler is pretty accurate. You can also check out NVBench for timing, but you’ll need to do the GFLOP/s math yourself.
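For reference, the usual convention for a plain GEMM is 2*M*N*K floating-point operations (one multiply and one add per inner-product term), so converting a measured runtime into GFLOP/s is a one-liner. A minimal Python sketch, with placeholder sizes and a placeholder runtime that you would replace with your own measurement:

# GFLOP/s math for an M x N x K GEMM timed with NVBench (or any other timer).
# Placeholder values; substitute your own problem size and measured runtime.
M, N, K = 4096, 4096, 4096
runtime_s = 1.8e-3                # measured kernel time in seconds
flops = 2 * M * N * K             # one multiply + one add per inner-product term
print(f"{flops / runtime_s / 1e9:.1f} GFLOP/s")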

My GPU is an L4. Its whitepaper says the Tensor Core FP16 peak performance is about 121 TFLOPS, but using the CUTLASS profiler I have not seen that performance.

# cmake .. -DCUTLASS_NVCC_ARCHS='89' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
# make cutlass_profiler -j16
# ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=4096 --n=4096 --k=4096

The best performance is as follows:

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_s16816gemm_f16_128x256_32x3_nt_align8

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=4096 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --D=f32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=256 --cta_k=32 --cluster_m=1 --cluster_n=1 --cluster_k=1  \
                  --stages=3 --warps_m=2 --warps_n=4 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024  \


           Bytes: 134217728  bytes
           FLOPs: 137472507904  flops
           FLOPs/Byte: 1024

         Runtime: 1.80919  ms
          Memory: 69.0916 GiB/s

            Math: 75985.5 GFLOP/s
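
That works out to roughly 63% of the whitepaper number. A quick sanity check of the reported figures in Python (the 121 TFLOPS peak is the whitepaper value, not something I measured):

# Sanity-check the profiler's reported figures against the whitepaper peak.
reported_flops = 137_472_507_904   # FLOPs reported by cutlass_profiler above
runtime_s = 1.80919e-3             # reported runtime in seconds
peak_tflops = 121.0                # L4 whitepaper FP16 Tensor Core peak (dense)
achieved_tflops = reported_flops / runtime_s / 1e12
print(f"{achieved_tflops:.1f} TFLOP/s, {100 * achieved_tflops / peak_tflops:.0f}% of peak")
# prints: 76.0 TFLOP/s, 63% of peak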

So my question is: is it possible to reach the peak performance of the GPU’s Tensor Cores using the CUTLASS profiler, or is this about the best the L4’s Tensor Cores can do, meaning we will never see the peak number in real application use?

You won’t be able to achieve that performance on the L4. There are a few reasons for this.

  1. No GPU delivers peak theoretical throughput.
  2. The L4 has a power limit (~70 W) that constrains how closely it can approach its theoretical peak. All GPUs exhibit this to some degree: they tend to enter a power-limiting state when running large, continuous, repeated GEMM-type operations. You can confirm this by running nvidia-smi -a while the test is running and looking at the “Clocks Throttle Reasons” section (see the sketch after this list). There are many reports like this; here is one for the T4, and here is another recent report on the A10.
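
If it helps, here is a minimal Python sketch for watching this while the profiler runs in another terminal. It simply polls nvidia-smi once per second; the query fields (clocks.sm, power.draw, clocks_throttle_reasons.active) are standard nvidia-smi --query-gpu fields, and the example output in the comment is illustrative rather than captured from an L4:

# Poll SM clock, power draw, and the active throttle-reason bitmask once per
# second. Run this while the benchmark executes in another terminal; Ctrl-C to stop.
import subprocess, time

QUERY = "clocks.sm,power.draw,clocks_throttle_reasons.active"

while True:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out)  # e.g. "1050 MHz, 71.02 W, 0x0000000000000004" (0x4 = SW power cap)
    time.sleep(1)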