Is there any official benchmark tool to test a GPU's FLOPS?

My GPU is an NVIDIA L4. Its whitepaper lists a Tensor Core FP16 peak of about 121 TFLOPS, but I have not been able to reach that figure with the CUTLASS profiler, which I built and ran as follows:

# cmake .. -DCUTLASS_NVCC_ARCHS='89' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
# make cutlass_profiler -j16
# ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=4096 --n=4096 --k=4096

The best-performing kernel reports the following:

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_s16816gemm_f16_128x256_32x3_nt_align8

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=4096 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --D=f32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=256 --cta_k=32 --cluster_m=1 --cluster_n=1 --cluster_k=1  \
                  --stages=3 --warps_m=2 --warps_n=4 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024  \


           Bytes: 134217728  bytes
           FLOPs: 137472507904  flops
      FLOPs/Byte: 1024

         Runtime: 1.80919  ms
          Memory: 69.0916 GiB/s

            Math: 75985.5 GFLOP/s
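
To put that number in context, here is a minimal Python sketch that recomputes the throughput from the reported runtime and compares it with the whitepaper figure. All inputs are copied from the output above. The FLOP formula (2*M*N*K for the multiply-accumulates plus 2*M*N for the alpha/beta epilogue) is my assumption about how the profiler counts; it does reproduce the reported FLOPs value exactly.

```python
# Sanity check of the profiler report (values copied from the output above).
M = N = K = 4096
runtime_ms = 1.80919            # "Runtime" line
reported_flops = 137472507904   # "FLOPs" line
peak_tflops = 121.0             # FP16 Tensor Core peak quoted from the whitepaper

# Assumption: 2*M*N*K for the multiply-accumulates plus 2*M*N for the
# alpha/beta epilogue. This matches the reported FLOPs count.
flops = 2 * M * N * K + 2 * M * N

# Convert runtime to seconds, then FLOP/s to TFLOPS.
achieved_tflops = flops / (runtime_ms * 1e-3) / 1e12
fraction_of_peak = achieved_tflops / peak_tflops

print(f"FLOPs matches report: {flops == reported_flops}")
print(f"achieved: {achieved_tflops:.1f} TFLOPS")
print(f"fraction of quoted peak: {fraction_of_peak:.1%}")
```

By this calculation the best kernel reaches roughly 63% of the quoted peak.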

So my question is: can the CUTLASS profiler reach the GPU's Tensor Core peak performance, or is roughly 76 TFLOPS the best the L4's Tensor Cores can actually deliver, meaning the whitepaper peak is never seen in real applications?