My GPU is L4, its whitepaper said tensor core FP16 peak performance is about 121T, but I use cutlass profiler tool and have not seen this performance.
# cmake .. -DCUTLASS_NVCC_ARCHS='89' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
# make cutlass_profiler -j16
# ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=4096 --n=4096 --k=4096
The best performace is as follows:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_s16816gemm_f16_128x256_32x3_nt_align8
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=universal --m=4096 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --D=f32:column \
--alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--op_class=tensorop --accum=f32 --cta_m=128 --cta_n=256 --cta_k=32 --cluster_m=1 --cluster_n=1 --cluster_k=1 \
--stages=3 --warps_m=2 --warps_n=4 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024 \
Bytes: 134217728 bytes
FLOPs: 137472507904 flops
FLOPs/Byte: 1024
Runtime: 1.80919 ms
Memory: 69.0916 GiB/s
Math: 75985.5 GFLOP/s
So, my question is: “Is it possible to reach peak performance of gpu tensor core by using cutlass profiler or is that L4’s tensor core’s best performance, we can never see its peak performance in real application use?”