I want to test a GPU’s FLOPS, including both the CUDA cores and the Tensor Cores. Do the cuBLAS or CUTLASS libraries provide an easy-to-use tool to measure FLOPS quickly? I noticed that there is a CUTLASS profiler, but I don’t know whether it is accurate. Any experience is much appreciated!
The CUTLASS profiler is pretty accurate. You can also check out NVBench for timing, but you’ll need to do the GFLOP/s math yourself.
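For a GEMM, the operation count is 2·M·N·K (one multiply plus one add per multiply-accumulate), so converting a measured runtime into GFLOP/s is a one-liner. A sketch, using placeholder problem sizes and a placeholder runtime that you would replace with your own measurements:

```shell
# FLOPs for an M x N x K GEMM: 2*M*N*K (multiply + add per MAC).
# m, n, k, and t_ms below are example values; substitute your own.
awk -v m=4096 -v n=4096 -v k=4096 -v t_ms=1.80919 'BEGIN {
  gflops = 2.0 * m * n * k / (t_ms * 1e-3) / 1e9
  printf "%.1f GFLOP/s\n", gflops
}'
```

Note this counts only the main-loop math; profilers may also count the epilogue (e.g. beta·C), which adds a small 2·M·N term.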
My GPU is an L4. Its whitepaper says the Tensor Core FP16 peak performance is about 121 TFLOPS, but using the CUTLASS profiler I have not seen that performance.
# cmake .. -DCUTLASS_NVCC_ARCHS='89' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
# make cutlass_profiler -j16
# ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=4096 --n=4096 --k=4096
The best performance is as follows:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_s16816gemm_f16_128x256_32x3_nt_align8
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=universal --m=4096 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --D=f32:column \
--alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--op_class=tensorop --accum=f32 --cta_m=128 --cta_n=256 --cta_k=32 --cluster_m=1 --cluster_n=1 --cluster_k=1 \
--stages=3 --warps_m=2 --warps_n=4 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024 \
Bytes: 134217728 bytes
FLOPs: 137472507904 flops
FLOPs/Byte: 1024
Runtime: 1.80919 ms
Memory: 69.0916 GiB/s
Math: 75985.5 GFLOP/s
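Putting the measured number next to the whitepaper figure (121 TFLOPS = 121000 GFLOP/s, per the L4 whitepaper; the measured value is from the profiler output above):

```shell
# Fraction of the whitepaper FP16 Tensor Core peak actually achieved.
awk -v measured=75985.5 -v peak=121000 'BEGIN {
  printf "%.1f%% of peak\n", 100.0 * measured / peak
}'
```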
So my question is: is it possible to reach the GPU’s Tensor Core peak performance using the CUTLASS profiler? Or is this the L4 Tensor Core’s best achievable performance, and we can never see the theoretical peak in real application use?
You won’t be able to achieve that performance on L4. There are a few reasons for this.
- No GPU delivers peak theoretical throughput.
- The L4 has a low power limit (~70W) that constrains how closely it can approach the theoretical peak. All GPUs exhibit this to some degree (they tend to enter a power-limiting state when performing large/continuous/repeated GEMM-type operations). You can confirm this by running
nvidia-smi -a
while the test is running and looking at the “Clocks Throttle Reasons” section. There are many reports like this; here is one for the T4. Here is another recent report on the A10.
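A convenient way to watch for power throttling during the run is to query just the relevant fields instead of scanning the full `nvidia-smi -a` dump. A sketch (the field names are standard `--query-gpu` properties listed by `nvidia-smi --help-query-gpu`; add `-l 1` to poll every second):

```shell
# Report the software power-cap throttle flag, power draw, and SM clock.
# Falls back to a message on machines without an NVIDIA driver.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi \
    --query-gpu=clocks_throttle_reasons.sw_power_cap,power.draw,clocks.sm \
    --format=csv
else
  echo "nvidia-smi not found (no NVIDIA driver on this machine)"
fi
```

If `clocks_throttle_reasons.sw_power_cap` reads `Active` while the GEMM is running, the clocks (and therefore the achievable FLOP/s) are being capped by the power limit.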