Thanks Kaka, I understand now. You can use the nvprof CUDA profiler tool to capture Tensor Core usage while your application runs. nvprof supports two metrics for Tensor Core utilization:
- tensor_precision_fu_utilization: The utilization level of the multiprocessor function units that execute floating-point tensor core instructions on a scale of 0 to 10
- tensor_int_fu_utilization: The utilization level of the multiprocessor function units that execute int8 tensor core instructions on a scale of 0 to 10
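For reference, here is a minimal sketch of the kind of kernel the first metric measures: one warp performing a single 16x16x16 FP16 matrix multiply through the nvcuda::wmma API. The file and kernel names are only illustrative (this is not one of the CUDA samples), and it assumes CUDA 9.0 or later on a Tensor Core GPU (sm_70 or newer, e.g. Xavier's sm_72):

// minimal_wmma.cu -- illustrative example, not part of the CUDA samples.
// One warp performs a single 16x16x16 FP16 matrix multiply via the WMMA API.
// Build with: nvcc -arch=sm_72 minimal_wmma.cu -o minimal_wmma
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_fp16(const half *a, const half *b, float *c) {
    // Fragments for one 16x16x16 tile: FP16 inputs, FP32 accumulator.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // compiles to HMMA instructions
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b;
    float *c;
    // Inputs are left uninitialized; for profiling we only care about
    // the instructions the kernel issues.
    cudaMalloc(&a, 16 * 16 * sizeof(half));
    cudaMalloc(&b, 16 * 16 * sizeof(half));
    cudaMalloc(&c, 16 * 16 * sizeof(float));
    wmma_fp16<<<1, 32>>>(a, b, c);  // WMMA requires a full warp of 32 threads
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Profiling this binary with the same --metrics flags should register activity on tensor_precision_fu_utilization, although a kernel this tiny will report a low utilization level.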
Here is example output from running it on the cudaTensorCoreGemm sample, which uses HMMA FP16 operations:
$ sudo /usr/local/cuda/bin/nvprof --kernels compute_gemm --metrics tensor_precision_fu_utilization,tensor_int_fu_utilization ./cudaTensorCoreGemm
Initializing...
==24384== NVPROF is profiling process 24384, command: ./cudaTensorCoreGemm
GPU Device 0: "Xavier" with compute capability 7.2
M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
==24384== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "compute_gemm(__half const *, __half const *, float const *, float*, float, float)" (done)
Time: 1086.695679 ms
TFLOPS: 0.13
==24384== Profiling application: ./cudaTensorCoreGemm
==24384== Profiling result:
==24384== Metric result:
Invocations   Metric Name                        Metric Description                           Min       Max       Avg
Device "Xavier (0)"
    Kernel: compute_gemm(__half const *, __half const *, float const *, float*, float, float)
          1   tensor_precision_fu_utilization    Tensor-Precision Function Unit Utilization   Mid (5)   Mid (5)   Mid (5)
          1   tensor_int_fu_utilization          Tensor-Int Function Unit Utilization         Idle (0)  Idle (0)  Idle (0)
Note that in this example the tensor_int_fu_utilization metric is shown as Idle (0), because the sample uses HMMA FP16 operations rather than IMMA INT8 operations.
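To drive the tensor_int_fu_utilization metric instead, a kernel has to issue IMMA instructions, i.e. use INT8 inputs with an INT32 accumulator. A minimal sketch, assuming CUDA 10.0 or later (which added the integer WMMA fragment types supported on Xavier's sm_72); compared to the FP16 kernel above, only the element types change:

// Same structure as the FP16 kernel above, but with INT8 inputs and INT32
// accumulation; wmma::mma_sync here compiles to IMMA instructions, which is
// what tensor_int_fu_utilization measures.
__global__ void wmma_int8(const signed char *a, const signed char *b, int *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c_frag;

    wmma::fill_fragment(c_frag, 0);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // IMMA instructions
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}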