How to confirm whether Tensor Cores are being used

Thanks Kaka, I understand now. You can use the nvprof CUDA profiler tool to capture the Tensor Core usage while your application runs. nvprof supports two metrics for Tensor Core utilization:

  • tensor_precision_fu_utilization: The utilization level of the multiprocessor function units that execute floating-point tensor core instructions on a scale of 0 to 10
  • tensor_int_fu_utilization: The utilization level of the multiprocessor function units that execute int8 tensor core instructions on a scale of 0 to 10

Here is an example output of running it on the HMMA cudaTensorCoreGemm sample:

$ sudo /usr/local/cuda/bin/nvprof --kernels compute_gemm --metrics tensor_precision_fu_utilization,tensor_int_fu_utilization ./cudaTensorCoreGemm
Initializing...
==24384== NVPROF is profiling process 24384, command: ./cudaTensorCoreGemm
GPU Device 0: "Xavier" with compute capability 7.2

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm 
==24384== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "compute_gemm(__half const *, __half const *, float const *, float*, float, float)" (done)
Time: 1086.695679 ms
TFLOPS: 0.13
==24384== Profiling application: ./cudaTensorCoreGemm
==24384== Profiling result:
==24384== Metric result:
Invocations                               Metric Name                           Metric Description         Min         Max         Avg
Device "Xavier (0)"
    Kernel: compute_gemm(__half const *, __half const *, float const *, float*, float, float)
          1           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Mid (5)     Mid (5)     Mid (5)
          1                 tensor_int_fu_utilization         Tensor-Int Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)

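Note that in this example the tensor_int_fu_utilization metric is reported as Idle (0), because the sample uses HMMA FP16 operations and not IMMA INT8 operations. A non-zero tensor_precision_fu_utilization value, Mid (5) here, is what confirms that the kernel is executing on the Tensor Cores.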
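
If you want to check this from a script instead of reading the table by eye, something along these lines might work. It is only a rough sketch: the helper names tensor_utilization and tensor_cores_used are made up for illustration, and it assumes the metric rows look like the ones in the output above (metric name in the second column, levels printed as "Label (N)"). nvprof prints its results to stderr by default, so you would need to capture that stream or use --log-file to get the text.

```python
import re

# Levels are printed as e.g. "Mid (5)" or "Idle (0)"; capture the number.
LEVEL = re.compile(r"\((\d+)\)")
# Hypothetical filter: the two metric names discussed in this thread.
METRIC = re.compile(r"tensor_\w+_fu_utilization")


def tensor_utilization(nvprof_text):
    """Return {metric_name: (min, max, avg)} parsed from nvprof metric rows.

    Assumes the row layout shown in this thread: invocation count,
    metric name, description, then Min / Max / Avg as "Label (N)".
    """
    results = {}
    for line in nvprof_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not METRIC.fullmatch(fields[1]):
            continue
        levels = [int(n) for n in LEVEL.findall(line)]
        if len(levels) >= 3:
            results[fields[1]] = tuple(levels[-3:])
    return results


def tensor_cores_used(nvprof_text):
    """True if any tensor metric reports a maximum level above Idle (0)."""
    return any(mx > 0 for _, mx, _ in tensor_utilization(nvprof_text).values())


if __name__ == "__main__":
    # Two rows copied from the profiling result above.
    sample = """
          1           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Mid (5)     Mid (5)     Mid (5)
          1                 tensor_int_fu_utilization         Tensor-Int Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    """
    print(tensor_utilization(sample))
    print(tensor_cores_used(sample))
```

With the two rows from the cudaTensorCoreGemm run, this reports level 5 for tensor_precision_fu_utilization and level 0 for tensor_int_fu_utilization, so tensor_cores_used returns True. If nvprof changes its column layout or you profile several kernels at once, the parsing would need adjusting, since this sketch keeps only the last row seen per metric name.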
