How to confirm whether the Tensor Cores are working or not

Hi Community member,

I have a question I would like to confirm.

Could you please tell me how to confirm whether the Tensor Cores are working or not?

Best regards.
Kaka

Hi Kaka, please refer to this topic: https://devtalk.nvidia.com/default/topic/1047176/jetson-agx-xavier/tenson-core-sample

Hi Dusty,

Thank you for your response.
But I could not find the answer to my question there.

Again, how can we confirm whether the Tensor Cores are working or not?
Is there any tool or status indicator to check it?

Best regards.
Kaka

Hi Kaka, that sample runs using Tensor Core HMMA operations, so it wouldn't run if the Tensor Cores weren't working properly for some reason (not that that should occur).

Hi Dusty,

Thank you for your support. I understand that the Tensor Cores are used when running this sample code.
But I would also like to confirm whether the Tensor Cores are being used when running our own original code.
Do you know a way to confirm that?

Best regards.
Kaka

Thanks Kaka, I understand now. You can use the nvprof CUDA profiler tool to capture the Tensor Core usage while your application runs. nvprof supports two metrics for Tensor Core utilization:

  • tensor_precision_fu_utilization: The utilization level of the multiprocessor function units that execute floating-point tensor core instructions on a scale of 0 to 10
  • tensor_int_fu_utilization: The utilization level of the multiprocessor function units that execute int8 tensor core instructions on a scale of 0 to 10

Here is an example output of running it on the HMMA cudaTensorCoreGemm sample:

$ sudo /usr/local/cuda/bin/nvprof --kernels compute_gemm --metrics tensor_precision_fu_utilization,tensor_int_fu_utilization ./cudaTensorCoreGemm
Initializing...
==24384== NVPROF is profiling process 24384, command: ./cudaTensorCoreGemm
GPU Device 0: "Xavier" with compute capability 7.2

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm 
==24384== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "compute_gemm(__half const *, __half const *, float const *, float*, float, float)" (done)
Time: 1086.695679 ms
TFLOPS: 0.13
==24384== Profiling application: ./cudaTensorCoreGemm
==24384== Profiling result:
==24384== Metric result:
Invocations                               Metric Name                           Metric Description         Min         Max         Avg
Device "Xavier (0)"
    Kernel: compute_gemm(__half const *, __half const *, float const *, float*, float, float)
          1           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Mid (5)     Mid (5)     Mid (5)
          1                 tensor_int_fu_utilization         Tensor-Int Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)

Note that in this example, the tensor_int_fu_utilization metric is shown as Idle, because the sample uses HMMA FP16 operations rather than IMMA INT8.

Hi Dusty,

Thank you for your response and information. I got it!

We would also like to measure the Tensor Core performance. Do you know of any sample code where the Tensor Cores can be enabled or disabled, so we can compare the two?

Best regards.
Kaka

If you are using TensorRT, TensorRT will automatically enable Tensor Cores, so they can't be disabled other than by running in FP32 instead of FP16/INT8.

If you are using cuDNN directly, you can choose not to have layers executed on the Tensor Cores by not specifying CUDNN_TENSOR_OP_MATH: http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor_ops

If you are using TensorFlow, it looks like there is a setting to disable Tensor Core usage in cuDNN: https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tf_disable_tensor_op_math

For cuBLAS, Tensor Cores are used through cublasGemmEx(), so use the regular cublas<t>gemm() functions (e.g. cublasSgemm()) if you don't want to use the Tensor Cores.