How to verify that Tensorflow w/ AMP is using tensor cores

I am trying to verify that my 2080 Ti's tensor cores are being used when running AMP on the official TensorFlow ResNet benchmarks (I'm checking because I see a slowdown with AMP vs. standard fp32).

I have been running commands like:

nv-nsight-cu-cli --target-processes all -k volta_sgemm_128x64_nt -c 1 --metrics ... python tf_benchmark_wrapper.py

But it’s not clear what metrics I should be paying attention to. The

sm__inst_executed_pipe_hmmafp32_sum

metric given at https://devblogs.nvidia.com/using-nsight-compute-nvprof-mixed-precision-deep-learning-models/ is not listed by

--list-metrics (Edit: I meant --query-metrics)

and I presume it’s been deprecated.

Passing in a large list of metrics, I see output like:

    sm__inst_executed_pipe_tensor_op_hmma.max                                         inst                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.pct_of_peak_burst_active                   %                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.pct_of_peak_burst_elapsed                  %                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.pct_of_peak_burst_frame                    %                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.pct_of_peak_burst_region                   %                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.pct_of_peak_sustained_active               %                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.pct_of_peak_sustained_elapsed              %                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.pct_of_peak_sustained_frame                %                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.pct_of_peak_sustained_region               %                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.peak_burst                        inst/cycle                              4
    sm__inst_executed_pipe_tensor_op_hmma.max.peak_sustained                    inst/cycle                              1
    sm__inst_executed_pipe_tensor_op_hmma.max.per_cycle_active                  inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.per_cycle_elapsed                 inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.per_cycle_in_frame                inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.per_cycle_in_region               inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.max.per_second                       inst/second                              0
    sm__inst_executed_pipe_tensor_op_hmma.min                                         inst                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.pct_of_peak_burst_active                   %                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.pct_of_peak_burst_elapsed                  %                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.pct_of_peak_burst_frame                    %                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.pct_of_peak_burst_region                   %                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.pct_of_peak_sustained_active               %                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.pct_of_peak_sustained_elapsed              %                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.pct_of_peak_sustained_frame                %                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.pct_of_peak_sustained_region               %                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.peak_burst                        inst/cycle                              4
    sm__inst_executed_pipe_tensor_op_hmma.min.peak_sustained                    inst/cycle                              1
    sm__inst_executed_pipe_tensor_op_hmma.min.per_cycle_active                  inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.per_cycle_elapsed                 inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.per_cycle_in_frame                inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.per_cycle_in_region               inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.min.per_second                       inst/second                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum                                         inst                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.pct_of_peak_burst_active                   %                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.pct_of_peak_burst_elapsed                  %                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.pct_of_peak_burst_frame                    %                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.pct_of_peak_burst_region                   %                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.pct_of_peak_sustained_active               %                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.pct_of_peak_sustained_elapsed              %                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.pct_of_peak_sustained_frame                %                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.pct_of_peak_sustained_region               %                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.peak_burst                        inst/cycle                            272
    sm__inst_executed_pipe_tensor_op_hmma.sum.peak_sustained                    inst/cycle                             68
    sm__inst_executed_pipe_tensor_op_hmma.sum.per_cycle_active                  inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.per_cycle_elapsed                 inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.per_cycle_in_frame                inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.per_cycle_in_region               inst/cycle                              0
    sm__inst_executed_pipe_tensor_op_hmma.sum.per_second                       inst/second                              0

And it’s not clear how

.sum

is different from

.sum.peak_burst

nor if that’s the right metric family to be looking at.

Where can I find more information about these metrics? Is there a clear signal for “yes, AMP is working and using tensor cores properly”?

Follow-up: if there's an easy way to tell "yes, we're using tensor cores here" from Nsight Systems (esp. on a per-kernel basis), that would be much better for my use case.

I suspect that tensor cores are being used but something about the network structure isn’t conducive to AMP.
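For context, AMP here means TensorFlow's automatic mixed precision graph rewrite (the grappler auto_mixed_precision pass). As a reference, a minimal sketch of the two standard ways to turn it on at this TF version; illustrative only, not my exact benchmark setup:

import tensorflow as tf

# Option 1: wrap the optimizer (TF 1.14+ API). This marks the graph so the
# auto_mixed_precision grappler pass rewrites eligible ops to float16 and
# adds automatic loss scaling.
opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

# Option 2: no code changes; enable the same rewrite via an environment variable:
#   export TF_ENABLE_AUTO_MIXED_PRECISION=1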

In your case (Turing), the correct metric for checking tensor core usage would be

sm__inst_executed_pipe_tensor_op_hmma.sum

(the total number of warp instructions executed by the tensor_op_hmma pipe)
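For example (illustrative, just swapping that metric name into the command shape from your post), something like

nv-nsight-cu-cli --target-processes all -k volta_sgemm_128x64_nt -c 1 --metrics sm__inst_executed_pipe_tensor_op_hmma.sum python tf_benchmark_wrapper.py

should report that counter for the kernels you filter on; any non-zero value means the kernel executed tensor core (HMMA) instructions.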

In older versions of Nsight Compute, which were current when the blog post you referenced was written, we used a different set of metrics for Volta GPUs (the chip described in the blog post), and for that metric set the name would have been

sm__inst_executed_pipe_hmmafp32_sum

You can find information about which metric set is used for which chip in Nsight Compute here: https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html#gpu-support. Long term, we plan to use only a single set for all GPU architectures.

The difference between .sum and .sum.peak_burst is described here: https://docs.nvidia.com/cupti/Cupti/r_main.html#r_host_derived_metrics_api
Basically, the .peak_burst suffix describes the maximum possible value (peak) of this metric during short periods (burst).
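Reading the output you pasted that way: the .max.peak_sustained and .max.peak_burst values (1 and 4 inst/cycle) are the per-SM capability of that pipe, and the .sum rollup aggregates them across the 68 SMs of a 2080 Ti, giving 68 and 272 inst/cycle. The plain .sum value (0 inst in your output) is the number of HMMA warp instructions actually executed, so that is the number that tells you whether tensor cores did any work; the peak_* sub-metrics only describe the hardware limit you could compare it against.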

As for

--list-metrics

, this option shows the metrics that are currently selected for profiling, e.g. due to the set of currently available section files. To see the list of possible metrics, you would use the option

--query-metrics

. See https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#command-line-options-profile for details.
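For example (assuming a standard Linux shell), you can narrow the query output down to the tensor pipe counters with something like

nv-nsight-cu-cli --query-metrics | grep tensor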

Thanks for the helpful links, and yes, I meant --query-metrics.

To my larger question, is there an easy way to tell if AMP is working, in terms of running on the tensor cores? While I realize that TF nodes and kernel calls aren’t 1:1, I’m seeing logs like:

2019-06-05 11:11:16.234748: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1723] Converted 529/2864 nodes to float16 precision using 3 cast(s) to float16 (excluding Const and Variable casts)

This implies to me that a large number of calls won't be eligible. Is there a better way to find out what's going on beyond looking at the list of kernel calls in Nsight Systems, and calling Nsight Compute on each one that mentions __half?
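One thing I'm considering trying (an assumption on my part, not something I've confirmed in the docs): since that log line comes from auto_mixed_precision.cc, raising that module's verbosity through TensorFlow's standard TF_CPP_VMODULE mechanism should print more detail about which nodes the pass did or did not convert, e.g.

TF_CPP_VMODULE=auto_mixed_precision=2 python tf_benchmark_wrapper.py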

Are there minimum driver versions that I need?

$ nvidia-smi
Thu Jun  6 10:31:33 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
| 10%   55C    P8    43W / 250W |    171MiB / 10986MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1186      G   /usr/lib/xorg/Xorg                           169MiB |
+-----------------------------------------------------------------------------+