Metrics smsp__sass_thread_inst_executed_op* return n/a

System: Ubuntu 16.04
Driver: 418.56 (installed through apt-get PPA)
CUDA toolkit: 10.1.105 with Nsight Compute 2019.3
Benchmark: VectorAdd in CUDA samples

When I ran nv-nsight-cu-cli --query-metrics, I could see metrics of the form smsp__sass_thread_inst_executed_op_*. However, when I tried capturing those metrics with nv-nsight-cu-cli --metrics <smsp...> ./vectorAdd, the profiler reported “(!)n/a”. When I ran without --metrics, or with the predefined section files shipped in the Nsight Compute package, other performance counters were printed with numerical results.
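For example, one concrete invocation that fails this way (the metric name is just one entry picked from the --query-metrics output):

nv-nsight-cu-cli --metrics smsp__sass_thread_inst_executed_op_fadd_pred_on.sum ./vectorAdd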

Are those metrics smsp__sass_thread_inst_executed_op* actually available in Nsight Compute?

Thanks!

Which GPU are you using?

Thanks for your prompt reply! It’s a GeForce RTX 2070.

This issue was already mentioned in another user’s post, so I’ll post the same answer here:

Those metrics were enabled in our measurement library and correctly added to the documentation, but we missed actually enabling this feature of the measurement library in the tool. We will fix this soon in a future release.

In the meantime, you might be able to use the “Executed Instruction Mix” chart of the Instruction Statistics (InstructionStats) section as a workaround. You can collect this section either on the command line or in the UI, but the chart can only be viewed in the UI. When using the command line, the section should be collected by default, otherwise you can enable it using --section InstructionStats.
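For example:

nv-nsight-cu-cli --section InstructionStats ./vectorAdd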

Thanks for your reply!

Could you possibly give a rough timeline for when these metrics will be integrated?

Regarding the Executed Instruction Mix chart: is sass__inst_executed_per_opcode some magic metric that exists only for drawing charts in the Nsight Compute UI? Is it possible to dump the raw data on the command line as well?

I’ll add that these metrics are very important, and I would also like to see them implemented in nv-nsight-cu-cli as soon as possible (especially since the nvprof equivalents don’t work on the latest GPUs).

I’ve found that, across a range of kernels, the sum of the instructions executed per pipe is very close to the total instructions executed. This is how I’ve been getting around not having the actual sass__inst_executed_per_opcode breakdown in the CLI:

  Metrics {
    Label: "Executed Instructions - Pipeline ADU"
    Name: "sm__inst_executed_pipe_adu.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline ALU"
    Name: "sm__inst_executed_pipe_alu.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline CBU"
    Name: "sm__inst_executed_pipe_cbu.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline FMA"
    Name: "sm__inst_executed_pipe_fma.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline FP16"
    Name: "sm__inst_executed_pipe_fp16.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline FP64"
    Name: "sm__inst_executed_pipe_fp64.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline IPA"
    Name: "sm__inst_executed_pipe_ipa.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline LSU"
    Name: "sm__inst_executed_pipe_lsu.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline Tensor"
    Name: "sm__inst_executed_pipe_tensor.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline TensorOpHMMA"
    Name: "sm__inst_executed_pipe_tensor_op_hmma.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline TensorOpIMMA"
    Name: "sm__inst_executed_pipe_tensor_op_imma.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline TEX"
    Name: "sm__inst_executed_pipe_tex.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline Uniform"
    Name: "sm__inst_executed_pipe_uniform.sum"
  }
  Metrics {
    Label: "Executed Instructions - Pipeline XU"
    Name: "sm__inst_executed_pipe_xu.sum"
  }
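For example, a single run that collects these alongside the total, so you can compare the sum yourself (a sketch; trim the list to the pipes that exist on your GPU):

nv-nsight-cu-cli --metrics sm__inst_executed.sum,sm__inst_executed_pipe_alu.sum,sm__inst_executed_pipe_fma.sum,sm__inst_executed_pipe_lsu.sum,sm__inst_executed_pipe_xu.sum ./vectorAdd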

Edit: It would be useful to know which instructions correspond to which pipelines. Some are clear, but some like ‘Uniform’ and ‘XU’ are not as clear.

An explanation of these metrics would also be helpful to me.

This source has a list of instructions per path/pipeline: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#turing

It has a list for the uniform datapath, and many of the control instructions should map to the CBU (branch unit).

I can’t vouch for correctness, but https://arxiv.org/pdf/1903.07486.pdf (3.5.2) describes the uniform datapath.

I believe XU is a newer name for what used to be called the “special-function unit” (SFU); the corresponding SASS instruction is MUFU (“multi-function”).
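As a quick sanity check of that guess (a sketch, not authoritative — the kernel name and setup here are hypothetical, just for illustration): a kernel built around fast-math intrinsics should report most of its math instructions under sm__inst_executed_pipe_xu.sum, since __sinf and rsqrtf compile to MUFU.SIN / MUFU.RSQ.

__global__ void mufu_heavy(float* out, const float* in, int n)
{
    // __sinf and rsqrtf lower to MUFU.SIN / MUFU.RSQ, which (if the
    // guess above is right) execute on the XU/SFU pipe.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]) * rsqrtf(in[i] * in[i] + 1.0f);
}

Profiling a launch of this kernel with --metrics sm__inst_executed_pipe_xu.sum should then show nonzero counts on that pipe.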