n/a for metrics

I ran an nv-nsight-cu-cli job to collect one metric. When I open the report in Nsight Compute, it shows n/a for all kernels. However, the file is quite large (90 MB). I see similar things for other metrics.
It is really strange. I have also tested 2019.4 and still have the problem.

I have uploaded one file at 2080ti..dram_read_transactions.nsight-cuprof-report
Also, a picture is available at Pasteboard - Uploaded Image
I see this message at the bottom:

1,Host,NVIDIA Nsight Compute,“SASS analysis failed. Some information might not be available (warning : Section ‘.text._Z15size_one_kernelIdEvPKvPvxxxx9fftKind_t10callback_t’, pc= 0x100, Indirect branch address must go via symbolic table base in constant memory (missing relocator information?)).”

Should I do something with the profiling command or compilation?
Can someone explain what is the issue?

Unfortunately, I can’t tell from your report why the metric collection failed for you.

The error message that you reported is a result of the cubin disassembly failing for at least one of your kernels. It failed because the disassembler can’t statically resolve some dynamic branch instructions in the code. This is unrelated to the dram__sectors_read metric being unavailable; it only affects some metrics, such as Live Registers on the Source page, which won’t be available when this message occurs.

Note that the report still contains one cubin (elf module) for each profiled kernel type. In addition, it contains other metrics such as the device attributes (which aren’t stored very efficiently yet). As a result, the report will still have a considerable size despite the metric you were interested in not being available.

To narrow down the problem, could you check

  • If you can profile any other application (such as a simple CUDA sample) on the same machine with Nsight Compute?
  • Does Gromacs execute properly when running without the profiler?
  • When running with Nsight Compute, does Gromacs show any unusual errors, indicating that the kernels failed to execute? Note that you can limit profiling to e.g. the first ten kernels by passing -c 10 to nv-nsight-cu-cli to speed up this testing.
  • Can you collect those metrics when running as root/sudo on the machine, or when the machine’s admin enables non-root profiling? Normally, this would be indicated by the error message described here: https://developer.nvidia.com/nvidia-development-tools-solutions-ERR_NVGPUCTRPERM-permission-issue-performance-counters

OK. I found something new.

For one of the metrics that showed n/a, atomic transactions, the metric document says the command should be

nv-nsight-cu-cli --quiet --metrics l1tex__t_set_accesses_pipe_lsu_mem_global_op_atom.sum+l1tex__t_set_accesses_pipe_lsu_mem_global_op_red.sum+l1tex__t_set_accesses_pipe_tex_mem_surface_op_atom.sum+l1tex__t_set_accesses_pipe_tex_mem_surface_op_red.sum -f -o 2080ti.atomic gmx mdrun -nb gpu -v -deffnm nvt_5k

That is shown as n/a, which can be seen at https://pasteboard.co/IGFbwO4.png
So, I decided to collect the individual metrics:

l1tex__t_set_accesses_pipe_lsu_mem_global_op_atom.sum = 0
l1tex__t_set_accesses_pipe_lsu_mem_global_op_red.sum = 613481
l1tex__t_set_accesses_pipe_tex_mem_surface_op_atom.sum = 0
l1tex__t_set_accesses_pipe_tex_mem_surface_op_red.sum = 0

I will continue with other metrics

Please note that you cannot collect multiple metrics connected with “+” in Nsight Compute. That computation is given as an example of how to combine individual Nsight Compute metrics to map to nvprof metrics, since sometimes they don’t match 1:1. To collect multiple metrics at once on the command line, separate them with a comma “,”, as stated in the documentation.
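As a rough illustration of the difference, the Python sketch below turns a “+”-joined expression from the comparison table into the comma-separated form for --metrics, then does the addition offline on the per-metric values reported earlier in this thread. The helper names `to_metrics_arg` and `combine` are hypothetical and not part of Nsight Compute.

```python
# Hypothetical helpers for illustration only; not part of Nsight Compute.

def to_metrics_arg(expression: str) -> str:
    """Turn 'a.sum+b.sum' from the comparison table into 'a.sum,b.sum'."""
    names = [name.strip() for name in expression.split("+") if name.strip()]
    return ",".join(names)


def combine(values: dict) -> int:
    """Add up individually collected metric values after profiling."""
    return sum(values.values())


# The "+"-joined expression as it appears in the metric comparison table.
expression = (
    "l1tex__t_set_accesses_pipe_lsu_mem_global_op_atom.sum+"
    "l1tex__t_set_accesses_pipe_lsu_mem_global_op_red.sum+"
    "l1tex__t_set_accesses_pipe_tex_mem_surface_op_atom.sum+"
    "l1tex__t_set_accesses_pipe_tex_mem_surface_op_red.sum"
)
metrics_arg = to_metrics_arg(expression)

# Per-metric values reported earlier in this thread.
collected = {
    "l1tex__t_set_accesses_pipe_lsu_mem_global_op_atom.sum": 0,
    "l1tex__t_set_accesses_pipe_lsu_mem_global_op_red.sum": 613481,
    "l1tex__t_set_accesses_pipe_tex_mem_surface_op_atom.sum": 0,
    "l1tex__t_set_accesses_pipe_tex_mem_surface_op_red.sum": 0,
}
total = combine(collected)

print("--metrics", metrics_arg)
print("atomic transactions (sum of the four metrics):", total)
```

The comma-separated string is what goes after --metrics; the sum is computed by you afterwards, not by the profiler.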

Furthermore, note that copying metric names from the comparison table can cause issues if the latest online version is not used. Former versions used non-printable characters to improve the formatting online and in the PDF, which meant that those characters, when pasted into the CLI, resulted in invalid metric names. If in doubt, please type the names by hand from the docs. Copying names from the CLI metric query (--query-metrics --chip ) has no such issue.
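If retyping is impractical, one option is to sanitize the pasted name before using it. The sketch below is only an illustration: it assumes the offending characters are invisible Unicode format characters such as the zero-width space (U+200B), which the post above does not confirm, and `clean_metric_name` is a hypothetical helper, not an Nsight Compute feature.

```python
import unicodedata


def clean_metric_name(pasted: str) -> str:
    """Drop format (Cf) and control (Cc) characters and any whitespace.

    Assumption: the invisible characters in older docs fall into these
    Unicode categories (for example U+200B, the zero-width space).
    """
    return "".join(
        ch
        for ch in pasted
        if unicodedata.category(ch) not in ("Cf", "Cc") and not ch.isspace()
    )


# A name as it might arrive from the clipboard, with zero-width spaces inside.
pasted = (
    "smsp\u200b_\u200b_sass\u200b_thread\u200b_inst\u200b_executed"
    "\u200b_op\u200b_fadd\u200b_pred\u200b_on.sum"
)
cleaned = clean_metric_name(pasted)

print(repr(pasted))   # the \u200b escapes make the hidden characters visible
print(repr(cleaned))
print("pure ASCII:", cleaned.isascii())
```

Printing with `repr` is a quick way to check whether a name that looks correct on screen actually contains hidden characters.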

Thanks. The non-printable characters were really bothersome…

@felix_dt
I have some updates. I have noticed that for some metrics, the smsp variant works but the sm variant is shown as n/a. I tried with version 2019.4.

I have uploaded two sets of analyses.

1- For shared_load_transactions, I collected smsp__inst_executed_op_shared_ld.sum and sm__inst_executed_op_shared_ld.sum
You can download the zip report from https://gofile.io/?c=ccoHjF

2- For SP FP instructions, I collected
smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,
smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,
smsp__sass_thread_inst_executed_op_fmul_pred_on.sum

and
sm__sass_thread_inst_executed_op_fadd_pred_on.sum,
sm__sass_thread_inst_executed_op_ffma_pred_on.sum,
sm__sass_thread_inst_executed_op_fmul_pred_on.sum

You can download the zip report from https://gofile.io/?c=7p58dH

I hope there is enough debug information in the report files for developers. Actually, I ran Gromacs. I have to say that such behavior may not be seen in other programs, so it is hard to find a program from the SDK for this purpose.

One more question:
The FP operations are calculated as fadd+fmul+2*ffma. Isn’t that true?
However, in the metric table, it is written as
smsp__sass_thread_inst_executed_op_fadd_pred_on.sum +
smsp__sass_thread_inst_executed_op_fmul_pred_on.sum +
smsp__sass_thread_inst_executed_op_ffma_pred_on.sum
Can you explain that? Is that a typo?
Or is the value that smsp__sass_thread_inst_executed_op_fmul_pred_on.sum gives us actually something already multiplied by 2?

Also, based on the definition of smsp, the correct number of operations (similar to nvprof) should be
4*(smsp_fadd + smsp_fmul + 2*smsp_ffma).
Am I right?

OK, I ran a test on a TitanV, where I can use both nvprof and Nsight Compute.
I used these two commands:

nvprof --kernels "kernel_name" --metrics \
inst_fp_32,flop_count_sp,flop_count_sp_add,flop_count_sp_fma,flop_count_sp_mul,flop_count_sp_special \
-f -o titanv.fp.nvvp --log-file nvvp.log GMX_COMMAND

and

nv-nsight-cu-cli --quiet --kernel-regex "kernel_name" --metrics \
smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum \
-f -o titanv.fp.nsight GMX_COMMAND

Please note that smsp is used for nsight.
Looking at results, I see:

nvprof:
add = 3815856
mul = 16651008
fma = 1387584
special = 0
operations = 23242032 which is add+mul+2*fma

nsight:
smsp__sass_thread_inst_executed_op_fadd_pred_on.sum = 3815856
smsp__sass_thread_inst_executed_op_fmul_pred_on.sum = 16651008
smsp__sass_thread_inst_executed_op_ffma_pred_on.sum = 1387584

As you can see, the smsp values in Nsight Compute are equal to those reported by nvprof.
So, I think the metric comparison table for FP operations in Nsight Compute should be updated.
The picture of the result can be seen at Pasteboard - Uploaded Image
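The arithmetic behind this comparison can be checked with a few lines of Python. This is only a sketch over the numbers posted above; `flop_count_sp` is a hypothetical helper that applies the add + mul + 2*fma rule, not a function of either profiler.

```python
# Values copied from the TitanV results in this post.
nvprof = {"add": 3815856, "mul": 16651008, "fma": 1387584, "special": 0}
nvprof_operations = 23242032

nsight = {
    "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum": 3815856,
    "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum": 16651008,
    "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum": 1387584,
}


def flop_count_sp(add: int, mul: int, fma: int) -> int:
    """Count each fused multiply-add as two floating-point operations."""
    return add + mul + 2 * fma


weighted = flop_count_sp(
    nsight["smsp__sass_thread_inst_executed_op_fadd_pred_on.sum"],
    nsight["smsp__sass_thread_inst_executed_op_fmul_pred_on.sum"],
    nsight["smsp__sass_thread_inst_executed_op_ffma_pred_on.sum"],
)
plain_sum = sum(nsight.values())

print("add + mul + 2*fma:", weighted)
print("add + mul + fma:  ", plain_sum)
print("matches nvprof operations:", weighted == nvprof_operations)
```

With these numbers, only the version that weights ffma by 2 reproduces the nvprof total, and no extra factor of 4 is needed because the .sum values already equal the nvprof counts.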

@felix_dt
Hi again,

I was able to reproduce n/a for the matrixMul example with the flop efficiency metric. Note that for the same input, nvprof shows fp efficiency as 26%.

Commands are:

nv-nsight-cu-cli --quiet \
--metrics smsp__thread_inst_executed_op_fadd_fmul_ffma.avg.pct_of_peak_sustained_elapsed \
-f -o titanv.nsight ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048

and

nvprof  --metrics flop_sp_efficiency -f -o titanv.nvvp \
--log-file nvvp.log ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048

As you can see in the picture https://pasteboard.co/IGNy3zj.png
nsight is n/a but nvprof is showing a number.

Could you get past this issue with nv-nsight-cu-cli?

Note this issue mentioned by Felix:
Furthermore, note that copying metric names from the comparison table can have issues if not the latest online version is used. Former versions used non-printable characters to improve the formatting online and in the pdf, which introduced the problem that those characters pasted to the CLI would result in invalid metric names. If in doubt, please type the names by hand from the docs. Copying names from the CLI metric query (--query-metrics --chip ) has no such issue.