Difference in thread-level instruction counts when also collecting warp-level instruction counts

Summary

Using Nsight Compute to measure thread-level instruction counts can produce very different results depending on whether warp-level instruction counts are collected in the same run. This only seems to happen in certain circumstances, such as when a kernel calls a function defined in a separate object file while another kernel is also called from that same object file. In this case, when only the thread-level instruction count is measured, it seems as if only the instructions in the separate object file are counted.
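To make the layout concrete, here is a rough sketch of the kind of code structure described above. It is not the content of the attached MWE: the file names main.cu and test.cu are the ones used in this post, but every function and kernel name (scale, kernel_in_test, launch_kernel_in_test, kernel_in_main) and the arithmetic are hypothetical, and the layout is only one reading of the description. The two parts are meant to be split into two files and built with relocatable device code (for example `nvcc -rdc=true main.cu test.cu -o mwe`), since a device function call across object files needs that.

```cuda
// ---- test.cu (hypothetical; compiled into its own object file) ----
#include <cuda_runtime.h>

// Device function defined in the separate object file.
__device__ int scale(int x) { return 2 * x + 1; }

// A second kernel that lives in the same object file as scale().
__global__ void kernel_in_test(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scale(i);
}

// Host-side launcher so main.cu does not need the kernel definition.
void launch_kernel_in_test(int *out, int n) {
    kernel_in_test<<<(n + 255) / 256, 256>>>(out, n);
}

// ---- main.cu (hypothetical) ----
#include <cstdio>
#include <cuda_runtime.h>

// Declarations of the symbols defined in test.cu.
extern __device__ int scale(int x);
void launch_kernel_in_test(int *out, int n);

// Kernel in main.cu that calls the device function from the other object file.
__global__ void kernel_in_main(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] += scale(i);
}

int main() {
    const int n = 1024;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return 1;

    launch_kernel_in_test(d_out, n);                     // kernel from test.cu
    kernel_in_main<<<(n + 255) / 256, 256>>>(d_out, n);  // kernel from main.cu

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) std::printf("CUDA error: %s\n", cudaGetErrorString(err));
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

I have not checked whether this exact sketch triggers the discrepancy; the attached MWE is the authoritative reproducer.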

Environment

Component   Version
GPU         NVIDIA GeForce RTX 3080 Laptop GPU
NCU         2025.4.1.0 (build 37053803)
CUDA        13.1
Driver      590.48.01

Steps to reproduce

I have attached an MWE here: nsight-mwe.zip (3.1 KB). You can either run the provided convenience script, run.sh, or follow these steps:

  1. Build the example, e.g. using CMake.
  2. Measure just the thread-level metrics: ncu --metrics sass__thread_inst_executed_true_per_opcode_with_modifier_all --print-metric-instances details ./build/mwe
  3. Measure both the thread-level and warp-level metrics: ncu --metrics sass__thread_inst_executed_true_per_opcode_with_modifier_all,sass__inst_executed_per_opcode_with_modifier_all --print-metric-instances details ./build/mwe

If you need any more details to reproduce this issue, feel free to let me know.

Expected behavior: both measurements report the same number of instructions.
Observed behavior: measuring just the thread-level instructions reports a total of 65536 instructions, whereas measuring both warp- and thread-level instructions reports a total of 720901 instructions.

When link-time optimization is enabled, or when identical functions are called from main.cu instead of test.cu (an external object file), this difference vanishes and both measurements report a total of 720901 instructions.

Has anyone run into a similar issue? A reasonable workaround for now seems to be to just always measure the warp-level instruction counts, even if they aren’t needed.