Summary
Using Nsight Compute to measure thread-level instruction counts can produce very different results depending on whether warp-level instruction counts are collected in the same run. This only seems to happen in certain circumstances, such as when one kernel calls a function from a separate object file while another kernel is also called from that same object file. In this case, when measuring just the thread-level instruction count, it seems as if only the instructions in the separate object file are counted.
Environment
| Component | Version |
|---|---|
| GPU | NVIDIA GeForce RTX 3080 Laptop GPU |
| NCU | 2025.4.1.0 (build 37053803) |
| CUDA | 13.1 |
| Driver | 590.48.01 |
Steps to reproduce
I have attached an MWE here: nsight-mwe.zip (3.1 KB). You can either run the provided convenience script, run.sh, or follow these steps:
- Build the example, e.g. using CMake.
- Measure just the thread-level metrics:
ncu --metrics sass__thread_inst_executed_true_per_opcode_with_modifier_all --print-metric-instances details ./build/mwe
- Measure both the thread-level and warp-level metrics:
ncu --metrics sass__thread_inst_executed_true_per_opcode_with_modifier_all,sass__inst_executed_per_opcode_with_modifier_all --print-metric-instances details ./build/mwe
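For anyone who wants to compare the two outputs side by side, the steps above can be sketched as a small script. This is not the contents of run.sh; the function names and log file names are made up for illustration, and it assumes the CMake project sits in the current directory and produces ./build/mwe as in the commands above.

```shell
# Metric names copied from the commands above.
THREAD_METRIC="sass__thread_inst_executed_true_per_opcode_with_modifier_all"
WARP_METRIC="sass__inst_executed_per_opcode_with_modifier_all"

# Run ncu on ./build/mwe with the metric list in $1, saving all output to $2.
# (run_ncu is a hypothetical helper name, not part of the MWE.)
run_ncu() {
  ncu --metrics "$1" --print-metric-instances details ./build/mwe > "$2" 2>&1
}

# Configure, build, then take both measurements.
# Assumes a CMakeLists.txt in the current directory; log names are arbitrary.
reproduce() {
  cmake -S . -B build || return 1
  cmake --build build || return 1
  run_ncu "$THREAD_METRIC" thread_only.log
  run_ncu "$THREAD_METRIC,$WARP_METRIC" thread_and_warp.log
}

# Usage (from the extracted MWE directory):
#   reproduce
```

After calling reproduce, thread_only.log and thread_and_warp.log can be compared by hand or with diff to see where the thread-level counts diverge.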
If you need any more details to reproduce this issue, feel free to let me know.
Expected behavior: both measurements report the same number of instructions.
Observed behavior: measuring just the thread-level instructions reports a total of 65536 instructions, whereas measuring both warp- and thread-level instructions reports a total of 720901 instructions.
When enabling link-time optimizations or when calling identical functions from main.cu instead of test.cu (an external object file), this difference vanishes and both measurements report a total of 720901 instructions.
Has anyone run into a similar issue? A reasonable workaround for now seems to be to just always measure the warp-level instruction counts, even if they aren’t needed.
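In case it is useful, here is a rough sketch of how that workaround could be wrapped so the warp-level metric is never forgotten. The helper names with_warp_metric and ncu_with_warp are hypothetical; the only ncu options used are the ones already shown in the commands above.

```shell
# Warp-level metric name copied from the commands above.
WARP_METRIC="sass__inst_executed_per_opcode_with_modifier_all"

# Print the comma-separated metric list in $1, appending the warp-level
# metric unless it is already present. (Hypothetical helper.)
with_warp_metric() {
  case ",$1," in
    *",$WARP_METRIC,"*) printf '%s\n' "$1" ;;
    *) printf '%s,%s\n' "$1" "$WARP_METRIC" ;;
  esac
}

# Hypothetical wrapper: first argument is the metric list, the remaining
# arguments are the application and its arguments.
ncu_with_warp() {
  metrics=$(with_warp_metric "$1")
  shift
  ncu --metrics "$metrics" --print-metric-instances details "$@"
}

# Usage:
#   ncu_with_warp sass__thread_inst_executed_true_per_opcode_with_modifier_all ./build/mwe
```

The extra warp-level rows in the output can simply be ignored if only the thread-level counts are of interest.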