Using capture-range=nvtx

Hi, I have an application that is roughly outlined as follows; each step has an NVTX range annotation:

nvtxRangePush("Setup");
// CUDA setup code
nvtxRangePop();

nvtxRangePush("main_loop");
for (int idx=0; idx < loop_count; ++idx) {
  nvtxRangePush("Iteration");
  nvtxMark("loop {idx}"); // generates "loop 0", "loop 1", ... etc marks
  // actual interesting work
  nvtxRangePop();
}
nvtxRangePop();

nvtxRangePush("TearDown");
// tear down code
nvtxRangePop();

I can run

nsys -t cuda,nvtx

on the application without problem.

But sometimes I am only interested in profiling the main loop. So I tried using the flags:

nsys -t cuda,nvtx --capture-range=nvtx --nvtx-capture='main_loop' --nvtx-domain-include=default

But basically, as soon as --capture-range=nvtx is specified, no profiling report is generated.

My question is: am I using the flags correctly?

An additional question: can I also match the mark string and, say, profile only a particular iteration?

/usr/local/bin/nsys  --version
NVIDIA Nsight Systems version 2022.4.1.21-0db2c85

By default, --capture-range=nvtx only checks for registered strings (because checking registered strings has significantly lower overhead, and most NVTX annotations use them). To check all strings, set the NSYS_NVTX_PROFILER_REGISTER_ONLY=0 environment variable.

There is an example given in the documentation using the interactive CLI, but the same thing would work with all options under profile; see "Run application, start/stop collection using NVTX" under User Guide :: Nsight Systems Documentation.

@skottapalli, do you have good suggestions for limiting to a single iteration?

Holly’s suggestion to use NVTX registered strings is a good one. See NVTX C API Reference: String Registration for documentation on how to register a string.

If you don’t mind the performance degradation that comes with plain strings, you could add the --env-var=NSYS_NVTX_PROFILER_REGISTER_ONLY=0 switch to your nsys command line:

nsys -t cuda,nvtx --capture-range=nvtx --nvtx-capture='main_loop' --nvtx-domain-include=default --env-var=NSYS_NVTX_PROFILER_REGISTER_ONLY=0

Another alternative is to put cudaProfilerStart and cudaProfilerStop API calls around the main loop you are interested in profiling and use --capture-range=cudaProfilerApi switch.

Thank you for the information. Registering strings seems to be the way to go.