Hi, I have an application that is roughly outlined as follows, where each step has an NVTX range annotation:
nvtxRangePush("Setup");
// CUDA setup code
nvtxRangePop();
nvtxRangePush("main_loop");
for (int idx=0; idx < loop_count; ++idx) {
nvtxRangePush("Iteration");
char label[32];
snprintf(label, sizeof(label), "loop %d", idx); // builds "loop 0", "loop 1", ... etc
nvtxMark(label); // emits one mark per iteration
// actual interesting work
nvtxRangePop();
}
nvtxRangePop();
nvtxRangePush("TearDown");
// tear down code
nvtxRangePop();
I can run nsys -t cuda,nvtx on the application without problems.
But sometimes I am only interested in profiling the main loop. So I tried using the flags:
nsys -t cuda,nvtx --capture-range=nvtx --nvtx-capture='main_loop' --nvtx-domain-include=default
But as soon as --capture-range=nvtx is specified, no profiling report is generated.
My question is: am I using the flags correctly?
An additional question: can I also match the mark string and, say, profile only a particular iteration?
/usr/local/bin/nsys --version
NVIDIA Nsight Systems version 2022.4.1.21-0db2c85
By default, --capture-range=nvtx checks only registered strings (checking registered strings has significantly lower overhead, and most NVTX annotations use them). To check all strings, set the environment variable NSYS_NVTX_PROFILER_REGISTER_ONLY=0.
There is an example in the documentation that uses the interactive CLI, but the same thing would work with all options under profile; see "Run application, start/stop collection using NVTX" under User Guide :: Nsight Systems Documentation.
@skottapalli, do you have any good suggestions for limiting the capture to a single iteration?
Holly’s suggestion to use NVTX registered strings is a good one. See NVTX C API Reference: String Registration for documentation on how to register a string.
If you don’t mind the performance degradation that comes with plain strings, you could add the --env-var=NSYS_NVTX_PROFILER_REGISTER_ONLY=0 switch to your nsys command line:
nsys -t cuda,nvtx --capture-range=nvtx --nvtx-capture='main_loop' --nvtx-domain-include=default
Another alternative is to put cudaProfilerStart and cudaProfilerStop API calls around the main loop you are interested in profiling and use the --capture-range=cudaProfilerApi switch.
Thank you for the information. Registering strings seems to be the way to go.