Not getting NVTX events from child processes on Linux

I wrote a simple Python test program that spawns a single thread and a single child process, each using NVTX push/pop range calls. On Windows this works great: when I capture a run, I see the NVTX events for everything. When I capture the same program on Linux (for example under WSL), I only get the NVTX events from the main thread. The child thread appears in the capture, but without any events, and the child process ID isn't even mentioned in the diagnostic summary. Am I missing something, or does this not work on Linux?
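For readers who want to reproduce this, a test program of the kind described might look roughly like the sketch below. It is not the original program. It assumes the `nvtx` Python package (falling back to no-op ranges if it is not installed), and the names `traced_work` and `run_all` are made up for illustration. The start method is a parameter because, on Linux, `"fork"` creates the child without an exec while `"spawn"` launches a fresh interpreter, which matters for the fork-without-exec question raised later in the thread.

```python
import multiprocessing as mp
import os
import threading

try:
    import nvtx  # NVIDIA's Python NVTX bindings (assumed: pip install nvtx)
except ImportError:
    nvtx = None  # no-op fallback so the sketch still runs without the package


def traced_work(label):
    """Do a little CPU work inside an NVTX range and return the result."""
    if nvtx is not None:
        nvtx.push_range(label)
    try:
        return sum(i * i for i in range(100_000))
    finally:
        if nvtx is not None:
            nvtx.pop_range()


def run_all(start_method=None):
    """Run traced_work in a child process, a thread, and the main thread."""
    ctx = mp.get_context(start_method)
    child = ctx.Process(target=traced_work, args=("child-process",))
    child.start()  # started before the thread, so a fork sees only one thread
    worker = threading.Thread(target=traced_work, args=("child-thread",))
    worker.start()
    traced_work("main-thread")
    worker.join()
    child.join()
    return child.exitcode


if __name__ == "__main__":
    # Prefer "fork" where it exists, since that is the case in question.
    method = "fork" if "fork" in mp.get_all_start_methods() else "spawn"
    print("pid", os.getpid(), "start method", method, "exit code", run_all(method))
```

Running it once with `"fork"` and once with `"spawn"` under the profiler should show whether the missing child events depend on how the child is created.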


Are you using the CLI? Can you give me the command line you were using?

I ran into a similar problem while trying to profile a multiprocess app.
The profiler is launched as nsys profile -o file my_app. my_app doesn't perform any GPU computations itself, but it launches a large number of processes, and some of them do. With this command, nsys-ui shows no GPU computations (CPU sampling info for the child processes is still available, though). When I use my_app to launch individual processes under the profiler, everything works fine, and I can see GPU information (CUDA streams and operations) in nsys-ui.
Unfortunately, the documentation doesn't say clearly whether Nsight Systems can trace GPU usage in child processes, and I couldn't find a CLI option to set or change this behavior.
Nsight Systems version: 2021.1.3.
The environment is Ubuntu Linux 18.04.5 in nvidia-docker.

Is there any chance that you are using fork without exec?

If so, you will need to request that explicitly on Linux, either in the GUI or with --trace-fork-before-exec=true on the command line.
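As a rough illustration of where that flag goes, the sketch below assembles the command line from a script. The `nsys profile -o file my_app` form and the `--trace-fork-before-exec=true` flag are taken from this thread. The helper name `build_nsys_command` and its parameters are assumptions made up for the example. It only builds and prints the command, it does not run the profiler.

```python
import shlex


def build_nsys_command(app_cmd, output="file", trace_fork_before_exec=False):
    """Return the argv list for profiling app_cmd with `nsys profile`."""
    argv = ["nsys", "profile", "-o", output]
    if trace_fork_before_exec:
        # Flag quoted in the reply above: also trace forked children that
        # never call exec.
        argv.append("--trace-fork-before-exec=true")
    return argv + list(app_cmd)


if __name__ == "__main__":
    cmd = build_nsys_command(["my_app"], trace_fork_before_exec=True)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` unchanged if you want to launch the capture from Python.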

Thank you for the answer. The processes we want to trace are launched with exec, not just forked. Anyway, I tried --trace-fork-before-exec=true but it didn’t help.
I also updated the profiler to 2021.2 and tried the --gpu-metrics-device=all option. Still no luck: GPU traces for the processes were not available.

Just to confirm, do you mean:

  • CUDA trace is missing when my_app launches a large number of processes
  • CUDA trace works well when my_app launches only a single child process

Could you share report files for both scenarios?

Hello liuyis,

  1. When my_app launches a large number of processes and I trace my_app, CUDA traces are missing.
  2. When my_app launches a large number of processes and some of them are launched under the profiler (by modifying their command lines), CUDA trace is present for those processes.

I'm not sure I can provide a report. I have a few more hypotheses to check first and will try that.
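To make the second scenario concrete, here is a minimal sketch of what "modifying the command line" could look like in a launcher script. It is only an assumption about the setup, not the actual my_app code: `wrap_children`, `report_prefix`, and the worker commands are hypothetical names.

```python
def wrap_children(child_cmds, profiled, report_prefix="child"):
    """Prefix the selected child commands with `nsys profile`.

    child_cmds: list of argv lists, one per child process.
    profiled:   set of indices into child_cmds that should run under nsys.
    """
    wrapped = []
    for index, cmd in enumerate(child_cmds):
        if index in profiled:
            # Give each profiled child its own report name to avoid clashes.
            cmd = ["nsys", "profile", "-o", f"{report_prefix}_{index}"] + list(cmd)
        wrapped.append(list(cmd))
    return wrapped


if __name__ == "__main__":
    # Hypothetical worker commands; only the second one is profiled.
    cmds = [["./cpu_worker"], ["./gpu_worker", "--device", "0"]]
    for argv in wrap_children(cmds, profiled={1}):
        print(argv)
```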

Hi @a-sidorin, thanks for the clarification, this may be a bug in Nsight Systems.

Is it possible to reduce the number of processes that my_app launches, and see if CUDA trace works (in the first scenario)?

Is it possible to create a simple reproducer that could help us investigate further?

After debugging, I think the problem is caused by our intermediate scripts, not by the tool itself. Thank you all.
