I wrote a simple Python test program that spawns a single thread and a single child process, each using NVTX push/pop range calls. On Windows this works great: I see the NVTX events for everything when I capture it running. But when I capture it running under Linux (such as WSL), I only get the NVTX events from the main thread. I see the child thread in the capture, but no events for it. I don’t even see the child process ID mentioned in the diagnostic summary. Am I missing something, or does this not work on Linux?
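For reference, a test program of the kind described above might look roughly like the sketch below. This is not the original poster's script. It assumes the `nvtx` Python package (with `nvtx.push_range` / `nvtx.pop_range`) and falls back to no-op stubs if that package is not installed; the function names `annotated_work` and `main` and the range labels are made up for illustration.

```python
import multiprocessing as mp
import threading

try:
    import nvtx  # assumption: NVIDIA's `nvtx` Python package is installed
except ImportError:
    # Fallback stubs so the sketch still runs without the package.
    class nvtx:  # hypothetical stand-in, records nothing
        @staticmethod
        def push_range(message=None, **kwargs):
            pass

        @staticmethod
        def pop_range(**kwargs):
            pass


def annotated_work(label):
    """Do a little CPU work inside an NVTX push/pop range named `label`."""
    nvtx.push_range(label)
    try:
        return sum(range(1_000))
    finally:
        # Always close the range, even if the work raises.
        nvtx.pop_range()


def main():
    # Range on the main thread.
    annotated_work("main-thread")

    # Range on a child thread of the same process.
    worker = threading.Thread(target=annotated_work, args=("child-thread",))
    worker.start()
    worker.join()

    # Range in a separate child process.
    child = mp.Process(target=annotated_work, args=("child-process",))
    child.start()
    child.join()
    return child.exitcode


if __name__ == "__main__":
    main()
```

Capturing this script should make it easy to compare which of the three ranges (main thread, child thread, child process) actually show up in the report on each platform.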
I ran into a similar problem while trying to profile a multiprocess app.
The profiler is launched as nsys profile -o file my_app. my_app doesn’t perform any GPU computations itself, but it launches a large number of processes, and some of them perform GPU computations. With this command, nsys-ui shows no GPU computations (CPU sampling info for the child processes is still available, though). When I use my_app to launch individual processes under the profiler, everything works fine, and I can see GPU information (CUDA streams and operations) in nsys-ui.
Unfortunately, the documentation doesn’t say clearly whether Nsight Systems can trace child process GPU usage. There is no CLI option to set or change this behavior either.
Nsight Systems version: 2021.1.3.
The environment is Ubuntu Linux 18.04.5 in nvidia-docker.
Thank you for the answer. The processes we want to trace are launched with exec, not just forked. Anyway, I tried --trace-fork-before-exec=true but it didn’t help.
I also tried updating the profiler to 2021.2 and using the --gpu-metrics-device=all option. Still no luck: GPU traces for the processes were not available.
When my_app launches a large number of processes and I trace my_app, CUDA traces are missing.
When my_app launches a large number of processes and some of them are launched under the profiler (by modifying the command line), the CUDA trace is present for these processes.
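The second setup above, where selected children are started under the profiler by modifying their command line, could be sketched as follows. This is only an illustration, not the actual my_app code. The helper names `wrap_with_nsys` and `launch_worker`, the `./gpu_worker` command, and the `worker_N` report names are all hypothetical; the only nsys options used are `profile`, `--trace=cuda,nvtx` and `-o`.

```python
import shlex
import shutil
import subprocess


def wrap_with_nsys(child_cmd, report_name, nsys="nsys"):
    """Return child_cmd prefixed so it runs in its own `nsys profile` session."""
    return [nsys, "profile", "--trace=cuda,nvtx", "-o", report_name, *child_cmd]


def launch_worker(child_cmd, index):
    """Start one worker, profiled if nsys is on PATH, unprofiled otherwise."""
    cmd = wrap_with_nsys(child_cmd, f"worker_{index}")
    if shutil.which(cmd[0]) is None:
        # nsys not found: fall back to the plain command.
        cmd = list(child_cmd)
    return subprocess.Popen(cmd)


if __name__ == "__main__":
    # Only print the wrapped command line here; ./gpu_worker is a placeholder.
    print(shlex.join(wrap_with_nsys(["./gpu_worker", "--batch", "8"], "worker_0")))
```

Each wrapped worker writes its own report file, so the results have to be opened separately in nsys-ui rather than as one combined timeline.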
I’m not sure whether I can provide a report. I have a few more hypotheses to check and will try those.