The reason you see traces when you only trace CUDA, but not when you add other trace features, is fairly involved. In short, you need to add the --trace-fork-before-exec true option when you run the Nsight Systems CLI. The gory details are below.
The problem is the way the gpu_burn program works. The root process doesn’t submit any GPU work itself. Instead, it creates additional processes for that by calling fork, which creates a copy of the parent process image. Those child processes never call an exec function to execute a different program.
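To make the pattern concrete, here is a minimal sketch of the fork-without-exec idiom described above. It is not gpu_burn's actual code: do_work, MAX_WORKERS and spawn_workers_without_exec are hypothetical names, and the "work" is just a placeholder value returned through the exit status.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WORKERS 16 /* hypothetical upper bound for this sketch */

/* Hypothetical stand-in for the GPU work a real child would submit. */
static int do_work(int worker_id)
{
    return worker_id + 1;
}

/*
 * Forks `count` children that keep running this same program image.
 * No exec function is ever called in the children.
 * Returns the number of children that exited with the expected status,
 * or -1 if the arguments are invalid or a fork failed.
 */
int spawn_workers_without_exec(int count)
{
    pid_t pids[MAX_WORKERS];
    int spawned = 0;
    int finished = 0;

    if (count < 0 || count > MAX_WORKERS)
        return -1;

    for (int i = 0; i < count; ++i) {
        pid_t pid = fork();
        if (pid < 0)
            break; /* stop spawning, but still reap what was created */
        if (pid == 0) {
            /* Child: a copy of the parent image that goes straight to work. */
            _exit(do_work(i));
        }
        pids[spawned++] = pid;
    }

    /* Parent: wait for every child and check its exit status. */
    for (int i = 0; i < spawned; ++i) {
        int status = 0;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == i + 1)
            ++finished;
    }
    return spawned == count ? finished : -1;
}
```

Calling spawn_workers_without_exec(3) from a main function forks three children. Each one runs do_work in the copied process image and exits without ever replacing that image through exec.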
A process that was created from a multi-threaded process and has not yet called an exec function is subject to heavy restrictions. Quoting the POSIX specification of fork:
A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.
The list of async-signal-safe functions is very short. For example, malloc is not async-signal-safe from a POSIX perspective.
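The "states of mutexes" part of the quote can be illustrated with a small sketch. It assumes Linux with glibc and a default (non-recursive) mutex. All names (shared_lock, helper, child_sees_locked_mutex) are made up for the illustration. Note that pthread_mutex_trylock is itself not async-signal-safe, so the child is doing something POSIX does not permit here; it is used only to peek at the copied state.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_held;   /* posted by the helper once it owns the mutex */
static sem_t may_release; /* posted by the main thread after the child exited */

/* Helper thread: holds the mutex while the main thread forks. */
static void *helper(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&shared_lock);
    sem_post(&lock_held);
    sem_wait(&may_release);
    pthread_mutex_unlock(&shared_lock);
    return NULL;
}

/*
 * Returns 1 if the forked child found the mutex locked, 0 if it did not,
 * and -1 on error.
 */
int child_sees_locked_mutex(void)
{
    pthread_t tid;
    int status = 0;
    int result = -1;

    if (sem_init(&lock_held, 0, 0) != 0 || sem_init(&may_release, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, helper, NULL) != 0)
        return -1;
    sem_wait(&lock_held); /* the helper owns the mutex from here on */

    pid_t pid = fork();
    if (pid == 0) {
        /*
         * Child: only the forking thread was replicated. The helper thread
         * does not exist here, so nobody will ever unlock the copied mutex.
         * Illustration only: trylock is not async-signal-safe.
         */
        int rc = pthread_mutex_trylock(&shared_lock);
        _exit(rc == EBUSY ? 1 : 0);
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    /* Let the helper release the mutex in the parent and clean up. */
    sem_post(&may_release);
    pthread_join(tid, NULL);
    sem_destroy(&lock_held);
    sem_destroy(&may_release);
    return result;
}
```

In this setup the child inherits a mutex that is locked by a thread that no longer exists. A blocking pthread_mutex_lock in the child would hang forever, which is the kind of error the async-signal-safety rule is meant to prevent. Older glibc versions may need -pthread at link time.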
The gpu_burn root process is single-threaded, so its child processes should not be subject to those restrictions. The problem is that Nsight Systems creates at least one additional thread in each process it traces, for performance reasons. As a result, the gpu_burn child processes are subject to the async-signal-safety restriction when the program is being profiled. This has several implications:
- A program that relies on the fork-without-exec idiom might not work when being profiled. It might create processes from a single-threaded parent, in which case the developers don’t have to limit themselves to async-signal-safe operations in the child processes. When the program is being profiled, that premise no longer holds because of the extra thread(s) the profiler creates. Thankfully, Glibc tries to handle fork without exec as gracefully as possible, to a greater extent than POSIX requires (e.g., in practice, malloc can be called safely).
- Nsight Systems cannot safely trace processes that never called an exec function because tracing requires calling non-async-signal-safe functions. For that reason, we have some internal logic to disable all tracing in the child process right after a call to fork. Tracing is re-enabled when an exec function is called. The --trace-fork-before-exec option can modify this behavior to allow tracing processes that never called an exec function.
When you only enable CUDA tracing, Nsight Systems’ injection libraries are only loaded when the CUDA driver initializes. That doesn’t happen in the gpu_burn root process because it doesn’t submit any GPU work. For that reason, the root process stays single-threaded and we can trace the child processes safely even if they never called an exec function.
On the other hand, when you enable OS runtime tracing (--trace osrt), Nsight Systems preloads its injection libraries. As a result, the gpu_burn root process is multi-threaded and the async-signal-safety restrictions apply to its child processes. For that reason, you won’t see any traces except from the root process, which doesn’t generate any GPU work.
This is why you see CUDA traces when you profile with --trace cuda but not with --trace cuda,osrt. As I said earlier, to remedy this problem, you’d have to additionally specify --trace-fork-before-exec true.
We have an internal ticket open to have the profiler report when a process wasn’t traced because it never called an exec function and came from a multi-threaded parent. However, this is quite complex to implement and has never been considered high priority compared to other tasks.