Hi tniro,
The reason you see traces when you trace only CUDA, but not when you add other trace features, is fairly involved. In short, you need to add the --trace-fork-before-exec true option when you run the Nsight Systems’ CLI. The gory details are below.
The problem is the way the gpu_burn program works. The root process doesn’t submit any GPU work; the program creates additional processes for that. To create those processes, it calls fork to create a copy of the parent process image. But the child processes never call an exec function to execute a different program.
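To make the fork without exec pattern concrete, here is a minimal Python sketch of it. It is not gpu_burn's actual code; the function name run_forked_workers and the use of the worker index as exit code are made up for illustration. The parent only forks and waits, and each child keeps running the same program image until it exits.

```python
import os


def run_forked_workers(num_workers):
    """Fork num_workers children that never call exec; return their exit codes."""
    pids = []
    for index in range(num_workers):
        pid = os.fork()
        if pid == 0:
            # Child: a copy of the parent image. It keeps running this same
            # program (no exec call); the real work would happen here.
            os._exit(index)  # exit immediately, skipping the parent's cleanup handlers
        pids.append(pid)

    # Parent: does none of the work itself, it only waits for the children.
    exit_codes = []
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        exit_codes.append(os.WEXITSTATUS(status))
    return exit_codes


if __name__ == "__main__":
    print(run_forked_workers(3))
```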
A process that has never called an exec function is subject to heavy restrictions when it was created from a multi-threaded process. Quoting the POSIX specification of fork:
A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.
The list of async-signal-safe functions is very short. For example, malloc is not async-signal-safe from a POSIX perspective.
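The "replica of the calling thread" part of the quote can be observed directly. The sketch below is only an illustration under a few assumptions: it uses Python rather than C, the helper thread is a stand-in for any extra thread living in the parent, and the function name thread_counts_across_fork is made up. Note that the child here does things that are not async-signal-safe, which happens to work on Linux but is exactly what POSIX warns about; recent Python versions emit a DeprecationWarning for this pattern.

```python
import os
import threading


def thread_counts_across_fork():
    """Return (threads_in_parent, threads_in_child) around a fork without exec."""
    stop = threading.Event()
    helper = threading.Thread(target=stop.wait)  # stand-in for an extra thread
    helper.start()

    read_fd, write_fd = os.pipe()
    parent_count = threading.active_count()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork was replicated.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)

    # Parent: read the child's thread count, then clean up.
    os.close(write_fd)
    child_count = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    helper.join()
    return parent_count, child_count


if __name__ == "__main__":
    print(thread_counts_across_fork())
```

The parent reports at least two threads while the child reports one. Any lock held by the helper thread at the time of the fork would stay locked forever in the child, which is why the restriction exists.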
The gpu_burn root process is single-threaded, so its child processes should not be subject to those restrictions. The problem is that Nsight Systems creates at least one additional thread in each process it traces, for performance reasons. As a result, the gpu_burn child processes are subject to the async-signal-safety restriction when the program is being profiled. This has several implications:
- A program that relies on the fork without exec idiom might not work when being profiled. Such a program may create processes from a single-threaded parent, in which case its developers don’t have to limit the child processes to async-signal-safe operations. When the program is being profiled, that premise no longer holds because of the extra thread(s) created. Thankfully, glibc tries to handle fork without exec as gracefully as possible, to a greater extent than POSIX specifies (e.g., in practice, malloc can be called safely).
- Nsight Systems cannot safely trace processes that never called an exec function because doing so requires calling non-async-signal-safe functions. For that reason, we have some internal logic to disable all tracing in the child process right after a call to fork. Tracing is re-enabled when an exec function is called. The --trace-fork-before-exec option can modify this behavior to allow tracing processes that never called an exec function.
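The behavior described in the second point can be pictured with a toy model. This is not Nsight Systems' actual code: CONFIG, tracing_enabled and child_tracing_state are hypothetical names, and Python's os.register_at_fork merely stands in for whatever mechanism the profiler really uses. The re-enabling of tracing at exec is not modeled.

```python
import os

# Hypothetical stand-in for the --trace-fork-before-exec CLI option.
CONFIG = {"trace_fork_before_exec": False}
tracing_enabled = True


def _after_fork_in_child():
    """Runs in the child right after fork: turn tracing off unless overridden."""
    global tracing_enabled
    if not CONFIG["trace_fork_before_exec"]:
        tracing_enabled = False


os.register_at_fork(after_in_child=_after_fork_in_child)


def child_tracing_state():
    """Fork a child that never calls exec and report whether it would be traced."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        os.write(write_fd, b"1" if tracing_enabled else b"0")
        os._exit(0)
    os.close(write_fd)
    traced = os.read(read_fd, 1) == b"1"
    os.close(read_fd)
    os.waitpid(pid, 0)
    return traced


if __name__ == "__main__":
    print(child_tracing_state())           # default: child is not traced
    CONFIG["trace_fork_before_exec"] = True
    print(child_tracing_state())           # override: child stays traced
```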
When you only enable CUDA tracing, Nsight Systems’ injection libraries are loaded only when the CUDA driver initializes. That doesn’t happen in the gpu_burn root process because it doesn’t submit any GPU work. For that reason, the root process stays single-threaded and we can trace the child processes safely even if they never call an exec function.
On the other hand, when you enable OS runtime tracing (--trace osrt), Nsight Systems preloads its injection libraries. As a result, the gpu_burn root process also gets the extra thread(s), becomes multi-threaded, and the async-signal-safety restrictions apply to its child processes. For that reason, you won’t see any traces except from the root process, which doesn’t generate any GPU work.
This is why you see CUDA traces when you profile gpu_burn with --trace cuda but not with --trace cuda,osrt. As I said earlier, to remedy this problem, you have to additionally specify --trace-fork-before-exec true.
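For reference, here is a small sketch that assembles the resulting command line. It only uses the options mentioned in this post together with the nsys profile subcommand; the helper name build_nsys_command and the ./gpu_burn path are assumptions, so adjust them to your setup and add whatever arguments you normally pass to gpu_burn.

```python
import shlex


def build_nsys_command(app_argv, traces=("cuda", "osrt"), trace_fork_before_exec=True):
    """Build an nsys argument list for profiling a fork-without-exec program."""
    command = ["nsys", "profile", "--trace", ",".join(traces)]
    if trace_fork_before_exec:
        # Allow tracing child processes that never call an exec function.
        command += ["--trace-fork-before-exec", "true"]
    return command + list(app_argv)


if __name__ == "__main__":
    # Print the command instead of running it, so this works without nsys installed.
    print(shlex.join(build_nsys_command(["./gpu_burn"])))
```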
We have an internal ticket open to have the profiler report when a process wasn’t traced because it never called an exec function and came from a multi-threaded parent. But this is actually quite complex to do, and it was never considered high priority compared to other tasks.