Generating CUPTI_* tables with nsys

Ran the following command:

nsys profile -t cuda,nvtx,osrt,cublas -s none --cpuctxsw=none -f true -o /rockshare/user/tniro/nvidia/test -e NVLOG_CONFIG_FILE=/rockshare/user/tniro/nvidia/nvlog.config.template --export=sqlite ./gpu_burn 30

Log and report attached:
nsys-ui.log (367.6 KB)
test.nsys-rep (349.6 KB)

Reproducing the issue:

  1. Build container
    Dockerfile (1.4 KB)

  2. Run container:
    docker run --rm --gpus=all --cap-add=SYS_ADMIN -v $(pwd):/data -it mycontainer:latest bash

  3. Run from gpu-burn directory
    cd /opt/gpu-burn

  4. Run nsight
    nsys profile -t cuda,nvtx,osrt,cublas -s none --cpuctxsw=none -f true -o /data/test -e NVLOG_CONFIG_FILE=/data/nvlog.config.template --export=sqlite ./gpu_burn 30
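
To check whether a given run actually produced any CUPTI_* tables, the exported SQLite file can be inspected directly. The following is a minimal sketch, not an official tool: the helper name cupti_tables is made up for illustration, and it assumes the export from step 4 lands next to the report as /data/test.sqlite.

```python
import sqlite3


def cupti_tables(sqlite_path):
    """Return the sorted names of all tables whose name starts with CUPTI_."""
    # Open read-only so a mistyped path raises an error instead of
    # silently creating an empty database file.
    conn = sqlite3.connect(f"file:{sqlite_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    # Filter in Python rather than with LIKE, because '_' is a wildcard
    # in SQL LIKE patterns.
    return sorted(name for (name,) in rows if name.startswith("CUPTI_"))
```

Calling cupti_tables("/data/test.sqlite") (path assumed, see above) returns an empty list when no CUDA activity was recorded in the export.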

Thank you for sharing the repro steps. I am able to repro the bug on my end. We will investigate and report back soon.

Profiling other CUDA apps inside the container works as expected, so I think the problem is specific to the gpu_burn app. I am able to repro the bug on my Ubuntu machine with just:

> git clone https://github.com/wilicc/gpu-burn
> cd gpu-burn/
> CFLAGS="-g" LDFLAGS="-g" make
> nsys profile -t cuda,nvtx,osrt,cublas -s none --cpuctxsw=none ./gpu_burn 30

That’s interesting. Good to know. So just out of curiosity, how does the application affect what data the tool collects?

Hi tniro,

The reason you see traces when you only trace CUDA, but not when you add other trace features, is quite complex. In short, you need to add the --trace-fork-before-exec true option when you run the Nsight Systems CLI. The gory details are below.

The problem is the way the gpu_burn program works. The root process doesn’t submit any GPU work; the program creates additional processes for that. To create those processes, it calls fork, which makes a copy of the parent process image. But the child processes never call an exec function to execute a different program.
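
As a rough illustration of that process structure, here is a minimal sketch of the fork-without-exec idiom. It is not gpu_burn's actual code (gpu_burn is a C++/CUDA program); the function name run_workers and the use of the worker index as exit code are made up for the example.

```python
import os


def run_workers(count):
    """Fork `count` children that never call exec; return their exit codes."""
    pids = []
    for index in range(count):
        pid = os.fork()
        if pid == 0:
            # Child: still running a copy of this same program image.
            # A real worker would submit its GPU work here.
            os._exit(index)  # exit directly, skipping the parent's cleanup handlers
        pids.append(pid)
    # Parent: only coordinates and waits, it does none of the work itself.
    return [os.WEXITSTATUS(os.waitpid(pid, 0)[1]) for pid in pids]
```

The children never replace their program image, so from the profiler's point of view they are processes that have forked but not yet called exec.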

A process that has never called an exec function is subject to heavy restrictions if it was created from a multi-threaded process. Quoting the POSIX specification of fork:

A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.

The list of async-signal-safe functions is very short. For example, malloc is not async-signal-safe from a POSIX perspective.
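
The first sentence of the quote is easy to observe. The sketch below starts one extra thread in the parent, forks, and has the child report how many threads it sees. The helper thread and the function name are hypothetical, and the example only shows the thread count; it does not demonstrate the locking hazards that motivate the restriction.

```python
import os
import threading
import warnings


def thread_counts_across_fork():
    """Return (threads in the parent at fork time, threads in the child)."""
    stop = threading.Event()
    helper = threading.Thread(target=stop.wait)  # hypothetical extra thread
    helper.start()
    parent_count = threading.active_count()
    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        # Python 3.12+ warns when fork() is called from a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() was replicated.
        os.write(write_fd, bytes([threading.active_count()]))
        os._exit(0)
    os.close(write_fd)
    child_count = os.read(read_fd, 1)[0]
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    helper.join()
    return parent_count, child_count
```

The parent reports at least two threads while the child reports one, even though the child's address space still contains whatever state the other thread left behind.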

The gpu_burn root process is single-threaded, so its child processes should not be subject to those restrictions. The problem is that Nsight Systems creates at least one additional thread in each process it traces, for performance reasons. As a result, the gpu_burn child processes are subject to the async-signal-safety restriction when the program is being profiled. This has several implications:

  1. A program that relies on the fork-without-exec idiom might not work when being profiled. It might create processes from a single-threaded parent, in which case the developers don’t have to limit the child processes to async-signal-safe operations. When the program is being profiled, that premise no longer holds because of the extra thread(s) created. Thankfully, glibc tries to handle fork without exec as gracefully as possible, to a greater extent than POSIX specifies (e.g., in practice, malloc can be called safely).
  2. Nsight Systems cannot safely trace processes that never called an exec function because doing so requires calling non-async-signal-safe functions. For that reason, we have some internal logic to disable all tracing in the child process right after a call to fork. Tracing is re-enabled when an exec function is called. The --trace-fork-before-exec option can modify this behavior to allow tracing processes that never called an exec function.

When you only enable CUDA tracing, Nsight Systems’ injection libraries are only loaded when the CUDA driver initializes. That doesn’t happen in the gpu_burn root process because it doesn’t submit any GPU work. For that reason, the root process stays single-threaded and we can trace the child processes safely even if they never called an exec function.

On the other hand, when you enable OS runtime tracing (--trace osrt), Nsight Systems preloads its injection libraries. As a result, the gpu_burn root process is multi-threaded and the async-signal-safety restrictions apply to its child processes. For that reason, you won’t see any traces except from the root process, which doesn’t generate any GPU work.

This is why you see CUDA traces when you profile gpu_burn with --trace cuda but not with --trace cuda,osrt. As I said earlier, to remedy this problem, you’d have to additionally specify --trace-fork-before-exec true.
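
If the profiling run is scripted, the extra option can be added in one place. This is only a sketch: the helper names nsys_command and profile are made up, the other flags are copied from the repro command earlier in the thread, and the option is passed in the same space-separated form used in this post.

```python
import shutil
import subprocess


def nsys_command(app_args, traces=("cuda", "nvtx", "osrt", "cublas")):
    """Build an nsys command line that also traces fork-without-exec children."""
    return [
        "nsys", "profile",
        "-t", ",".join(traces),
        "-s", "none",
        "--cpuctxsw=none",
        "--trace-fork-before-exec", "true",  # option described in this thread
        *app_args,
    ]


def profile(app_args):
    """Run the profile if nsys is on PATH; return its exit code, else None."""
    if shutil.which("nsys") is None:
        return None
    return subprocess.run(nsys_command(app_args), check=False).returncode
```

For the case discussed here, the application arguments would be ["./gpu_burn", "30"], run from the gpu-burn directory.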

We have an internal ticket open to have the profiler report when a process wasn’t traced because it never called an exec function and came from a multi-threaded parent. However, this is quite complex to implement and has never been considered high priority compared to other tasks.

Excellent. Thanks.
T