Generating CUPTI_* tables with nsys

That will stop you from getting GPU metrics, but you should still have the CUDA information. Yet you don’t have any CUDA tables, and the diagnostics say it didn’t hit CUDA.

Okay, what version of nsys are you looking at? What was the exact command line you used?

Version:
NVIDIA Nsight Systems version 2022.5.1.82-32078057v0

Command:
nsys profile -t cuda,nvtx,osrt,cublas -f true -o /data/results2 --gpu-metrics-device=all --cuda-memory-usage=true --export=sqlite ./gpu_burn 10

How long did profiling run (how long did the “gpu_burn 10” take)?

@skottapalli can you take a look at this?

gpu_burn 10 runs for 10 seconds.

Sometimes we see things like this when the application is so brief that we miss the kernels, but that isn’t the case here.

Hi tniro,

Could you try just the following command?

nsys profile -t cuda -s none --cpuctxsw=none -f true -o /data/results2 ./gpu_burn 10

This will collect just the CUDA traces (on CPU and GPU side). Could you share the report file (privately, if needed)?

What is the output of the nvidia-smi command on the host system?

Unfortunately, I’m on vacation until the new year. I had some issues with the CentOS container I was using (see the Dockerfile above) when the system was updated from driver 520 to 525. Running the application (gpu_burn) in an Ubuntu container worked; in CentOS, however, it stopped working after the driver update. I’ll be able to investigate further once I’m back from vacation. Here is the current nvidia-smi output:

nvidia-smi
Mon Dec 19 09:30:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE…    Off  | 00000000:25:00.0 Off |                    0 |
| N/A   35C    P0    37W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE…    Off  | 00000000:E2:00.0 Off |                    0 |
| N/A   34C    P0    36W / 250W |      0MiB / 32768MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

T.

Back from vacation. I’ve run with the following command:

 nsys profile  -t cuda,nvtx,osrt,cublas -f true -o /data/results2 --gpu-metrics-device=all --cuda-memory-usage=true --export=sqlite ./gpu_burn 30

I’ve attached the report file:
results2.nsys-rep (10.3 MB)

From the report, it looks like you used the command as I requested.
nsys profile -t cuda -s none --cpuctxsw=none -f true -o /data/results2 ./gpu_burn 30

CUDA kernels and API calls are present in the report you shared. What is the problem you are facing now? If you add the other features (-t cuda,nvtx,osrt,cublas) back to the command line, are the CUDA kernels missing from the report? If so, we will need to isolate which feature is actually causing the problem. Please try different combinations (for example, profile with -t cuda,nvtx first and, if that works, profile with -t cuda,nvtx,osrt, and so on).
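Something like the loop below could automate that sweep (the paths and the 10-second duration are illustrative; adjust them to your setup):

for t in cuda cuda,nvtx cuda,nvtx,osrt cuda,nvtx,osrt,cublas; do
    nsys profile -t "$t" -f true -o "/data/results_${t//,/_}" --export=sqlite ./gpu_burn 10
done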


If I just specify -t cuda, then I get the expected CUPTI tables. However, when I add any of the other features alongside cuda, the tables don’t show up in the SQLite file. I tried “-t cuda,nvtx”, “-t cuda,osrt”, “-t cuda,cublas”, and not specifying the -t option at all (default).
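For reference, I’m checking for the tables roughly like this (a quick check, assuming the export lands next to the report, e.g. /data/results2.sqlite):

sqlite3 /data/results2.sqlite ".tables" | grep -i cupti

With only -t cuda this lists the CUPTI_* tables (e.g. CUPTI_ACTIVITY_KIND_KERNEL); with any of the other options added, nothing CUPTI-related comes back.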

nvlog.config.template (648 Bytes)
I see. Thanks for the update. Could you try the options without cuda and check whether you get any events in the report? It is possible that there is a bug in the CUDA tracing feature when it is combined with the other options. We will need to repro it on our end to investigate further.

Could you help collect logs when you see the problem?

  1. Save the nvlog.config.template file that is attached to the target system.
  2. Add the CLI switch -e NVLOG_CONFIG_FILE=/full/path/to/nvlog.config.template to your command line.
  3. Run the collection.
  4. Share the report file and the nsys-ui.log file that gets created.

Running the following:
nsys profile -t nvtx,osrt,cublas -s none --cpuctxsw=none -f true -o /rockshare/user/tniro/db/nvidia ./gpu_burn 30

Results attached:
nvidia.nsys-rep (350.6 KB)

I also tried using 2022.4.1.21 in my CentOS 7 container. Same issues.

NVIDIA provides a number of containers with Nsight installed, but I can’t use them: they are Ubuntu based, and nsys fails when the container runs on a CentOS 7 host because the kernel is too old. For example, using nvcr.io/nvidia/mxnet:22.12-py3 on CentOS 7, I run

root@f3a510160b6b:/workspace# nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: enabled
Linux Kernel Paranoid Level = 2
Linux Distribution = Ubuntu
Linux Kernel Version = 3.10.0-1160.81.1.el7.x86_64: Fail
Linux perf_event_open syscall available: Fail
Sampling trigger event available: Fail
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): Fail
CPU Profiling Environment (system-wide): Fail

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.

I wonder if we should convert to Ubuntu? ;-)

The failure indicated in the nsys status output on the CentOS 7 host means that the --sampling true feature will not work (switching to a newer kernel will address this). The rest of the features should work even on the older OS kernel.
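As an aside, once you are on a newer kernel you can re-check the sampling prerequisites with something like:

cat /proc/sys/kernel/perf_event_paranoid
nsys status -e

(The documentation linked in the status output describes the paranoid-level values that sampling requires.)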

The main problem in the docker container running on your machine is that CUDA traces are collected only if tracing is limited to CUDA. It sounds like a bug that my team needs to track down. How can we reproduce it on our end?

Also, could you help me by collecting the logs as I mentioned in my previous reply?

Ran the following command:

nsys profile -t cuda,nvtx,osrt,cublas  -s none --cpuctxsw=none -f true -o /rockshare/user/tniro/nvidia/test   -e NVLOG_CONFIG_FILE=/rockshare/user/tniro/nvidia/nvlog.config.template --export=sqlite ./gpu_burn 30

Log and report attached:
nsys-ui.log (367.6 KB)
test.nsys-rep (349.6 KB)

Reproducing the issue:

  1. Build container
    Dockerfile (1.4 KB)

  2. Run container:
    docker run --rm --gpus=all --cap-add=SYS_ADMIN -v $(pwd):/data -it mycontainer:latest bash

  3. Run from gpu-burn directory
    cd /opt/gpu-burn

  4. Run nsight
    nsys profile -t cuda,nvtx,osrt,cublas -s none --cpuctxsw=none -f true -o /data/test -e NVLOG_CONFIG_FILE=/data/nvlog.config.template --export=sqlite ./gpu_burn 30

Thank you for sharing the repro steps. I am able to repro the bug on my end. We will investigate and report back soon.

Profiling other CUDA apps inside the container works as expected. I think it is a problem specific to the gpu_burn app. I am able to repro the bug on my Ubuntu machine with just

> git clone https://github.com/wilicc/gpu-burn
> cd gpu-burn/
> CFLAGS="-g" LDFLAGS="-g" make
> nsys profile -t cuda,nvtx,osrt,cublas -s none --cpuctxsw=none ./gpu_burn 30

That’s interesting. Good to know. So just out of curiosity, how does the application affect what data the tool collects?

Hi tniro,

The reason why you see traces when you only trace CUDA, but not when you add other trace features, is quite complex. In short, you need to add the --trace-fork-before-exec true option when you run the Nsight Systems CLI. The gory details are below.

The problem lies in the way the gpu_burn program works. The root process doesn’t submit any GPU work itself; it creates additional processes for that. To create those processes, it calls fork, which makes a copy of the parent process image, but the children never call an exec function to execute a different program.

A process that never calls an exec function is subject to heavy restrictions if it was created from a multi-threaded process. Quoting the fork POSIX specification below:

A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.

The list of async-signal-safe functions is very short. For example, malloc is not async-signal-safe from a POSIX perspective.

The gpu_burn root process is single-threaded, so its child processes should not be subject to those restrictions. The problem is that Nsight Systems creates at least one additional thread in each process it traces, for performance reasons. As a result, the gpu_burn child processes are subject to the async-signal-safety restriction when the program is being profiled. This implies several things:

  1. A program that relies on the fork-without-exec idiom might not work when being profiled. It might create processes from a single-threaded parent, in which case the developers don’t have to limit themselves to async-signal-safe operations in the child processes. When the program is being profiled, that assumption no longer holds because of the extra thread(s) created. Thankfully, glibc tries to handle fork without exec as gracefully as possible, beyond what POSIX specifies (e.g., in practice, malloc can be called safely).
  2. Nsight Systems cannot safely trace processes that never call an exec function, because tracing requires calling non-async-signal-safe functions. For that reason, we have internal logic that disables all tracing in the child process right after a call to fork; tracing is re-enabled when an exec function is called. The --trace-fork-before-exec option modifies this behavior to allow tracing processes that never call an exec function.

When you only enable CUDA tracing, Nsight Systems’ injection libraries are loaded only when the CUDA driver initializes. That doesn’t happen in the gpu_burn root process because it doesn’t submit any GPU work. For that reason, the root process stays single-threaded, and we can trace the child processes safely even if they never call an exec function.

On the other hand, when you enable OS runtime tracing (--trace osrt), Nsight Systems preloads its injection libraries. As a result, the gpu_burn root process is multi-threaded, and the async-signal-safety restrictions apply to its child processes. For that reason, you won’t see any traces except from the root process, which doesn’t generate any GPU work.
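If you want to see this on your machine, one quick (purely illustrative) check is to count the threads of the root gpu_burn process while a profile is running, assuming it is the oldest matching process:

ps -o nlwp= -p "$(pgrep -o gpu_burn)"

With --trace cuda alone you should see a single thread; with osrt added, the preloaded injection libraries add at least one more.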

This is why you see CUDA traces when you profile gpu_burn with --trace cuda but not with --trace cuda,osrt. As I said earlier, to remedy this problem, you’d have to additionally specify --trace-fork-before-exec true.
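For example, adapting the command line from your repro steps (same paths as before):

nsys profile -t cuda,nvtx,osrt,cublas -s none --cpuctxsw=none --trace-fork-before-exec true -f true -o /data/test --export=sqlite ./gpu_burn 30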

We have an internal ticket open to have the profiler report when a process wasn’t traced because it never called an exec function and came from a multi-threaded parent, but it’s actually quite complex to do and has never been considered high priority compared to other tasks.

Excellent. Thanks.
T