Nsys can't capture anything (cuda programs only)

Hello everyone, I’d like to ask for help with a problem I ran into:
I wrote some CUDA code, compiled it with nvcc, and want to profile it with nsys. However, nsys did not capture anything, and the analysis column showed:

No NVTX events collected. Does the process use NVTX?
No CUDA events collected. Does the process use CUDA?
No OS runtime libraries events collected. Does the process use OS runtime libraries?

I am sure that my code launches CUDA kernels successfully, because the cudaEvent timer works and I also profiled the kernel successfully with ncu.
Strangely, the nsys timeline is completely blank, without even a CPU trace.

But when I profiled python xxxx.py (which does some GPU operations through PyTorch) instead, profiling succeeded.

I’m currently using Ubuntu 24, CUDA 12.8, nsight-systems-2024.6.2, and driver version 570.133.20.
I can confirm that nsys was still working fine (for the exact same cuda program) just a day ago. But since I’m sharing a server with others, I have no way of knowing if they’ve made some changes to the system.

Output of nsys status --all:

Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 4
Linux Distribution = Ubuntu
Linux Kernel Version = 6.11.0-26-generic: OK
Linux perf_event_open syscall available: Fail
Sampling trigger event available: Fail
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): Fail
CPU Profiling Environment (system-wide): Fail

Network Profiling Environment Check
OFED version: Not Available
Network features' library dependencies: Fail

What should I check to diagnose this? In short, nsys can profile a Python program but cannot profile a CUDA executable.

Addendum: in the diagnostics summary of the nsys-rep file, I can’t see anything like this:

Injection 177939 00:00.136 Common injection library initialized successfully.
Injection 177939 00:00.142 OS runtime libraries injection initialized successfully.
Analysis 00:02.465 Scheduling information is absent. The thread activity is deduced based on OS runtime libraries traces. This is inaccurate and does not take into account asynchronous interrupts and exception faults.
Analysis 177939 00:02.465 Number of NVTX events collected: 21.
Analysis 177939 00:02.465 Number of CUDA events collected: 2,360.
Analysis 177939 00:02.465 Number of OS runtime libraries events collected: 5,287.
Injection 177939 00:03.995 Buffers holding CUDA trace data will be flushed on CudaProfilerStop() call. See --flush-on-cudaprofilerstop to control this behavior.
Injection 177939 00:04.006 Loaded CUPTI library: /usr/local/cuda-12.8/nsight-systems-2024.6.2/target-linux-x64/libcupti.so.12.8
Injection 177939 00:04.245 CUDA injection initialized successfully.
Injection 177939 00:05.051 NVTX injection initialized successfully.
Injection 177939 00:06.464 Number of CUPTI events produced: 2,478, CUPTI buffers: 50.

The above is from a Python process that I profiled successfully, but none of these injection messages appear in the profile of the nvcc-compiled executable.
So I wonder whether the injection is simply not happening when the CUDA executable runs. I could not find relevant information online, so I don’t know how to check this.

Can you give me the nsys command line you ran?

One thing that I notice is that the system has a Linux kernel paranoid level of 4, which is going to stop essentially all of the CPU profiling information that we get from the Linux perf subsystem. Do you know if that was a recent change?
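
For anyone following along, the paranoid level mentioned above can be read directly from procfs. The snippet below is only a rough sketch: describe_paranoid is a hypothetical helper, and the threshold of 2 is my understanding of what the Nsight Systems documentation asks for, so treat it as an assumption.

```shell
# Sketch: read the kernel's perf_event_paranoid setting and relate it to the nsys warnings.
# describe_paranoid is a hypothetical helper, not part of nsys.
paranoid_file=/proc/sys/kernel/perf_event_paranoid

describe_paranoid() {
    # $1: numeric paranoid level (may be negative)
    if [ "$1" -le 2 ]; then
        echo "level $1: at or below 2, this setting should not block nsys CPU sampling"
    else
        echo "level $1: above 2, expect nsys to disable CPU sampling and context-switch tracing"
    fi
}

if [ -r "$paranoid_file" ]; then
    describe_paranoid "$(cat "$paranoid_file")"
else
    echo "cannot read $paranoid_file"
fi
```

On a shared server the value can only be lowered by an administrator, for example with sudo sysctl -w kernel.perf_event_paranoid=2. This setting only concerns CPU sampling and context-switch data.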

Thanks for your reply, here is the nsys command line I used:

nsys profile --trace=osrt,cuda,nvtx --trace-fork-before-exec=true --cuda-graph-trace=node ./grouped_gemm

with the below output:

WARNING: CPU IP/backtrace sampling not supported, disabling.
Try the 'nsys status --environment' command to learn more.

WARNING: CPU context switch tracing not supported, disabling.
Try the 'nsys status --environment' command to learn more.

Collecting data...
Average time for 10 runs: 0.548506 ms
Memory bandwidth: 3439.141061 GB/s
Generating '/home/shixuan/HLSS/.tmp/nsys-report-ef3a.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:
    /home/shixuan/HLSS/test/test_mlp_computation/report1.nsys-rep

Unfortunately, the diagnostic summary of nsys is:

Furthermore, if I profile a Python program (PyTorch-based) that does some GPU operations, the nsys-rep output is normal. The command looks like this:

nsys profile --trace=osrt,cuda,nvtx --trace-fork-before-exec=true --cuda-graph-trace=node python run.py

This makes me somewhat confident that it’s probably not an issue with the Linux kernel paranoid level. Thanks though, I’ll ask the admin about this.

Am I reading that correctly? Your runs are averaging 1/2 a millisecond? Can you do a longer run?

I’m wondering if the run is so short that the CUPTI library is not having time to fully initialize.

Because you aren’t specifically turning off the CPU-side sampling, Nsys is trying to run it. However, the paranoid level prevents that. But that isn’t your problem.
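
If you want to take CPU sampling out of the picture and silence the two warnings, the sampling and context-switch collectors can be switched off explicitly. A minimal sketch follows: profile_gpu_only and the NSYS variable are hypothetical names introduced here so a specific nsys install can be selected, while --sample=none and --cpuctxsw=none are the nsys profile options for disabling those collectors.

```shell
# Sketch: run nsys with CPU sampling and context-switch tracing explicitly disabled.
# NSYS is a hypothetical override for the nsys binary; it defaults to whatever is on PATH.
profile_gpu_only() {
    "${NSYS:-nsys}" profile --sample=none --cpuctxsw=none --trace=cuda,nvtx,osrt "$@"
}

# Example usage (left as comments so the snippet does not require nsys to be installed):
#   profile_gpu_only ./grouped_gemm
#   NSYS=/home/shixuan/nsight-systems-2025.1.1/bin/nsys profile_gpu_only ./grouped_gemm
```

This does not address the empty CUDA trace. It only removes the CPU-side collectors that the paranoid level blocks anyway.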

Hi, the code actually runs the kernel 10 times, 0.5 ms each time (5 ms in total).
And I just re-tested and ran the kernel 1000 times, but the situation did not change.

I think the initialization of the CUPTI library should not be a problem? Because from my understanding, nsys will wait until the initialization is complete before starting to execute the real command.

@liuyis am I off base here?

Hi @chenhongyu2048, could you share the report file? Also, could you try a more recent Nsys version from Nsight Systems - Get Started | NVIDIA Developer just in case it’s something already fixed?

Of course. The nsys-rep file is attached below. I generated it with nsys 2025.1.1.
report1.zip (141.7 KB)

Thank you. The report does look strange in that everything is empty.

Could you collect logs from Nsys for us to take a deeper look?

  1. Save the following to /tmp/nvlog.config:
+ 100iwef   global
$ /tmp/nsight-sys.log
ForceFlush
Format $sevc$time|${name:0}|PID${pid:0}|TID${tid:0}|${file:0}:${line:0}[${sfunc:0}]:$text
  2. Add the environment variable NVLOG_CONFIG_FILE=/tmp/nvlog.config when running Nsys, e.g.
export NVLOG_CONFIG_FILE=/tmp/nvlog.config
nsys profile ...
  3. Run the collection.
  4. There should be a log file at /tmp/nsight-sys.log. Share it with us.
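
The logging setup above can also be scripted. The sketch below writes the same config through a quoted here-document, so the shell does not expand the $sevc, ${name:0} and similar placeholders before they reach the file. write_nvlog_config is a hypothetical helper name; the config lines are copied from the steps above.

```shell
# Sketch: create the Nsys logging config and point NVLOG_CONFIG_FILE at it.
# write_nvlog_config is a hypothetical helper; $1 is the destination path.
write_nvlog_config() {
    # The quoted 'EOF' delimiter keeps every $... placeholder literal.
    cat > "$1" <<'EOF'
+ 100iwef   global
$ /tmp/nsight-sys.log
ForceFlush
Format $sevc$time|${name:0}|PID${pid:0}|TID${tid:0}|${file:0}:${line:0}[${sfunc:0}]:$text
EOF
}

write_nvlog_config /tmp/nvlog.config
export NVLOG_CONFIG_FILE=/tmp/nvlog.config
echo "wrote $NVLOG_CONFIG_FILE"
```

After this, run the nsys profile command in the same shell so that it inherits NVLOG_CONFIG_FILE.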

Thank you for your patience. The file is as follows:

nsight-sys.zip (135.1 KB)

Thank you. One thing I noticed from the log is that your system has the TMPDIR environment variable set to “/home/shixuan/HLSS/.tmp”. Is that intentional?

Nsys should be able to handle even a non-default TMPDIR path like this, and I’m still checking if there’s anything wrong in our logic, but just sharing this initial finding in case that helps anything.

Yes, I set TMPDIR on purpose. But nsight-sys.log was written to the /tmp folder when I generated it.

I’m wondering if there’s some permission issue with the /home/shixuan/HLSS/.tmp folder that prevented the intermediate profiling files from being written and/or read. Could you try creating a different folder, setting it as TMPDIR, and seeing if there’s any difference? Or, if possible, could you try running Nsys with sudo and see if there’s any difference?
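
A quick way to test the permission theory without Nsys is to try creating and removing a file in the directory. This is only a sketch: check_tmpdir is a hypothetical helper, and passing this check does not prove that Nsys can use the directory, only that basic create and delete operations work for the current user.

```shell
# Sketch: check that a directory exists and that the current user can create files in it.
# check_tmpdir is a hypothetical helper; $1 is the directory to probe.
check_tmpdir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # mktemp fails if the directory is not writable by this user.
    probe=$(mktemp "$dir/nsys_probe.XXXXXX" 2>/dev/null) || {
        echo "not writable: $dir"
        return 1
    }
    rm -f "$probe"
    echo "writable: $dir"
}

# Probe the directory Nsys would use, falling back to /tmp when TMPDIR is unset.
check_tmpdir "${TMPDIR:-/tmp}" || true
```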

If the above doesn’t help, could you try another experiment:

  1. Run the following Nsys command:
nsys profile -t osrt -w false yes
  2. In a different terminal, run the following command. Please replace <your TMPDIR path for Nsys> with the actual path.
ls -lR --time-style=full-iso <your TMPDIR path for Nsys>/nvidia/nsight_systems/quadd_session_*
  3. Wait for 10 seconds and repeat step 2.
  4. Attach the outputs from steps 2 and 3. You can then kill the Nsys command.

The reason is that the log says the intermediate profiling files stored at <your TMPDIR path for Nsys>/nvidia/nsight_systems/quadd_session_* are older than the beginning of the collection time and are therefore discarded. I’m trying to figure out whether they are actually too old or whether there’s some bug in Nsys.
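
Independently of Nsys, it is possible to check whether files created under TMPDIR get timestamps that lag the local clock, which can happen on network filesystems. This is a hypothesis, not something confirmed in this thread, and mtime_skew below is a hypothetical helper. The sketch assumes GNU stat, as shipped on Ubuntu.

```shell
# Sketch: compare the mtime of a freshly created file with the local clock.
# mtime_skew is a hypothetical helper; $1 is the directory to probe.
# Prints (file mtime - local time) in seconds; values far from 0 suggest clock skew.
mtime_skew() {
    dir=$1
    probe=$(mktemp "$dir/skew_probe.XXXXXX") || return 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$probe")   # GNU stat: modification time as seconds since the epoch
    rm -f "$probe"
    echo $((mtime - now))
}

# Example: probe the directory Nsys would use, falling back to /tmp when TMPDIR is unset.
echo "skew in seconds: $(mtime_skew "${TMPDIR:-/tmp}")"
```

A result within a second or two of zero would make skew an unlikely explanation, in which case the ls -lR output requested above is the more useful data point.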

I tried this again:

echo $TMPDIR
/home/shixuan/HLSS/test_tmp
echo $NVLOG_CONFIG_FILE
/tmp/nvlog.config

test_tmp is my newly created folder.
The content of nvlog.config is unchanged.

Then I ran:

/home/shixuan/nsight-systems-2025.1.1/bin/nsys profile --trace=osrt,cuda,nvtx --trace-fork-before-exec=true --cuda-graph-trace=node ./grouped_gemm

I got the nsight-sys.log as below.
nsight-sys.zip (98.1 KB)

However, since no quadd_session_* files or folders were generated in the /home/shixuan/HLSS/test_tmp/nvidia/nsight_systems folder, I could not carry out the remaining steps.