Hello everyone, I would like to ask for help with a problem I encountered:
I wrote a piece of CUDA code, compiled it with nvcc, and tried to profile it with nsys. However, nsys did not capture anything, and the analysis column showed:
No NVTX events collected. Does the process use NVTX?
No CUDA events collected. Does the process use CUDA?
No OS runtime libraries events collected. Does the process use OS runtime libraries?
I am sure that my code launched its CUDA kernels successfully, because the cudaEvent timer worked, and I also profiled the kernel successfully with ncu.
Strangely, the nsys timeline is completely blank, without even a CPU trace.
But when I changed the profiled command to python xxxx.py (which does some GPU operations through PyTorch), profiling succeeded.
I’m currently using Ubuntu 24, CUDA 12.8, nsight-systems-2024.6.2, Driver Version: 570.133.20.
I can confirm that nsys was still working fine (for the exact same CUDA program) just a day ago. But since I’m sharing a server with others, I have no way of knowing whether they’ve made changes to the system.
Output of nsys status --all:
Timestamp counter supported: Yes
CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 4
Linux Distribution = Ubuntu
Linux Kernel Version = 6.11.0-26-generic: OK
Linux perf_event_open syscall available: Fail
Sampling trigger event available: Fail
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): Fail
CPU Profiling Environment (system-wide): Fail
Network Profiling Environment Check
OFED version: Not Available
Network features' library dependencies: Fail
What should I do to diagnose this problem? In short, nsys can profile a Python program but cannot profile a CUDA executable.
Addendum: in the diagnostics summary of the nsys-rep file for the CUDA executable, I can’t see anything like this:
Injection 177939 00:00.136 Common injection library initialized successfully.
Injection 177939 00:00.142 OS runtime libraries injection initialized successfully.
Analysis 00:02.465 Scheduling information is absent. The thread activity is deduced based on OS runtime libraries traces. This is inaccurate and does not take into account asynchronous interrupts and exception faults.
Analysis 177939 00:02.465 Number of NVTX events collected: 21.
Analysis 177939 00:02.465 Number of CUDA events collected: 2,360.
Analysis 177939 00:02.465 Number of OS runtime libraries events collected: 5,287.
Injection 177939 00:03.995 Buffers holding CUDA trace data will be flushed on CudaProfilerStop() call. See --flush-on-cudaprofilerstop to control this behavior.
Injection 177939 00:04.006 Loaded CUPTI library: /usr/local/cuda-12.8/nsight-systems-2024.6.2/target-linux-x64/libcupti.so.12.8
Injection 177939 00:04.245 CUDA injection initialized successfully.
Injection 177939 00:05.051 NVTX injection initialized successfully.
Injection 177939 00:06.464 Number of CUPTI events produced: 2,478, CUPTI buffers: 50.
The above is from a Python process that I profiled successfully, but none of these injection messages appear in the profile of the compiled CUDA executable.
So I wonder whether no injection happens at all when the CUDA executable runs. Given the lack of relevant information on the Internet, I don’t know how to investigate this.
One thing that I notice is that the system has a Linux kernel paranoid level of 4, which is going to stop essentially all of the CPU profiling information that we get from the Linux perf subsystem. Do you know if that was a recent change?
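For anyone who wants to check the paranoid level outside of nsys, here is a minimal Python sketch. It assumes the standard Linux location /proc/sys/kernel/perf_event_paranoid. The threshold of 2 is an assumption based on my reading of the Nsight Systems requirements, and the function names are made up for illustration.

```python
from pathlib import Path

# Standard Linux location of the setting (assumed to exist on this system).
PARANOID_PATH = "/proc/sys/kernel/perf_event_paranoid"


def read_paranoid_level(path=PARANOID_PATH):
    """Return the perf_event_paranoid value as an int, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def cpu_sampling_allowed(level, max_level=2):
    """True if the level is at or below max_level.

    max_level=2 is an assumption, not something confirmed in this thread.
    """
    return level is not None and level <= max_level


if __name__ == "__main__":
    level = read_paranoid_level()
    print("perf_event_paranoid:", level)
    print("CPU sampling likely allowed:", cpu_sampling_allowed(level))
```

A level of 4, as reported by nsys status above, would make this report that CPU sampling is not allowed, which matches the "Fail" lines in the environment check.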
WARNING: CPU IP/backtrace sampling not supported, disabling.
Try the 'nsys status --environment' command to learn more.
WARNING: CPU context switch tracing not supported, disabling.
Try the 'nsys status --environment' command to learn more.
Collecting data...
Average time for 10 runs: 0.548506 ms
Memory bandwidth: 3439.141061 GB/s
Generating '/home/shixuan/HLSS/.tmp/nsys-report-ef3a.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:
/home/shixuan/HLSS/test/test_mlp_computation/report1.nsys-rep
Am I reading that correctly? Your runs are averaging 1/2 a millisecond? Can you do a longer run?
I’m wondering if the run is so short that the CUPTI library is not having time to fully initialize.
Because you aren’t specifically turning off the CPU-side sampling, Nsys is trying to run it, and the paranoid level prevents that. But that isn’t your problem.
Hi, the code actually runs the kernel 10 times, 0.5 ms each time (5 ms in total).
I just re-tested with the kernel running 1000 times, but the situation did not change.
I don’t think the initialization of the CUPTI library is the problem, because from my understanding nsys waits until initialization is complete before it starts executing the real command.
Thank you. One thing I noticed from the log is that your system has the TMPDIR environment variable set to “/home/shixuan/HLSS/.tmp”. Is that intentional?
Nsys should be able to handle even a non-default TMPDIR path like this, and I’m still checking if there’s anything wrong in our logic, but just sharing this initial finding in case that helps anything.
I’m wondering if there’s some permission issue with the /home/shixuan/HLSS/.tmp folder that prevented the intermediate profiling files from being written and/or read. Could you try creating a different folder, setting it as TMPDIR, and seeing if there’s any difference? Or, if possible, could you try using Nsys with sudo and see if there’s any difference?
If the above doesn’t help, could you try another experiment:
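As a quick way to rule out the permission question, something like the following Python sketch can be run as the same user that runs Nsys. It only checks that a file can be created and written in the directory; it says nothing about how Nsys itself uses TMPDIR, and check_tmpdir is a hypothetical helper name.

```python
import os
import tempfile


def check_tmpdir(path):
    """Return (ok, message) describing whether a file can be written in path."""
    if not os.path.isdir(path):
        return False, f"{path} is not a directory"
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"probe")
            handle.flush()
    except OSError as exc:
        return False, f"cannot write in {path}: {exc}"
    return True, f"{path} is writable"


if __name__ == "__main__":
    # Falls back to /tmp when TMPDIR is not set.
    print(check_tmpdir(os.environ.get("TMPDIR", "/tmp")))
```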
1. Run the following Nsys command:
nsys profile -t osrt -w false yes
2. In a different terminal, run the following command, replacing <your TMPDIR path for Nsys> with the actual path:
ls -lR --time-style=full-iso <your TMPDIR path for Nsys>/nvidia/nsight_systems/quadd_session_*
3. Wait for 10 seconds and repeat step 2.
4. Attach the outputs from steps 2 and 3. You can then kill the Nsys command.
The reason I’m asking is that the log says the intermediate profiling files stored at <your TMPDIR path for Nsys>/nvidia/nsight_systems/quadd_session_* are older than the beginning of the collection time and are therefore discarded. I’m trying to figure out whether they are actually too old or whether there’s some bug in Nsys.
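If it is easier than comparing ls timestamps by eye, the same check can be roughly sketched in Python. This walks the quadd_session_* entries under the TMPDIR path and prints how old each file is relative to the local clock. The directory layout is taken from the path above; session_file_ages is a hypothetical helper, not part of Nsys.

```python
import glob
import os
import time


def session_file_ages(tmpdir, now=None):
    """Return [(path, age_in_seconds)] for files under quadd_session_* entries."""
    now = time.time() if now is None else now
    pattern = os.path.join(tmpdir, "nvidia", "nsight_systems", "quadd_session_*")
    rows = []
    for session in sorted(glob.glob(pattern)):
        if os.path.isdir(session):
            # Collect every regular file below the session directory.
            for root, _dirs, files in os.walk(session):
                for name in sorted(files):
                    path = os.path.join(root, name)
                    rows.append((path, now - os.stat(path).st_mtime))
        else:
            rows.append((session, now - os.stat(session).st_mtime))
    return rows


if __name__ == "__main__":
    for path, age in session_file_ages(os.environ.get("TMPDIR", "/tmp")):
        print(f"{age:10.3f} s old  {path}")
```

Files written during the current collection should show an age of a few seconds at most. A large age, or a negative one, would suggest that the timestamps on that filesystem disagree with the local clock.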
I got the nsight-sys.log, attached below. nsight-sys.zip (98.1 KB)
However, no quadd_session_* files or folders were generated in the /home/shixuan/HLSS/test_tmp/nvidia/nsight_systems folder, so I could not perform the subsequent steps.