Nsight nsys not collecting any CUDA kernel data (2023.1.2.43-32377213v0)

I’m using NVIDIA Nsight Systems version 2023.1.2.43-32377213v0 to profile a GPU run on a GeForce RTX 3080 Ti Laptop GPU as follows.

$ nsys profile -s none --cpuctxsw none --trace=cuda -o gpu_  <app>
...
Generating '/tmp/nsys-report-9419.qdstrm'
[1/1] [========================100%] gpu_.nsys-rep

$ nsys stats --report gpukernsum gpu_.nsys-rep 
Processing [gpu_.sqlite] with [/opt/nvidia/nsight-systems/2023.1.2/host-linux-x64/reports/gpukernsum.py]... 
SKIPPED: gpu_.sqlite does not contain CUDA kernel data.

However, as shown above, nsys doesn’t collect any GPU kernel data in the profile. I can confirm that the app is definitely running on the GPU, as evidenced by nvidia-smi’s output: watching it with watch -n 0.2 nvidia-smi, I can see the app running, the GPU memory and compute utilization changing constantly, and the temperature/power reflecting the run.

Other relevant information:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080 T…     On  | 00000000:01:00.0 Off |                  N/A |
| N/A   60C    P8              21W / 115W |    134MiB / 16384MiB |     49%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2289      G   /usr/lib/xorg/Xorg                          133MiB |
+---------------------------------------------------------------------------------------+

Hi @uday1,

Could you open the report with the Nsight Systems GUI and check if there are Diagnostic messages related to CUDA? You can check this in the Diagnostics Summary view.

Is it possible to share the report?

Could you provide some more information about the application? Is it a Python application?
Could you run one of the CUDA samples and see if CUDA traces are collected for that?

I just ran the standard CUDA bandwidth test from /usr/local/cuda-11.8/extras/demo_suite/bandwidthTest. Nothing was collected.

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11491.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12822.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			445886.2

Result = PASS

gpu_.sqlite (680 KB)
gpu_.nsys-rep (256.1 KB)

nsys report data is attached.

Thanks for providing the additional information.

The reason for the gpukernsum output is that the specific CUDA sample, bandwidthTest, does not launch kernels. It mainly allocates memory on the GPU and transfers data between the host and the device, or between devices.

If you use other available scripts, you will be able to see CUDA activity.
E.g., nsys stats --report cuda_api_sum gpu_.nsys-rep

You can study the code for this CUDA sample by following this GitHub link.
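
For reference, if you want a workload that does launch kernels, even a trivial program like the sketch below (purely illustrative, not one of the shipped CUDA samples) should produce rows in gpukernsum, assuming CUDA tracing itself is working:

// minimal_kernel.cu -- illustrative sketch only; build with: nvcc minimal_kernel.cu -o minimal_kernel
#include <cstdio>
#include <cuda_runtime.h>

// A trivial kernel so that nsys has at least one kernel launch to record.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc((void **)&d, n * sizeof(float));   // API call only; no kernel activity
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);  // kernel launch; this is what gpukernsum summarizes
    cudaDeviceSynchronize();
    cudaFree(d);
    printf("done\n");
    return 0;
}

Profiling this with the same nsys profile --trace=cuda command and then running nsys stats --report gpukernsum on the resulting report should list the scale kernel.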

But cuda_api_sum doesn’t generate any output for me either!

$ nsys stats -q --report cuda_api_sum --format table gpu_.nsys-rep
$

Does the above yield any output for you on the file I attached?

You are right, let me try to reproduce this.


I couldn’t reproduce this on my end; CUDA is being traced.
I used the same CUDA sample and the following setup:

  • NVIDIA GeForce RTX 3080 Ti
  • Driver 530.30.02
  • CUDA 11.8
  • Nsight Systems 2023.1.2.43-32377213v0

Are you able to see CUDA traces when profiling other samples?
Is there anything special on your setup? Are you using a VM?

Does updating to the latest Nsight Systems 2023.2.1 collect CUDA traces?

To continue debugging, could you please collect and share logs for this profiling? To collect logs please follow these steps:

  1. Save the following content to /tmp/nvlog.config:

     + 100iwef global
     $ /tmp/nsight-sys.log
     ForceFlush
     Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]:$text

  2. Add --env-var=NVLOG_CONFIG_FILE=/tmp/nvlog.config to your nsys command line, e.g. nsys profile --env-var=NVLOG_CONFIG_FILE=/tmp/nvlog.config -s none --cpuctxsw=none --trace=cuda -o gpu_ /usr/local/cuda-11.8/extras/demo_suite/bandwidthTest
  3. Run a collection. There should be logs at /tmp/nsight-sys.log. Share this file.

Do you have a use case where you need to use a BETA driver? Maybe using a recommended version would be more stable.

I’m using CUDA 12.1 (as opposed to 11.8 in your setup). Also, it’s a 3080 Ti Laptop GPU that I have here.

NVIDIA GeForce RTX 3080 Ti Laptop GPU
Driver 530.30.02
CUDA 12.1
Nsight Systems 2023.1.2.43-32377213v0
Ubuntu 22.04 LTS

There is nothing special in my setup. I’m running stock Ubuntu 22.04. No VMs.

No CUDA traces are seen on profiling anything IIUC; the output of nsys stats on the generated report is always empty.

2023.2.1 is not yet available via apt from NVIDIA’s repos, so I can’t readily install it. I’d prefer to get it working on this setup if at all possible.

Followed the remaining steps you suggested to collect logs; attached.

Regarding the BETA driver: I didn’t understand the question. Am I using a non-stable version of some driver? All of the packages are installed from official NVIDIA Ubuntu repos.
nsight-sys.log (41.2 KB)

Thanks for providing the log file! I can’t see anything that leads to the cause of the issue.

The next step would be to try Nsight Compute, ncu, and see if that tool is able to trace CUDA. If you have installed the CUDA toolkit you should have ncu already on your device. Otherwise you can install it with sudo apt install cuda-nsight-compute-12-1.

You can try to profile vectorAdd with ncu, e.g., ncu -o vectorAdd_profile /usr/local/cuda-11.8/extras/demo_suite/vectorAdd. You could open the collected report with the Nsight Compute GUI to verify that CUDA traces were collected. Or share the report here.

All of the packages are installed from official NVIDIA Ubuntu repos.

I agree that 530.30.02 is the version distributed by the repo; you could get the latest recommended driver version from the following link: Official Drivers | NVIDIA

The currently installed driver though should not be an issue. Nsight Systems is expected to work on your setup.

The latest version of Nsight Systems can be found at this link. You may need to create a free account to access the content.

You can download either Nsight Systems 2023.2.1 (Linux Host .run Installer), a self-contained installation script that adds no dependencies and can later be removed simply by deleting the installation directory, or Nsight Systems 2023.2.1 (Linux Host .deb Installer), which installs as a regular deb package.

ncu isn’t profiling anything either.

$ ncu --target-processes all /usr/local/cuda-11.8/extras/demo_suite/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

==PROF== Connected to process 77986 (/usr/local/cuda-11.8/extras/demo_suite/bandwidthTest)
 Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11526.5

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12849.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			420266.4

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==PROF== Disconnected from process 77986
==WARNING== No kernels were profiled.

Hi @uday1 ,

This is expected behavior, since bandwidthTest does not launch any kernels.

Please run ncu -o vectorAdd_profile /usr/local/cuda-11.8/extras/demo_suite/vectorAdd and provide the report file vectorAdd_profile.ncu-rep here. Or the CLI output, if a report is not created.

By the way, Nsight Systems 2023.2.3, nsight-systems-2023.2.3, is available through apt, if you want to give the latest nsys version a try.

Thanks. I just realized that a bit later. Here’s the result of:
ncu -f -o vectorAdd_profile /usr/local/cuda-11.8/extras/demo_suite/vectorAdd

[Vector addition of 50000 elements]
==PROF== Connected to process 181163 (/usr/local/cuda-11.8/extras/demo_suite/vectorAdd)
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
==PROF== Profiling "vectorAdd" - 0: 0%....50%....100% - 9 passes
Copy output data from the CUDA device to the host memory
Test PASSED
Done
==PROF== Disconnected from process 181163
==PROF== Report: /home/uday/vectorAdd_profile.ncu-rep

Note: I had to run ncu with elevated privileges to avoid having to reboot immediately for the kernel module setting on perf counter permissions to take effect.

vectorAdd_profile.ncu-rep (43.0 KB)

Thanks for providing the Nsight Compute report file. This narrows down the potential sources of this issue.

The next step would be to use a simple injection library while running the CUDA sample to see if CUPTI is the source of the issue.

To do that please:

  1. Download cuda-injection-library.tar.gz (970 KB) on your laptop.
  2. Extract the files from the archive, tar -xvf cuda-injection-library.tar.gz
  3. cd cuda-injection-library
  4. make
  5. LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64 CUDA_INJECTION64_PATH=./libToolsInjectionCuda.so /usr/local/cuda-11.8/extras/demo_suite/bandwidthTest >injection_log.txt 2>&1
  6. Share the injection_log.txt file, and the CLI output if there are errors.
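
For context (this is not the code in the attached archive, which is prebuilt for this purpose), a CUDA_INJECTION64_PATH injection library of this kind typically just enables CUPTI activity collection when the CUDA driver loads it and flushes the activity buffers at exit. A minimal sketch, assuming the standard CUPTI headers and library from the CUDA toolkit, might look like this:

// injection_sketch.cpp -- minimal illustrative sketch, NOT the library in the attached archive.
// Assumed build line: g++ -shared -fPIC injection_sketch.cpp \
//     -I/usr/local/cuda/extras/CUPTI/include -L/usr/local/cuda/extras/CUPTI/lib64 -lcupti \
//     -o libInjectionSketch.so
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cupti.h>

// CUPTI asks for a buffer to fill with activity records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size, size_t *maxNumRecords) {
    *size = 8 * 1024 * 1024;            // 8 MiB activity buffer
    *buffer = (uint8_t *)malloc(*size);
    *maxNumRecords = 0;                 // 0 = let CUPTI fill the whole buffer
}

// CUPTI hands back a completed buffer; here we only count the records.
static void CUPTIAPI bufferCompleted(CUcontext, uint32_t, uint8_t *buffer,
                                     size_t, size_t validSize) {
    CUpti_Activity *record = nullptr;
    size_t count = 0;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS)
        ++count;
    fprintf(stderr, "CUPTI delivered %zu activity records\n", count);
    free(buffer);
}

// Force-flush any outstanding activity records when the target process exits.
static void atExitFlush() {
    CUptiResult res = cuptiActivityFlushAll(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED);
    if (res != CUPTI_SUCCESS)
        fprintf(stderr, "cuptiActivityFlushAll failed: %d\n", (int)res);
}

// Entry point the CUDA driver looks up when CUDA_INJECTION64_PATH is set.
extern "C" int InitializeInjection(void) {
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);             // bandwidthTest only does memcpys
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);  // kernels, for other samples
    atexit(atExitFlush);
    return 1;                                                    // non-zero signals success
}

The attached library differs in its logging details; the sketch is only meant to show where a CUPTI failure would surface in this flow.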

Here is the resulting injection_log.txt – pasted inline below as well.
injection_log.txt (1.1 KB)

13:38:13.130.174|360360|Lib.cpp:566[InitializeInjection]: Initializing CUDA tracing
13:38:13.136.022|360360|Lib.cpp:348[EnableCollection]: Starting collection
13:38:13.136.181|360360|Lib.cpp:580[InitializeInjection]: CUDA tracing initialized
13:38:13.354.183|360360|Lib.cpp:514[AtExitHandler]: Flushing CUPTI buffers on exit
13:38:13.354.439|360360|Lib.cpp:515[AtExitHandler]: FATAL: cuptiActivityFlushAll(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED) failed: CUPTI_ERROR_INVALID_DEVICE
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12386.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12844.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     257483.8

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Thanks a lot for providing the injection library logs! This shows that there is an issue with CUPTI.

Unfortunately I cannot reproduce this issue on my end, so we would need to collect more detailed error messages from CUPTI.

Could you please use this updated injection library and collect logs again?

  1. Download cuda-injection-library.tar.gz (950 KB) on your laptop.
  2. Extract the files from the archive, tar -xvf cuda-injection-library.tar.gz
  3. cd cuda-injection-library
  4. make
  5. LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64 CUDA_INJECTION64_PATH=./libToolsInjectionCuda.so /usr/local/cuda-11.8/extras/demo_suite/bandwidthTest >injection_log.txt 2>&1
  6. Share the injection_log.txt file, and the CLI output if there are errors.

Please make sure to use the updated tar archive, the one attached to this message.
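
For context, the extra detail in an updated library like this usually comes from resolving CUPTI status codes into readable strings. A minimal sketch of such an error-reporting helper, assuming the standard cupti.h header (again, not the code in the attached archive), could look like:

// cupti_check.h -- illustrative sketch of CUPTI error reporting; not the attached library's code.
#include <cstdio>
#include <cupti.h>

// Wrap a CUPTI call and print a human-readable error string on failure.
#define CUPTI_CHECK(call)                                                  \
    do {                                                                   \
        CUptiResult _res = (call);                                         \
        if (_res != CUPTI_SUCCESS) {                                       \
            const char *errstr = nullptr;                                  \
            cuptiGetResultString(_res, &errstr);                           \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__,  \
                    #call, errstr ? errstr : "unknown CUPTI error");       \
        }                                                                  \
    } while (0)

// Example use:
//   CUPTI_CHECK(cuptiActivityFlushAll(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED));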

Hi @uday1, we followed up with the CUPTI team internally and they have confirmed that it is a bug in CUPTI. From CUPTI 11.8 onwards, support for the GeForce RTX 3080 Ti Laptop GPU has been broken. They are working on a fix, which will be made available in a future release of nsys. In the meantime, you could go back to the CUDA 11.7 driver (version 515) and use the latest nsys version, if needed.

Sorry for the inconvenience so far. I understand it has been frustrating. Your help with debugging is greatly appreciated.

We have a new version of nsys that contains the fix from CUPTI: Nsight Systems | NVIDIA Developer
Please try it out and let us know if you are still running into problems.

Thanks, nsys 2023.3 resolves this (I used the .deb installer). But it isn’t yet available in NVIDIA’s CUDA repo for Ubuntu 22.04 for easy installation.

You could use the .deb from https://developer.download.nvidia.com/devtools/repos
This has the latest version of nsys. The CUDA repo usually lags behind a bit and is slower to get updated.
