Cannot profile RTX 2060 KO (TU104) with CUDA 11.0 on Windows and Ubuntu

Hello,

From my reading of the documentation (https://developer.nvidia.com/nsight-compute), Nsight Compute and nvprof should be able to produce detailed profiling metrics for any TU1xx chip.
However, it does not work with my RTX 2060.

nvprof runs fine with the "summary" options (just regular tracing):

nvprof.exe 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\extras\demo_suite\vectorAdd.exe'
# or
nvprof.exe -o output.nvvp -f  'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\extras\demo_suite\vectorAdd.exe'

But advanced profiling does not work:


nvprof.exe -o output.nvvp -f --analysis-metrics  'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\extras\demo_suite\vectorAdd.exe' 

This produces the following warning in the output and does not generate any detailed information about the executed kernels:

======== Warning: Skipping profiling on device 0 since profiling is not supported on devices with compute capability 7.5 and higher.
                  Use NVIDIA Nsight Compute for GPU profiling and NVIDIA Nsight Systems for GPU tracing and CPU sampling.
                  Refer https://developer.nvidia.com/tools-overview for more details.

======== Warning: The option --aggregate-mode on has no effect. The --aggregate-mode <on|off> option applies to --events and --metrics options that follow it.
======== Warning: The option --aggregate-mode off has no effect. The --aggregate-mode <on|off> option applies to --events and --metrics options that follow it.
[Vector addition of 50000 elements]
==19508== NVPROF is profiling process 19508, command: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\extras\demo_suite\vectorAdd.exe
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
==19508== Generated result file: C:\Users\Agostini\output.nvvp

I have also tried dual-booting into Ubuntu 20.04 and I receive the same error. Furthermore, on Windows, "MS Visual Studio 2019 > Nsight > Start performance analysis…" detects the device, but upon profiling execution the following error occurs:

Attempted to perform CUDA trace on an unsupported CUDA device. Serialized kernel trace mode has been used.

I have also tried to use Nsight Compute on both Windows and Ubuntu without success (it gives an error, but it is not descriptive).

Is the RTX 2060 KO (TU104) supported by CUDA 11.0 tools?
What consumer cards from the Turing generation support detailed profiling?

Thank you in advance

Hi N B Agostini,

The nvprof and NVIDIA Visual Profiler tools don't support profiling events and metrics on Turing and later GPU architectures; they only support tracing (timeline) activities on Turing. These limitations are documented in the profiler guide in the section https://docs.nvidia.com/cuda/profiler-users-guide/index.html#migrating-to-nsight-tools.

Nsight Compute supports profiling on Turing TU1xx cards. Did you try the GUI or the CLI? Can you please paste the full error log?
Do you encounter the error message "ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device …"? Profiling tools require you to start profiling as the root user, or to have an administrator grant profiling permission to non-root users.
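
If that permission error turns out to be the cause on Linux, NVIDIA's published guidance for ERR_NVGPUCTRPERM is to pass a module option to the nvidia kernel driver. A minimal sketch (the file name is my choice, and a reboot is required for the option to take effect):

```shell
# Sketch based on NVIDIA's ERR_NVGPUCTRPERM guidance: allow non-root users
# to read GPU performance counters via an nvidia kernel-module option.
# In practice this file belongs in /etc/modprobe.d/ and requires a reboot.
conf_line='options nvidia NVreg_RestrictProfilingToAdminUsers=0'
printf '%s\n' "$conf_line" > nvidia-profiling.conf
# Then, as root:  cp nvidia-profiling.conf /etc/modprobe.d/  &&  reboot
cat nvidia-profiling.conf
```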

Thank you for sharing this link and information. I did not come across it during my investigation, and it explains a lot.

Following your suggestion, I am currently trying Nsight Compute on Windows.

My setup is:
Profiling

Target platform - Windows
Application Executable - C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0/extras/demo_suite/vectorAdd.exe

Activity - Profile
Output file - C:/Users/Agostini/nvvp_workspace/nsight/output
Force Overwrite - Yes
Target Process - Application Only
Command line (auto-generated) - "C:/Program Files/NVIDIA Corporation/Nsight Compute 2020.1.0/target/windows-desktop-win7-x64/ncu.exe" --export C:/Users/Agostini/nvvp_workspace/nsight/output --force-overwrite --target-processes application-only --kernel-regex-base function --launch-skip-before-match 0 --section LaunchStats --section Occupancy --section SpeedOfLight --sampling-interval auto --sampling-max-passes 5 --sampling-buffer-size 33554432 --profile-from-start 1 --cache-control all --clock-control base --apply-rules yes "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0/extras/demo_suite/vectorAdd.exe"

The profiler attempts to connect to different IP:ports:

==PROF== Attempting to connect to ncu-ui at 10.15.187.74:50160...
==PROF== Connected to ncu-ui at 10.15.187.74:50160.
[Vector addition of 50000 elements]
==PROF== Connected to process 13200 (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\extras\demo_suite\vectorAdd.exe)
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
==PROF== Profiling "vectorAdd" - 1: 0%..

At this point the screen goes black for about two seconds, and when it comes back I observe this error:

Launched process: ncu.exe (pid: 1916)
Launch succeeded.
Profiling...
==PROF== Connected to process 15284 (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\extras\demo_suite\vectorAdd.exe)

==PROF== Profiling "vectorAdd" - 1: 
==ERROR== Error: UnknownError

==PROF== Disconnected from process 15284

==ERROR== The application returned an error code (1).

==ERROR== An error occurred while trying to profile.

==PROF== Report: output.ncu-rep

Process terminated.
Loading report file C:/Users/Agostini/nvvp_workspace/nsight/output.ncu-rep...

The file C:/Users/Agostini/nvvp_workspace/nsight/output.ncu-rep opens, but the Details page is incomplete, with yellow exclamation marks for all Speed Of Light metrics.

Additional notes:

  • I am profiling the same GPU that is rendering the display to my monitor
  • I have “allow access to GPU performance counters to all users” on
  • I have tried to run Nsight Compute as Administrator

Running the command line in CMD produces the same error:

[Vector addition of 50000 elements]
==PROF== Connected to process 5184 (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\extras\demo_suite\vectorAdd.exe)
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
==PROF== Profiling "vectorAdd" - 1: 0%....50%....100% - 9 passes

==ERROR== Error: UnknownError
Copy output data from the CUDA device to the host memory
Failed to copy vector C from device to host (error code the launch timed out and was terminated)!
==PROF== Disconnected from process 5184
==ERROR== The application returned an error code (1).
==ERROR== An error occurred while trying to profile.
==PROF== Report: output.ncu-rep

I can possibly try the same on Ubuntu (without a display running) later.

I am unsure what to do now. Let me know what I should try next.

It appears that it is not possible to profile a GPU that is also rendering images to your monitor (Xorg or the Windows UI). Profiling works if the GPU is just rendering a virtual terminal (Ctrl+Alt+Fx).

I switched to Ubuntu 20.04 and tried the Nsight Compute UI with root privileges, but my screen freezes during profiling and the computer restarts (at the same spot at which Windows flashes a black screen). The same happens if I try the command-line interface on Ubuntu.

However, if I switch to a virtual terminal (Ctrl+Alt+F3) and execute ncu with sudo privileges, I am finally able to collect kernel metrics.

# Monitor is receiving the virtual terminal image from the GPU, but the Xorg process is idle.
sudo /usr/local/cuda-11/bin/ncu -o report /usr/local/cuda-11/extras/demo_suite/vectorAdd
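
To check ahead of time whether a display server still holds a context on the GPU, something like this may help (the process names and the nvidia-smi output parsing are assumptions; adapt them to your system):

```shell
# Sketch: succeed if nvidia-smi lists a display-server process on the GPU.
# The process names checked for (Xorg, gnome-shell) are assumptions.
gpu_busy_with_display() {
    command -v nvidia-smi >/dev/null 2>&1 && \
        nvidia-smi 2>/dev/null | grep -Eq 'Xorg|gnome-shell'
}

if gpu_busy_with_display; then
    echo "display server holds the GPU; switch to a virtual terminal first"
else
    echo "no display context detected (or nvidia-smi is unavailable)"
fi
```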

@mjain Is this a bug, or are we not meant to do detailed profiling on the same GPU that renders the display?

Thank you,
Nico

You can profile on the same GPU that is driving the display. However, there are additional restrictions in that case. First of all, on Windows the operating system will forcefully stop any long-running kernel that appears to hang the display, which includes kernels that run longer due to profiling overhead. You can check https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html#known-issues for details.

Enabling certain metrics can cause GPU kernels to run longer than the driver's watchdog time-out limit. In these cases the driver will terminate the GPU kernel, resulting in an application error, and profiling data will not be available. Please disable the driver watchdog time-out before profiling such long-running CUDA kernels.
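
On Linux, the NVIDIA X driver documents an "Interactive" option that controls this watchdog for the X server. A sketch of the relevant xorg.conf fragment (the identifier name is illustrative; use with care, since this also disables hang detection):

```
Section "Device"
    Identifier "nvidia-gpu"          # illustrative name, an assumption
    Driver     "nvidia"
    Option     "Interactive" "0"     # disables the GPU watchdog for this X screen
EndSection
```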

In addition, results for certain metrics might differ significantly, due to influence from other contexts on the same GPU that cannot be completely isolated.

Profiling a kernel while other contexts are active on the same device (e.g. X server, or secondary CUDA or graphics application) can result in varying metric values for L2/FB (Device Memory) related metrics. Specifically, L2/FB traffic from non-profiled contexts cannot be excluded from the metric results. To completely avoid this issue, profile the application on a GPU without secondary contexts accessing the same device (e.g. no X server on Linux).

Thank you for the suggestion @felix_dt .
I followed the windows instructions:

  1. Opened Nsight Monitor with Run as administrator
  2. Clicked on the tray icon
  3. Clicked on “Nsight Monitor options”
  4. In General > Microsoft Display Driver, I changed “WDDM TDR Enabled” to “False”
  5. Reboot the machine to apply changes
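
For reference, the TDR settings Nsight Monitor changes in step 4 correspond to documented registry values under GraphicsDrivers; a sketch of the equivalent .reg fragment (the values shown are examples, not recommendations):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; TdrLevel: 0 = detection disabled, 3 = default (recover on timeout)
"TdrLevel"=dword:00000000
; TdrDelay: seconds before the watchdog fires (default is 2)
"TdrDelay"=dword:0000000f
```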

Then I opened Nsight Compute and profiled SOL metrics on a simple kernel. However, the screen froze and I had to hard-reset the PC.

With "WDDM TDR Enabled" set back to "True" and "WDDM TDR Delay" changed from 2 to 15, profiling with Nsight Compute makes the screen freeze for 15 seconds, then go black for 1-2 seconds, and then the session resumes but the profiling fails.

I am not sure what is wrong. I have used Nsight Compute from CUDA 10.2 and 11.0 on Windows 10 2004, running with a 2080 Ti (WDDM, rendering remote desktop sessions) without problems. I am unsure why this current system, with a 2060 KO, can't be profiled.

Any additional ideas?

Thank you in advance

EDIT: added the "reboot the machine" step

Did you reboot the machine in-between changing this option?

Yes! I did, I will update my message for future reference.