"NvCudaDebuggerInjector.dll" fails to load when running cuda next-gen debugger

I’m trying to debug my CUDA kernel with the CUDA Next-Gen debugger.

My breakpoints aren’t hit. I checked the “Modules” window in VS and found that the mentioned DLL fails to load. The Output window shows that there is an attempt to load it, and then it is immediately unloaded.
Please advise.

Do you have up to date drivers (get version from nvidia-smi)?
Are you running on a supported GPU (Pascal or higher for the Next-Gen debugger)?
Do you have multiple GPUs (if so, is CUDA_VISIBLE_DEVICES set to the correct GPU)?
What version of Nsight VSE are you using?
Does the program run properly without the debugger (or with the Legacy debugger if on a Kepler or Maxwell GPU)?
What version of Windows 10 are you running (must be Redstone 3 or later for the Next-Gen debugger)?
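In case it helps to gather the driver and GPU answers in one go, here is a minimal Python sketch. It assumes nvidia-smi is on the PATH and uses its --query-gpu option with the index, name and driver_version fields. The helper names (parse_gpu_rows, collect_gpu_info) are made up for this post and are not part of any NVIDIA tool.

```python
import csv
import io
import os
import subprocess


def parse_gpu_rows(smi_output):
    """Parse 'index, name, driver_version' CSV rows printed by nvidia-smi."""
    rows = []
    reader = csv.reader(io.StringIO(smi_output), skipinitialspace=True)
    for fields in reader:
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        index, name, driver = (f.strip() for f in fields)
        rows.append({"index": int(index), "name": name, "driver": driver})
    return rows


def collect_gpu_info():
    """Run nvidia-smi (assumed to be on PATH) and report GPUs plus CUDA_VISIBLE_DEVICES."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {
        "gpus": parse_gpu_rows(out),
        # None means the variable is not set, so all GPUs are visible to CUDA.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
```

Calling collect_gpu_info() on the debug machine and pasting the result covers the driver, GPU and CUDA_VISIBLE_DEVICES questions. The Nsight VSE and Windows versions still have to be looked up by hand.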

Although there are mechanisms in place to clean up or run multiple instances, it is possible that an orphaned NvDebugAgent.exe was not terminated. While the debugger is not running, search for it in the Task Manager and manually kill it if you find it.
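If you would rather check from a script than from the Task Manager, something like the following should work. It is only a sketch: it relies on the standard Windows tasklist command with its /FI, /FO CSV and /NH options, and the function names are my own invention.

```python
import csv
import io
import subprocess

AGENT_IMAGE = "NvDebugAgent.exe"


def parse_tasklist_csv(output, image=AGENT_IMAGE):
    """Return PIDs of rows whose image name matches, from 'tasklist /FO CSV /NH' output."""
    pids = []
    for fields in csv.reader(io.StringIO(output)):
        # A matching row looks like: "NvDebugAgent.exe","1234","Console","1","10,000 K"
        # The "INFO: No tasks are running..." message has a single field and is skipped.
        if (len(fields) >= 2
                and fields[0].lower() == image.lower()
                and fields[1].isdigit()):
            pids.append(int(fields[1]))
    return pids


def find_orphaned_agents():
    """Ask Windows for running NvDebugAgent.exe processes (Windows only)."""
    out = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {AGENT_IMAGE}", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_tasklist_csv(out)
```

Run find_orphaned_agents() only while no debug session is active. Any PID it returns can then be ended from the Task Manager or with taskkill /PID <pid> /F.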

Driver: 445.87
Card: RTX 2080
CUDA: 10.2
Nsight version 2021.1
OS: Win 10 Enterprise version 10.0.18363 build 18363

Running on a single GPU.

Can you verify that
REG QUERY "HKLM\SOFTWARE\NVIDIA Corporation\GPUDebugger" /v EnableInterface is set
(and set it if not)?
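For anyone who wants to script that check, here is a rough Python sketch that wraps the same REG QUERY call and parses its output. Two assumptions of mine, not confirmed in this thread: the value is stored as a REG_DWORD, and any non-zero value counts as "set". The function names are hypothetical.

```python
import re
import subprocess

KEY = r"HKLM\SOFTWARE\NVIDIA Corporation\GPUDebugger"
VALUE = "EnableInterface"


def parse_reg_query(output, value_name=VALUE):
    """Extract an integer from 'REG QUERY ... /v NAME' output, or None if absent.

    Assumption: the value is a REG_DWORD printed in hex, e.g. 0x1.
    """
    pattern = re.compile(
        r"^\s*" + re.escape(value_name) + r"\s+REG_DWORD\s+(0x[0-9a-fA-F]+)\s*$",
        re.MULTILINE,
    )
    match = pattern.search(output)
    return int(match.group(1), 16) if match else None


def enable_interface_is_set():
    """Run REG QUERY (Windows only); True if the value exists and is non-zero.

    Assumption: non-zero means enabled.
    """
    result = subprocess.run(
        ["reg", "query", KEY, "/v", VALUE],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False  # key or value not found
    value = parse_reg_query(result.stdout)
    return value is not None and value != 0
```

The sketch only reads the registry. Changing the value needs an elevated prompt and is left out on purpose.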

Is it possible to bring your CUDA Toolkit up to the most recent 11.3 version and your driver up to the recommended r465? It would be best to get all three pieces of the software stack at the same level.

Hello,

First of all, thanks for your help.

The mentioned flag was already on.

Since my last post, I upgraded everything:

CUDA 11.3

newest driver (465…)

I assume you’re seeing the same behavior after all of your updates?

Yes, I am.

It feels like there is a dependency missing.
Can you run procmon or a similar app to check this out?

Thank you @inbaltomer. We need more information about the specifics of your scenario to better understand why NvCudaDebuggerInjection.dll is not being loaded. The following is a modified version of the matrixMul CUDA sample that loads from a DLL with breakpoints working as expected:

matrixMul.zip (7.6 KB)

What is different in your usage scenario?

Thank you.