Nsight CUDA Debugger: "CUDA grid launch failed"

Hello everyone,

I have installed CUDA and started working with it (I have successfully run several .cu programs and tested nvcc and nvprof).

I would now like to debug my CUDA code, that is, to step into the kernels.

I have made several attempts, always with the same output: CUDA grid launch failed.

Let me illustrate this with one example, the matrixMul CUDA sample, which is used in the Nsight Visual Studio Edition User Guide:
https://docs.nvidia.com/nsight-visual-studio-edition/2019.3/Nsight_Visual_Studio_Edition_User_Guide.htm#Debugging_CUDA_Application.htm%3FTocPath%3DCUDA%2520Debugger|_____1

My system (as reported under Nsight -> Windows -> System Info) is:

Name Intel(R) Core™ i7-4700MQ CPU @ 2.40GHz
Architecture x64
Frequency 2 394 MHz
Number of Cores 8
Page Size 4 096
Total Physical Memory 7 993.00 MB
Available Physical Memory 2 091.00 MB
Hybrid Graphics Enabled False
Version Name Windows 10 Enterprise
Version Number 10.0.19041
Nsight Version 2019.1.0.19017
Nsight Edition Standard
Visual Studio Version 14.0

For further information on my equipment, the deviceQuery CUDA sample returns:
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro K3100M"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4096 MBytes (4294967296 bytes)
( 4) Multiprocessors, (192) CUDA Cores/MP: 768 CUDA Cores
GPU Max Clock rate: 706 MHz (0.71 GHz)
Memory Clock rate: 1600 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

I have followed all the instructions in the "Debugging a CUDA Application" section of the CUDA Debugger walkthrough in the Nsight User Guide. When I launch the NVIDIA CUDA Debugger (Legacy), the Nsight output is:

CUDA context created : 1e9fa5a26e0
CUDA module loaded: 1e9fd0c20d0 matrixMul.cu
CUDA grid launch failed: CUcontext: 2104439219936 CUmodule: 2104484438224 Function: _Z13MatrixMulCUDAILi32EEvPfS0_S0_ii
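For anyone reading this output: the hexadecimal handles on the first two lines and the decimal CUcontext/CUmodule values on the failing line appear to be the same numbers, and the mangled function name looks like the kernel from matrixMul.cu. Below is a minimal C++ sketch for checking both. It assumes a GCC or Clang toolchain (it relies on `abi::__cxa_demangle` from `<cxxabi.h>`, which MSVC does not provide), and the helper names `demangle` and `hex_to_decimal` are made up for illustration; they are not part of Nsight or CUDA.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang only; assumed toolchain)
#include <cstdlib>    // std::free
#include <string>     // std::string, std::stoull

// Hypothetical helper: turn an Itanium-ABI mangled symbol into a readable
// signature. Returns the input unchanged if demangling fails.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the demangler allocates the buffer with malloc
    return result;
}

// Hypothetical helper: parse a hexadecimal handle as printed in the log
// (no "0x" prefix required) into its numeric value.
unsigned long long hex_to_decimal(const std::string& hex) {
    return std::stoull(hex, nullptr, 16);
}
```

With the values from the log, `hex_to_decimal("1e9fa5a26e0")` gives 2104439219936 and `hex_to_decimal("1e9fd0c20d0")` gives 2104484438224, so the failing launch refers to the context and module reported on the two lines before it. `demangle("_Z13MatrixMulCUDAILi32EEvPfS0_S0_ii")` gives `void MatrixMulCUDA<32>(float*, float*, float*, int, int)`, i.e. the kernel template instantiated with the argument 32.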

I have also tried increasing the TDR Delay to 10 in the Nsight Monitor options, without any success.

I am stuck on this, and any help or suggestion would be greatly appreciated.