Nsight version 2023.1 for Visual Studio 2022 hits breakpoints incorrectly

Hello!

I installed the latest CUDA 12.1 with the 531.05 drivers from the Gameworks Download Center | NVIDIA Developer page, followed by an installation of Nsight Visual Studio Edition 2023.1.0 (Windows). Both installs completed without errors.

I am running Windows 10 with a GeForce GTX 1080. Here is the output of nvcc --version, where nvcc, and nvidia-smi:

> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

> where nvcc
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc.exe

> nvidia-smi
Wed Apr  5 13:36:56 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.14                 Driver Version: 531.14       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080       WDDM | 00000000:01:00.0  On |                  N/A |
| 31%   42C    P8                9W / 180W|   1419MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Then I created a CUDA Runtime 12.1 project inside VS2022, which opened a template project with a single kernel.cu. Running it with the Local Windows Debugger works, and the example kernel executes correctly. The code in question is the following:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

void addWithCudaSimple(int* c, const int* a, const int* b, unsigned int size);

__global__ void addKernelSimple(int* c, const int* a, const int* b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    addWithCudaSimple(c, a, b, arraySize);

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaDeviceReset();

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
void addWithCudaSimple(int* c, const int* a, const int* b, unsigned int size)
{
    int* dev_a = 0;
    int* dev_b = 0;
    int* dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaSetDevice(0);

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));

    // Copy input vectors from host memory to GPU buffers.
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    // Launch a kernel on the GPU with one thread for each element.
    addKernelSimple<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaDeviceSynchronize();

    // Copy output vector from GPU buffer to host memory.
    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
}

I set a breakpoint on the line int i = threadIdx.x; inside addKernelSimple, built the project, and ran it with Extensions → Nsight → Start CUDA Debugging (Next-Gen). The debugger then stopped at the start of the main() function, as if Break on Launch were enabled. Image of what happens:

I confirm that I am not using the Break on Launch option. It is disabled both in NVIDIA Nsight Options (under Extensions → Nsight → Options → CUDA → CUDA Debugger RunControl (Next-Gen)) and under Extensions → Nsight → Break On Launch.

A further observation: if I use “Run to Cursor” on the line int i = threadIdx.x; in the kernel, the breakpoint is hit correctly. I then opened the Breakpoints window (Debug → Windows → Breakpoints) and observed the following:

  • When I add the breakpoint, a single breakpoint appears in the Breakpoints window, named “KernelSimple.cu, line 10”.
  • After running the code with the Nsight Next-Gen debugger, the Breakpoints window again displays a single breakpoint, this time named “KernelSimple.cu, line 15”.
  • If I stop the debugger without having used “Run to Cursor” to stop inside the kernel, the breakpoint is again listed as a single, non-expandable breakpoint: “KernelSimple.cu, line 10”.
  • However, if I stop the debugger after using “Run to Cursor” to stop inside the kernel, the breakpoint is listed as an expandable breakpoint. Expanding it reveals two sub-breakpoints: “KernelSimple.cu, line 15” and “KernelSimple.cu, line 10”.
    (screenshot: breakpoints)

Important: if I disable the “KernelSimple.cu, line 15” breakpoint and run the Nsight Next-Gen debugger again, the debugger stops correctly inside the kernel, without stopping in main.
(screenshot: breakpoints2)

Additional settings (CUDA C/C++ project properties):

Common:
  - CUDA Runtime: Static CUDA runtime library (-cudart static)
  - NVCC Compilation Type: Generate hybrid object file (--compile)

Device:
  - Interleave source in PTX: No
  - Code Generation: compute_52, sm_52
  - Enable Virtual Arch in Fatbin: Yes
  - Generate GPU Debug Information: Yes (-G)
  - Generate Line Number Information: No
  - Max Used Register: 0
  - Verbose PTXAS Output: No
  - Split compilation: Default

Is this expected behaviour, or is it a bug? Thank you in advance for taking the time to read this and for your help,

Aljoša Škorjanc
