Nsight version 2023.1 for Visual Studio 2022 hits breakpoints incorrectly

Hello!

I installed the latest CUDA 12.1 with the 531.05 drivers from the Gameworks Download Center | NVIDIA Developer page, followed by an installation of Nsight Visual Studio Edition 2023.1.0 (Windows). Both installs completed without errors.

I am running Windows 10 with a GeForce GTX 1080. Here is the output of nvcc --version, where nvcc, and nvidia-smi:

> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

> where nvcc
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc.exe

> nvidia-smi
Wed Apr  5 13:36:56 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.14                 Driver Version: 531.14       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080       WDDM | 00000000:01:00.0  On |                  N/A |
| 31%   42C    P8                9W / 180W|   1419MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Then I created a CUDA Runtime 12.1 project inside VS2022, which opened a template project with a single kernel.cu. Running it with the Local Windows Debugger works, and the example kernel executes correctly. The code in question is the following:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

void addWithCudaSimple(int* c, const int* a, const int* b, unsigned int size);

__global__ void addKernelSimple(int* c, const int* a, const int* b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    addWithCudaSimple(c, a, b, arraySize);

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaDeviceReset();

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
void addWithCudaSimple(int* c, const int* a, const int* b, unsigned int size)
{
    int* dev_a = 0;
    int* dev_b = 0;
    int* dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaSetDevice(0);

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));

    // Copy input vectors from host memory to GPU buffers.
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    // Launch a kernel on the GPU with one thread for each element.
    addKernelSimple<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaDeviceSynchronize();

    // Copy output vector from GPU buffer to host memory.
    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
}

I set a breakpoint on the line int i = threadIdx.x; inside addKernelSimple, built the project, and ran it with Extensions → Nsight → Start CUDA Debugging (Next-Gen). The debugger then stopped at the start of the main() function, as if Break on Launch were enabled. Image of what happens:

I confirm that I am not using the Break on Launch option. It is disabled both in NVIDIA Nsight Options (under Extensions → Nsight → Options → CUDA → CUDA Debugger RunControl (Next-Gen)) and under Extensions → Nsight → Break On Launch.

A further observation: if I use “Run to Cursor” on the line int i = threadIdx.x; in the kernel, the breakpoint is hit correctly. I then opened the Breakpoints window (Debug → Windows → Breakpoints) and observed the following:

  • When I add the breakpoint, a single breakpoint appears in the Breakpoints window, named “KernelSimple.cu, line 10”.
  • After running the code with the Nsight Next-Gen debugger, the Breakpoints window again displays a single breakpoint, this time named “KernelSimple.cu, line 15”.
  • If I stop the debugger without having used “Run to Cursor” to stop inside the kernel, the breakpoint is again listed as a single, non-expandable breakpoint: “KernelSimple.cu, line 10”.
  • However, if I stop the debugger after using “Run to Cursor” to stop inside the kernel, the breakpoint is listed as an expandable breakpoint. Expanding it reveals two sub-breakpoints: “KernelSimple.cu, line 15” and “KernelSimple.cu, line 10”.
    (screenshot: breakpoints)

Important: if I disable the “KernelSimple.cu, line 15” breakpoint and run the Nsight Next-Gen debugger again, the debugger stops correctly inside the kernel, without stopping in main.
(screenshot: breakpoints2)

Additional settings (CUDA C/C++ project properties):

Common:
  - CUDA Runtime: Static CUDA runtime library (-cudart static)
  - NVCC Compilation Type: Generate hybrid object file (--compile)

Device:
  - Interleave source in PTX: No
  - Code Generation: compute_52, sm_52
  - Enable Virtual Arch in Fatbin: Yes
  - Generate GPU Debug Information: Yes (-G)
  - Generate Line Number Information: No
  - Max Used Register: 0
  - Verbose PTXAS Output: No
  - Split compilation: Default

Is this expected behaviour, or is it a bug? Thank you in advance for taking the time to read this and for your help,

Aljoša Škorjanc
