Hello!
I have installed the latest CUDA 12.1 with the 531.05 drivers from the Gameworks Download Center | NVIDIA Developer page, followed by an installation of Nsight Visual Studio Edition 2023.1.0 (Windows). Both installs ran without errors.
I am running Windows 10 with a GeForce GTX 1080. Here is the output of nvcc and nvidia-smi:
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
> where nvcc
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc.exe
> nvidia-smi
Wed Apr 5 13:36:56 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.14 Driver Version: 531.14 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1080 WDDM | 00000000:01:00.0 On | N/A |
| 31% 42C P8 9W / 180W| 1419MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
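For completeness, the same version information can also be queried programmatically. A minimal sketch (a hypothetical standalone file, not part of the project below) using the CUDA runtime calls cudaDriverGetVersion and cudaRuntimeGetVersion:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    int driverVersion = 0, runtimeVersion = 0;

    // Versions are encoded as 1000 * major + 10 * minor,
    // e.g. 12010 for CUDA 12.1.
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);

    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10,
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    return 0;
}
```

On this setup, both values should report 12.1, matching nvcc and nvidia-smi above.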
Then, I created a CUDA Runtime 12.1 project inside VS2022, which opened a template project with a single kernel.cu. Running the project with the Local Windows Debugger works, and the example kernel executes correctly. The code in question is the following:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

void addWithCudaSimple(int* c, const int* a, const int* b, unsigned int size);

__global__ void addKernelSimple(int* c, const int* a, const int* b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    addWithCudaSimple(c, a, b, arraySize);

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaDeviceReset();
    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
void addWithCudaSimple(int* c, const int* a, const int* b, unsigned int size)
{
    int* dev_a = 0;
    int* dev_b = 0;
    int* dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaSetDevice(0);

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));

    // Copy input vectors from host memory to GPU buffers.
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    // Launch a kernel on the GPU with one thread for each element.
    addKernelSimple<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel.
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaDeviceSynchronize();

    // Copy output vector from GPU buffer to host memory.
    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
}
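(As an aside, unrelated to the breakpoint issue: the helper above only checks the kernel-launch status. A minimal sketch of checking every runtime call with a small hypothetical macro, reusing the same addKernelSimple kernel, could look like:)

```cuda
#include "cuda_runtime.h"
#include <stdio.h>

// Hypothetical helper: print the failing call and bail out on the first error.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            goto Error;                                           \
        }                                                         \
    } while (0)

__global__ void addKernelSimple(int* c, const int* a, const int* b);

void addWithCudaChecked(int* c, const int* a, const int* b, unsigned int size)
{
    int* dev_a = 0;
    int* dev_b = 0;
    int* dev_c = 0;

    CUDA_CHECK(cudaSetDevice(0));
    CUDA_CHECK(cudaMalloc((void**)&dev_c, size * sizeof(int)));
    CUDA_CHECK(cudaMalloc((void**)&dev_a, size * sizeof(int)));
    CUDA_CHECK(cudaMalloc((void**)&dev_b, size * sizeof(int)));
    CUDA_CHECK(cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice));

    addKernelSimple<<<1, size>>>(dev_c, dev_a, dev_b);

    // cudaGetLastError catches launch errors; cudaDeviceSynchronize catches
    // errors raised while the kernel was running.
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    CUDA_CHECK(cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost));

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
}
```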
I then set a breakpoint on the line int i = threadIdx.x; inside addKernelSimple, built the project, and ran it with Extensions → Nsight → Start CUDA Debugging (Next-Gen). This resulted in the debugger stopping at the start of the main() function, as if Break on Launch were enabled. Image of what happens:
I confirm that I am not using the Break on Launch option. It is disabled both in NVIDIA Nsight Options (under Extensions → Nsight → Options → CUDA → CUDA Debugger RunControl (Next-Gen)) and under Extensions → Nsight → Break On Launch.
Further observation: if I use “Run to Cursor” on the line int i = threadIdx.x; of the kernel, the breakpoint is correctly hit. I also opened the Breakpoints window (Debug → Windows → Breakpoints) and observed the following:
- When I add the breakpoint, a single breakpoint appears in the Breakpoints window, named KernelSimple.cu, line 10.
- However, after running the code with the Nsight Next-Gen debugger, the Breakpoints window again displays a single breakpoint, this time named KernelSimple.cu, line 15.
- If I stop the debugger without using “Run to Cursor” to stop inside the kernel, the breakpoint in the Breakpoints window is again listed as a single, unexpandable breakpoint: KernelSimple.cu, line 10.
- However, if I stop the debugger after using “Run to Cursor” to stop inside the kernel, the breakpoint in the Breakpoints window is listed as an expandable breakpoint. Expanding it reveals two sub-breakpoints: KernelSimple.cu, line 15 and KernelSimple.cu, line 10.
Important: if I disable the KernelSimple.cu, line 15 breakpoint and run the Nsight Next-Gen debugger again, the debugger stops correctly inside the kernel, without stopping in main().
Additional settings (CUDA C/C++ project properties):
Common:
- CUDA Runtime: Static CUDA runtime library (-cudart static)
- NVCC Compilation Type: Generate hybrid object file (--compile)
Device:
- Interleave source in PTX: No
- Code Generation: compute_52, sm_52
- Enable Virtual Arch in Fatbin: Yes
- Generate GPU Debug Information: Yes (-G)
- Generate Line Number Information: No
- Max Used Register: 0
- Verbose PTXAS Output: No
- Split compilation: Default
Is this expected behaviour, or is this a bug? Thank you in advance for taking the time to read this, and for any help.
Aljoša Škorjanc