How to debug CUDA Python kernel code in Visual Studio on Windows 11

I have recently been developing a program based on CUDA Python, but I haven’t found a suitable way to debug the kernel function yet. Is there any way to debug CUDA Python kernel code in Visual Studio on Windows 11? My environment is Visual Studio 2019 with CUDA 12.3 and Nsight for VS already installed.

Hi, @tuberge

Thanks for using our tools. I have two questions:

  1. Is this PyTorch, or something else?
  2. Are the kernels you want to debug ones that you wrote (and can be compiled with -G), or ones from a library (compiled without -G)?

Thank you very much for your response, @veraj. First of all, this is just a regular Python program: I compile the CUDA code through NVRTC and call it from Python. This approach comes from the CUDA Python 12.3.0 documentation. Below is my sample code, which also comes from the official examples. My question is whether there is a way to step into the CUDA kernel function, for example into extern "C" __global__ void timedReduction(const float *hinput, float *output, clock_t *timer) in the sample code.

As for the second question, I have also tried the '-G' option, but it seems to have no effect after compilation: the debugger skips the clock_nvrtc = '''…''' part entirely. In addition, yesterday I tried Debug -> Attach to Process -> Attach to 'GPU (Nsight VSE) Code' in Visual Studio, as described in the forum topic "Attach NSIGHT CUDA debugger to a Dll from NVCC?", but that also had no effect. Looking forward to your further reply!

# Copyright 2021-2023 NVIDIA Corporation.  All rights reserved.
#
# Please refer to the NVIDIA end user license agreement (EULA) associated
# with this source code for terms and conditions that govern your use of
# this software. Any use, reproduction, disclosure, or distribution of
# this software and related documentation outside the terms of the EULA
# is strictly prohibited.
import numpy as np
from cuda import cuda
from examples.common import common
from examples.common.helper_cuda import checkCudaErrors, findCudaDevice

clock_nvrtc = '''\
extern "C" __global__  void timedReduction(const float *hinput, float *output, clock_t *timer)
{
    // __shared__ float shared[2 * blockDim.x];
    extern __shared__ float shared[];

    const int tid = threadIdx.x;
    const int bid = blockIdx.x;

    if (tid == 0) timer[bid] = clock();

    // Copy hinput.
    shared[tid] = hinput[tid];
    shared[tid + blockDim.x] = hinput[tid + blockDim.x];

    // Perform reduction to find minimum.
    for (int d = blockDim.x; d > 0; d /= 2)
    {
        __syncthreads();

        if (tid < d)
        {
            float f0 = shared[tid];
            float f1 = shared[tid + d];

            if (f1 < f0)
            {
                shared[tid] = f1;
            }
        }
    }

    // Write result.
    if (tid == 0) output[bid] = shared[0];

    __syncthreads();

    if (tid == 0) timer[bid+gridDim.x] = clock();
}
'''

NUM_BLOCKS = 64 
NUM_THREADS  = 256

def main():
    print("CUDA Clock sample")

    timer = np.empty(NUM_BLOCKS * 2, dtype='int64')
    hinput = np.empty(NUM_THREADS * 2, dtype='float32')

    for i in range(0, NUM_THREADS * 2):
        hinput[i] = i

    devID = findCudaDevice()
    kernelHelper = common.KernelHelper(clock_nvrtc, devID)
    kernel_addr = kernelHelper.getFunction(b'timedReduction')

    dinput = checkCudaErrors(cuda.cuMemAlloc(np.dtype(np.float32).itemsize * NUM_THREADS * 2))
    doutput = checkCudaErrors(cuda.cuMemAlloc(np.dtype(np.float32).itemsize * NUM_BLOCKS))
    dtimer = checkCudaErrors(cuda.cuMemAlloc(np.dtype(np.int64).itemsize * NUM_BLOCKS * 2))
    checkCudaErrors(cuda.cuMemcpyHtoD(dinput, hinput, np.dtype(np.float32).itemsize * NUM_THREADS * 2))



    arr = ((dinput, doutput, dtimer),
           (None, None, None))

    checkCudaErrors(cuda.cuLaunchKernel(kernel_addr,
                                        NUM_BLOCKS, 1, 1,  # grid dim
                                        NUM_THREADS, 1, 1, # block dim
                                        np.dtype(np.float32).itemsize * 2 *NUM_THREADS, 0, # shared mem, stream
                                        arr, 0)) # arguments

    checkCudaErrors(cuda.cuCtxSynchronize())
    checkCudaErrors(cuda.cuMemcpyDtoH(timer, dtimer, np.dtype(np.int64).itemsize * NUM_BLOCKS * 2))
    checkCudaErrors(cuda.cuMemFree(dinput))
    checkCudaErrors(cuda.cuMemFree(doutput))
    checkCudaErrors(cuda.cuMemFree(dtimer))

    avgElapsedClocks = 0.0

    for i in range(0,NUM_BLOCKS):
        avgElapsedClocks += timer[i + NUM_BLOCKS] - timer[i]

    avgElapsedClocks = avgElapsedClocks / NUM_BLOCKS
    print("Average clocks/block = {}".format(avgElapsedClocks))

if __name__=="__main__":
    main()
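
For reference, the '-G' attempt described above can be sketched as follows when calling NVRTC directly through the cuda-python bindings. This is only a minimal sketch: the helper names build_nvrtc_options and compile_to_ptx are hypothetical (they are not part of cuda-python or of the sample's KernelHelper), and the compute capability is an assumption supplied by the caller. As the replies in this thread indicate, compiling with device debug information does not by itself give source-level stepping in Nsight VSE for this workflow.

```python
def build_nvrtc_options(major, minor, debug=True):
    """Return NVRTC options as a list of bytes (hypothetical helper)."""
    # Target architecture, e.g. compute_75 for a compute capability 7.5 device.
    opts = [f"--gpu-architecture=compute_{major}{minor}".encode()]
    if debug:
        # --device-debug is the long form of -G (generate device debug info).
        opts.append(b"--device-debug")
    return opts


def compile_to_ptx(source, name, options):
    """Compile CUDA C++ source to PTX with NVRTC.

    Requires the cuda-python package and a CUDA installation; the import is
    done lazily so that build_nvrtc_options works without them.
    """
    from cuda import nvrtc

    err, prog = nvrtc.nvrtcCreateProgram(source.encode(), name.encode(), 0, [], [])
    if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
        raise RuntimeError(f"nvrtcCreateProgram failed: {err}")

    (err,) = nvrtc.nvrtcCompileProgram(prog, len(options), options)
    if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
        # Fetch the compiler log so the failure reason is visible.
        _, log_size = nvrtc.nvrtcGetProgramLogSize(prog)
        log = b" " * log_size
        nvrtc.nvrtcGetProgramLog(prog, log)
        raise RuntimeError(log.decode(errors="replace"))

    err, ptx_size = nvrtc.nvrtcGetPTXSize(prog)
    if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
        raise RuntimeError(f"nvrtcGetPTXSize failed: {err}")
    ptx = b" " * ptx_size
    (err,) = nvrtc.nvrtcGetPTX(prog, ptx)
    if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
        raise RuntimeError(f"nvrtcGetPTX failed: {err}")
    return ptx
```

A call might look like compile_to_ptx(clock_nvrtc, "timedReduction.cu", build_nvrtc_options(7, 5)), where the file name and the 7.5 compute capability are placeholders.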

Hi, @tuberge

You can create a dummy CUDA project, set the Python interpreter as the launch executable, and pass your Python sample as the command-line argument.
Then you should be able to start Next-Gen debugging with "Break on Launch" enabled. Note, however, that you will get "frame not in the module" and can only debug in the disassembly view.
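
To fill in the launch settings of such a dummy project, it can help to print the exact interpreter path and script path first. The snippet below is a small sketch of that; the function name debugger_launch_settings, the dictionary keys, and the script name clock_nvrtc.py are hypothetical and simply mirror the kind of fields (command, command arguments, working directory) found in a project's debugging settings.

```python
import sys
from pathlib import Path


def debugger_launch_settings(script, interpreter=None):
    """Return values to copy into the dummy project's debugging settings."""
    # Default to the interpreter running this snippet, so the debugger
    # launches the same environment that has cuda-python installed.
    interpreter = interpreter or sys.executable
    script_path = Path(script).resolve()
    return {
        "command": interpreter,
        # Quote the path in case it contains spaces.
        "arguments": f'"{script_path}"',
        "working_directory": str(script_path.parent),
    }


if __name__ == "__main__":
    # "clock_nvrtc.py" is a placeholder for your own script.
    for key, value in debugger_launch_settings("clock_nvrtc.py").items():
        print(f"{key}: {value}")
```

Run it with the same Python environment you intend to debug, then copy the printed values into the project.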

Thank you very much for your constructive feedback, @veraj. Following your suggestion, I was able to reproduce the debugging process. However, reading disassembled code is a bit difficult for me. It would be very useful if a tool could map the disassembly back to the CUDA source and show the corresponding variable values. Thank you again for your reply.

I am sorry, but this is not currently supported in Nsight VSE. We’ll track your request internally, and I will let you know if there is any update.

Thank you for your reply, @veraj. I’m very glad that I could be of help. The replies have been a great help, and development is going well so far. I’m looking forward to future support for cuda-python in Nsight VSE. Thank you once again for your help!

You are welcome! If you run into any new issues, feel free to start a new topic and we’ll do our best to help!