Cuda-gdb crash when trying to debug kernel launched through `cudaLaunchCooperativeKernel`

I'm trying to debug a kernel that is launched through `cudaLaunchCooperativeKernel`, since the kernel uses grid syncs. When I run it inside cuda-gdb after setting a breakpoint inside the kernel, cuda-gdb crashes with the following stack trace:

cuda-gdb/12/gdb/cuda/cuda-kernel.c:545: internal-error: kernels_start_kernel: Assertion `module' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x62e137 gdb_internal_backtrace()
0x9e2a04 internal_vproblem(internal_problem*, char const*, int, char const*, __va_list_tag*)
0x9e2d5c internal_verror(char const*, int, char const*, __va_list_tag*)
0xb96181 internal_error(char const*, int, char const*, ...)
0x6e3159 kernels_start_kernel(unsigned int, unsigned long, unsigned long, unsigned long, unsigned long, CuDim3, CuDim3, CUDBGKernelType, unsigned long, CUDBGKernelOrigin, bool, CuDim3)
0x6ed309 warp_get_kernel(unsigned int, unsigned int, unsigned int)
0x6e143c cuda_iterator_step(cuda_iterator_t*)
0x6e1c4e cuda_iterator_end(cuda_iterator_t*)
0x6f2ca2 cuda_breakpoint_hit_p(cuda_coords_t&)
0x5b3d57 cuda_nat_linux<amd64_linux_nat_target>::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>)
0x817002 thread_db_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>)
0x9b7b5a target_wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>)
0x7d76b9 do_target_wait_1(inferior*, ptid_t, target_waitstatus*, enum_flags<target_wait_flag>)
0x7e66c3 fetch_inferior_event()
0xb96c0c gdb_wait_for_event(int)
0xb96df6 gdb_do_one_event()
0x828b24 captured_command_loop()
0x82a3d4 gdb_main(captured_main_args*)
0x5689a4 main
---------------------
cuda-gdb/12/gdb/cuda/cuda-kernel.c:545: internal-error: kernels_start_kernel: Assertion `module' failed.

I tried CUDA 11.7, CUDA 12.0, and CUDA 12.1 (both nvcc and cuda-gdb), but the result is the same.
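
For reference, here is roughly how I build and debug it (the file name, architecture flag, and breakpoint location are placeholders for my actual setup; the kernel is compiled with device debug info):

nvcc -g -G -arch=sm_80 repro.cu -o repro   # file name and arch are placeholders
cuda-gdb ./repro
(cuda-gdb) break repro.cu:<line of the breakpoint inside the kernel>
(cuda-gdb) run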

Any help to work around this issue is appreciated.

Hi @BAdhi
Thank you very much for your report! Could you also share the output of the nvidia-smi command on your machine?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                  Off |
| N/A   34C    P0    87W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                  Off |
| N/A   36C    P0    88W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                  Off |
| N/A   32C    P0    84W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                  Off |
| N/A   31C    P0    86W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Hi @BAdhi
Thank you very much for reporting the issue. We were able to reproduce it locally and are working on a fix. I will update this post when a CUDA version with the fix is released.

Hi @AKravets,

Is this bug fixed in CUDA 12.2?

Hi @BAdhi,
Unfortunately the issue has not been fixed yet.

Hi @BAdhi
This issue should be fixed in the latest CUDA 12.3 release. You would need both the CUDA 12.3 toolkit (https://developer.nvidia.com/cuda-downloads) and the R545 (12.3) GPU driver installed to resolve the crash.
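
As a quick sanity check (a minimal sketch, not part of the fix itself), the driver and runtime versions that the application actually sees can be printed with the runtime API, which reports them as integers such as 12030 for CUDA 12.3:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    // Highest CUDA version supported by the installed GPU driver.
    cudaDriverGetVersion(&driverVersion);
    // Version of the CUDA runtime the application was built against.
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("Driver: %d, Runtime: %d\n", driverVersion, runtimeVersion);
    return 0;
}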

We will close this thread after 2 weeks. If the issue comes up again, please file a new topic and we will do our best to help!

Unfortunately, I'm still seeing cuda-gdb crash in the same scenario.

Here is a sample:

#include <iostream>
#include <cooperative_groups.h>

#define CHECK_CUDA(call)                                                       \
  {                                                                            \
    cudaError_t err = call;                                                    \
    if (cudaSuccess != err) {                                                  \
      fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", __FILE__,  \
              __LINE__, cudaGetErrorString(err));                              \
      exit(EXIT_FAILURE);                                                      \
    }                                                                          \
  }


using namespace cooperative_groups;

__device__ void test(int* a)
{
    grid_group grid = this_grid();
    grid.sync(); // add a breakpoint here
}

__global__ void test2(int *a) {
    test(a);
}


int main()
{
    int device = 0;
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);

    int numThreads = 7;
    int * value = new int[7];
    int * dValue = nullptr;
    CHECK_CUDA(cudaMalloc(&dValue, sizeof(int) * 7));
    for (int i = 0; i < 7; ++i)
        value[i] = i + 10;
    CHECK_CUDA(cudaMemcpy(dValue, value, sizeof(int) * 7 , cudaMemcpyHostToDevice));

    void* args[] = {&dValue};

    CHECK_CUDA(cudaLaunchCooperativeKernel((void*)test2, 1, 1, args));

    CHECK_CUDA(cudaDeviceSynchronize());
}
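
For completeness, the sample assumes the device supports cooperative launches; that can be verified explicitly before the launch (a small sketch that slots into main above, not part of the original repro):

    // Optional pre-launch check (not in the repro): cooperative launches
    // require device support, reported via cudaDevAttrCooperativeLaunch.
    int supportsCoopLaunch = 0;
    CHECK_CUDA(cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, device));
    if (!supportsCoopLaunch) {
        fprintf(stderr, "Device %d does not support cooperative kernel launches.\n", device);
        exit(EXIT_FAILURE);
    }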

nvidia-smi header:

NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3

cuda-gdb output:

warning: Could not find CUDA module for context_id 0x70eaa0 module_id 0x0
cuda-gdb/12/gdb/cuda/cuda-kernel.c:543: internal-error: kernels_start_kernel: Assertion `module' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x632dc7 ???
0xa02714 ???
0xa02a48 ???
0xbb5e71 ???
0x6f9335 ???
0x705b41 ???
0x7067c2 ???
0x6e7877 ???
0x6f9df1 ???
0x5b70dc ???
0x838052 ???
0x9d76da ???
0x7f7b09 ???
0x806ce3 ???
0xbb68fc ???
0xbb6ae6 ???
0x849b94 ???
0x84b454 ???
0x56ba84 ???
0x7fc4d858ccf2 ???
0x571d84 ???
0xffffffffffffffff ???
---------------------
cuda-gdb/12/gdb/cuda/cuda-kernel.c:543: internal-error: kernels_start_kernel: Assertion `module' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
cuda-gdb --version
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.3 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Using python library libpython3.6m.so.1.0
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0

Hi @BAdhi
Sorry about the confusion; you are correct, the issue is still present in the 12.3 release. Thank you for the updated repro case.

I will update this thread as soon as we verify the fix on the scenario in Cuda-gdb crash when trying to debug kernel launched through `cudaLaunchCooperativeKernel` - #9 by BAdhi and the fixed version is released.

Hi @BAdhi,

The reported issue should be resolved in the latest CUDA 12.4 release: CUDA Toolkit 12.4 Downloads | NVIDIA Developer

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.