Cuda-gdb crash when trying to debug kernel launched through `cudaLaunchCooperativeKernel`

I'm trying to debug a kernel that is launched through `cudaLaunchCooperativeKernel`, since the kernel uses grid syncs. When I run it inside cuda-gdb after setting a breakpoint inside the kernel, cuda-gdb crashes with the following stack trace:

cuda-gdb/12/gdb/cuda/cuda-kernel.c:545: internal-error: kernels_start_kernel: Assertion `module' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x62e137 gdb_internal_backtrace()
0x9e2a04 internal_vproblem(internal_problem*, char const*, int, char const*, __va_list_tag*)
0x9e2d5c internal_verror(char const*, int, char const*, __va_list_tag*)
0xb96181 internal_error(char const*, int, char const*, ...)
0x6e3159 kernels_start_kernel(unsigned int, unsigned long, unsigned long, unsigned long, unsigned long, CuDim3, CuDim3, CUDBGKernelType, unsigned long, CUDBGKernelOrigin, bool, CuDim3)
0x6ed309 warp_get_kernel(unsigned int, unsigned int, unsigned int)
0x6e143c cuda_iterator_step(cuda_iterator_t*)
0x6e1c4e cuda_iterator_end(cuda_iterator_t*)
0x6f2ca2 cuda_breakpoint_hit_p(cuda_coords_t&)
0x5b3d57 cuda_nat_linux<amd64_linux_nat_target>::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>)
0x817002 thread_db_target::wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>)
0x9b7b5a target_wait(ptid_t, target_waitstatus*, enum_flags<target_wait_flag>)
0x7d76b9 do_target_wait_1(inferior*, ptid_t, target_waitstatus*, enum_flags<target_wait_flag>)
0x7e66c3 fetch_inferior_event()
0xb96c0c gdb_wait_for_event(int)
0xb96df6 gdb_do_one_event()
0x828b24 captured_command_loop()
0x82a3d4 gdb_main(captured_main_args*)
0x5689a4 main
---------------------
cuda-gdb/12/gdb/cuda/cuda-kernel.c:545: internal-error: kernels_start_kernel: Assertion `module' failed.

I tried CUDA 11.7, CUDA 12.0, and CUDA 12.1 (both nvcc and cuda-gdb), but the result is the same.
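
For reference, here is roughly how I build and debug it (the file name, architecture flag, and breakpoint location are placeholders for my actual setup; the kernel is compiled with device debug info):

nvcc -g -G -arch=sm_80 repro.cu -o repro   # file name and arch are placeholders
cuda-gdb ./repro
(cuda-gdb) break repro.cu:<line of the breakpoint inside the kernel>
(cuda-gdb) run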

Any help to work around this issue is appreciated.

Hi @BAdhi
Thank you very much for your report! Could you also share the output of the nvidia-smi command on your machine?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                  Off |
| N/A   34C    P0    87W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                  Off |
| N/A   36C    P0    88W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                  Off |
| N/A   32C    P0    84W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                  Off |
| N/A   31C    P0    86W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Hi @BAdhi
Thank you very much for reporting the issue. We were able to reproduce it locally and are working on a fix. I will update this post when a CUDA version with the fix is released.

Hi @AKravets,

Is this bug fixed in CUDA 12.2?

Hi @BAdhi,
Unfortunately the issue has not been fixed yet.

Hi @BAdhi
This issue should be fixed in the latest CUDA 12.3 release. You would need both the CUDA 12.3 toolkit (https://developer.nvidia.com/cuda-downloads) and the R545 (12.3) GPU driver installed to resolve the crash.
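
As a quick sanity check (a minimal sketch, not part of the fix itself), the driver and runtime versions that the application actually sees can be printed with the runtime API, which reports them as integers such as 12030 for CUDA 12.3:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    // Highest CUDA version supported by the installed GPU driver.
    cudaDriverGetVersion(&driverVersion);
    // Version of the CUDA runtime the application was built against.
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("Driver: %d, Runtime: %d\n", driverVersion, runtimeVersion);
    return 0;
}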

We will close this thread after 2 weeks. If the issue comes up again, please file a new topic and we will do our best to help!

Unfortunately, I'm still seeing cuda-gdb crash in the same scenario.

Here is a sample:

#include <iostream>
#include <cooperative_groups.h>

#define CHECK_CUDA(call)                                                       \
  {                                                                            \
    cudaError_t err = call;                                                    \
    if (cudaSuccess != err) {                                                  \
      fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", __FILE__,  \
              __LINE__, cudaGetErrorString(err));                              \
      exit(EXIT_FAILURE);                                                      \
    }                                                                          \
  }


using namespace cooperative_groups;

__device__ void test(int* a)
{
    grid_group grid = this_grid();
    grid.sync(); // add a breakpoint here
}

__global__ void test2(int *a) {
    test(a);
}


int main()
{
    int device = 0;
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);

    int numThreads = 7;
    int * value = new int[7];
    int * dValue = nullptr;
    CHECK_CUDA(cudaMalloc(&dValue, sizeof(int) * 7));
    for (int i = 0; i < 7; ++i)
        value[i] = i + 10;
    CHECK_CUDA(cudaMemcpy(dValue, value, sizeof(int) * 7 , cudaMemcpyHostToDevice));

    void* args[] = {&dValue};

    CHECK_CUDA(cudaLaunchCooperativeKernel((void*)test2, 1, 1, args));

    CHECK_CUDA(cudaDeviceSynchronize());
}
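
For completeness, the sample assumes the device supports cooperative launches; that can be verified explicitly before the launch (a small sketch that slots into main above, not part of the original repro):

    // Optional pre-launch check (not in the repro): cooperative launches
    // require device support, reported via cudaDevAttrCooperativeLaunch.
    int supportsCoopLaunch = 0;
    CHECK_CUDA(cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, device));
    if (!supportsCoopLaunch) {
        fprintf(stderr, "Device %d does not support cooperative kernel launches.\n", device);
        exit(EXIT_FAILURE);
    }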

nvidia-smi header:

NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3

cuda-gdb output:

warning: Could not find CUDA module for context_id 0x70eaa0 module_id 0x0
cuda-gdb/12/gdb/cuda/cuda-kernel.c:543: internal-error: kernels_start_kernel: Assertion `module' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x632dc7 ???
0xa02714 ???
0xa02a48 ???
0xbb5e71 ???
0x6f9335 ???
0x705b41 ???
0x7067c2 ???
0x6e7877 ???
0x6f9df1 ???
0x5b70dc ???
0x838052 ???
0x9d76da ???
0x7f7b09 ???
0x806ce3 ???
0xbb68fc ???
0xbb6ae6 ???
0x849b94 ???
0x84b454 ???
0x56ba84 ???
0x7fc4d858ccf2 ???
0x571d84 ???
0xffffffffffffffff ???
---------------------
cuda-gdb/12/gdb/cuda/cuda-kernel.c:543: internal-error: kernels_start_kernel: Assertion `module' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
cuda-gdb --version
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.3 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Using python library libpython3.6m.so.1.0
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0

Hi @BAdhi
Sorry about the confusion; you are correct, the issue is still present in the 12.3 release. Thank you for the updated repro case.

I will update this thread as soon as we verify the fix on the scenario in Cuda-gdb crash when trying to debug kernel launched through `cudaLaunchCooperativeKernel` - #9 by BAdhi and the fixed version is released.

Hi @BAdhi,

The reported issue should be resolved in the latest CUDA 12.4 release: CUDA Toolkit 12.4 Downloads | NVIDIA Developer

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.