cuda-gdb internal-error in copy_type on basic Fortran example

Hello,

I’m trying to familiarize myself with cuda-gdb before using it on my large application. I tried it on the first example from the CUDA Fortran Programming Guide, but I keep encountering internal errors.
I’m just getting started with GPU programming, so I could have overlooked something obvious.

Here is the example code I have been trying to run:

module mytests
    contains
    attributes(global)  &
    subroutine test1( a )
    integer, device :: a(*)
    i = threadIdx%x
    a(i) = i
    return
    end subroutine test1
end module mytests

program t1
    use cudafor
    use mytests
    integer, parameter :: n = 100
    integer, allocatable, device :: iarr(:)
    integer h(n)
    istat = cudaSetDevice(0)
    allocate(iarr(n))
    h = 0; iarr = h
    call test1<<<1,n>>> (iarr)
    h = iarr
    print *,&
    "Errors: ", count(h.ne.(/ (i,i=1,n) /))
    deallocate(iarr)
end program t1
! set break point with 
! break count.F90:6 

I compile it with: nvfortran -cuda -g -gpu=debug -o count count.F90

When I use cuda-gdb to stop at a breakpoint inside the kernel, I get an “internal error”. Below is the log of my commands and the errors:

> cuda-gdb ./count
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.2 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
--Type <RET> for more, q to quit, c to continue without paging--
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Using python library libpython3.6m.so
--Type <RET> for more, q to quit, c to continue without paging--
Reading symbols from ./count...
(cuda-gdb) break count.F90:6
Breakpoint 1 at 0x4015ee: file count.F90, line 9.
(cuda-gdb) c
The program is not being run.
(cuda-gdb) run
Starting program: /vast_swbuild/swbuild3/janibal/LAVA_RESEARCH/src/curv/test_mat_mul/count 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x15554947e000 (LWP 88918)]
[Detaching after fork from child process 88919]
[New Thread 0x15554901c000 (LWP 88929)]
[New Thread 0x155548e1b000 (LWP 88930)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Thread 1 "count" hit Breakpoint 1, mytests::test1<<<(1,1,1),(100,1,1)>>> (cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x62d087 ???
0x9e1344 ???
0x9e169c ???
0xb94ac1 ???
0x7a3036 ???
0x7a5967 ???
0x7a6469 ???
0x9f6bdb ???
0x9e56fa ???
0x71a479 ???
0x7303c2 ???
0x730a45 ???
0x78f849 ???
0x97071e ???
0x971239 ???
0x972b32 ???
0x973315 ???
0x7daacb ???
0x65ff9a ???
0x7e9813 ???
0x7dc283 ???
0x7e7030 ???
0xb9554c ???
0xb95736 ???
0x828fe4 ???
0x82a8a4 ???
0x5679e4 ???
0x153fc14b3d84 ???
0x56d8f4 ???
0xffffffffffffffff ???
---------------------
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n

This is a bug, please report it.  For instructions, see:
<https://www.gnu.org/software/gdb/bugs/>.

cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
a=<error reading variable: copy_type: Assertion `type->is_objfile_owned ()' failed.>) at count.F90:6
6           i = threadIdx%x
(cuda-gdb) p iarr
No symbol "iarr" in current context.
(cuda-gdb) n
7           a(i) = i
(cuda-gdb) p a(0)
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x62d087 ???
0x9e1344 ???
0x9e169c ???
0xb94ac1 ???
0x7a3036 ???
0x7a5967 ???
0x7a6469 ???
0x9f6bdb ???
0x9e56fa ???
0x71a479 ???
0x7303c2 ???
0x730a45 ???
0x78f849 ???
0x9e68ef ???
0x772882 ???
0x78937a ???
0x771ca5 ???
0x8a1b0e ???
0x8a248a ???
0x65c653 ???
0x9c4249 ???
0x777bdb ???
0x777efa ???
0x7784e8 ???
0xa1dd74 ???
0x776fe5 ???
0x7783cc ???
0x776dbf ???
0xb9554c ???
0xb9571a ???
0x828fe4 ???
0x82a8a4 ???
0x5679e4 ???
0x153fc14b3d84 ???
0x56d8f4 ???
0xffffffffffffffff ???
---------------------
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n

This is a bug, please report it.  For instructions, see:
<https://www.gnu.org/software/gdb/bugs/>.

cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
copy_type: Assertion `type->is_objfile_owned ()' failed.
(cuda-gdb) l
2           contains
3           attributes(global)  &
4           subroutine test1( a )
5           integer, device :: a(*)
6           i = threadIdx%x
7           a(i) = i
8           return
9           end subroutine test1
10      end module mytests
11
(cuda-gdb) 

I can answer “n” to get through all the error prompts, but if I try to access information about the variables within the kernel, the errors reappear.

I’m using nvhpc 23.7 (the latest we have on our cluster).
The output of nvidia-smi is

Wed Dec 27 12:16:18 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB           Off| 00000000:4C:00.0 Off |                    0 |
| N/A   35C    P0               53W / 400W|      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I’ve tried modifying the compiler flags, adding -gpu=cc80 and removing -gpu=debug, but it had no effect.

Hi, @joanib14

Thanks for reporting the issue to us! From the content you pasted, it looks like you are using the 12.1 driver (530.30.02) with the 12.2 toolkit?

We tried to reproduce this internally and found that it only happens when the driver and toolkit versions are inconsistent. We’ll check internally whether this needs to be fixed.

Would you please update your driver to R535 and try again?

Hi,

Thanks for the quick reply. I tried a node with the updated driver (535.104.12) but still got the same result.

Here is the log of what I tried and the output of nvidia-smi.

> nvfortran -cuda -g -gpu=debug,cc80 -o count count.F90
> cuda-gdb ./count
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.2 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Using python library libpython3.6m.so
Reading symbols from ./count...
(cuda-gdb) break count.F90:6
Breakpoint 1 at 0x4015ee: file count.F90, line 9.
(cuda-gdb) run
Starting program: /vast_swbuild/swbuild3/janibal/LAVA_RESEARCH/src/curv/test_mat_mul/count 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x15554953e000 (LWP 24821)]
[Detaching after fork from child process 24822]
[New Thread 0x155548dfb000 (LWP 24835)]
[New Thread 0x155548bfa000 (LWP 24836)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Thread 1 "count" hit Breakpoint 1, mytests::test1<<<(1,1,1),(100,1,1)>>> (cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x62d087 ???
0x9e1344 ???
0x9e169c ???
0xb94ac1 ???
0x7a3036 ???
0x7a5967 ???
0x7a6469 ???
0x9f6bdb ???
0x9e56fa ???
0x71a479 ???
0x7303c2 ???
0x730a45 ???
0x78f849 ???
0x97071e ???
0x971239 ???
0x972b32 ???
0x973315 ???
0x7daacb ???
0x65ff9a ???
0x7e9813 ???
0x7dc283 ???
0x7e7030 ???
0xb9554c ???
0xb95736 ???
0x828fe4 ???
0x82a8a4 ???
0x5679e4 ???
0x155553ee4d84 ???
0x56d8f4 ???
0xffffffffffffffff ???
---------------------
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) y

This is a bug, please report it.  For instructions, see:
<https://www.gnu.org/software/gdb/bugs/>.

cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
> nvidia-smi
Thu Dec 28 08:03:31 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:09:00.0 Off |                    0 |
| N/A   29C    P0              62W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:C7:00.0 Off |                    0 |
| N/A   35C    P0              72W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Hi, @joanib14

We tried more combinations today:
12.2 toolkit + 12.1 driver (530.30.02): fail
12.2 toolkit + 12.2 driver (535.104.12): fail
12.3 toolkit + 12.1 driver (530.30.02): pass
12.3 toolkit + 12.2 driver (535.104.12): pass
12.3 toolkit + 12.3 driver (545.23.08): pass

So it seems the issue is already fixed in CUDA 12.3, with each of the drivers we tested.
Would you please try our latest CUDA 12.3? Sorry for the inaccurate information in my last reply.
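
For anyone who wants to check which side of that table their cuda-gdb falls on, here is a minimal shell sketch. It assumes that `cuda-gdb --version` prints the same "CUDA Toolkit X.Y release" banner line shown in the logs above; the helper names `toolkit_version` and `toolkit_has_fix` are made up for illustration and are not part of any NVIDIA tool. The 12.3 threshold is taken only from the pass/fail combinations listed above.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a cuda-gdb banner reports a toolkit at or above 12.3,
# the first release that passed in the combinations listed in this thread.

# Read banner text on stdin and print "MAJOR.MINOR" from a line such as
# "CUDA Toolkit 12.2 release" (prints nothing if no such line is found).
toolkit_version() {
    sed -n 's/^CUDA Toolkit \([0-9][0-9]*\.[0-9][0-9]*\) release.*/\1/p' | head -n 1
}

# Succeed (exit status 0) when the version given as $1 is >= 12.3.
toolkit_has_fix() {
    local major=${1%%.*} minor=${1#*.}
    [ "$major" -gt 12 ] || { [ "$major" -eq 12 ] && [ "$minor" -ge 3 ]; }
}

# Only query cuda-gdb if it is actually on PATH.
if command -v cuda-gdb >/dev/null 2>&1; then
    ver=$(cuda-gdb --version | toolkit_version)
    if [ -n "$ver" ] && toolkit_has_fix "$ver"; then
        echo "cuda-gdb toolkit $ver: expected to pass"
    else
        echo "cuda-gdb toolkit ${ver:-unknown}: may hit the copy_type assertion"
    fi
fi
```

On the setup from the original post (banner "CUDA Toolkit 12.2 release"), this should report that the assertion may be hit. The driver version is deliberately not checked, since the table above shows the result tracking the toolkit rather than the driver.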

After the HPC admins added nvhpc 23.11, I am no longer having any issues with my toy problem. Thanks for your help!
