Hello,
I’m trying to familiarize myself with cuda-gdb before using it on my large application. I tried it on the first example from the CUDA Fortran Programming Guide, but I keep encountering internal errors.
I’m just getting started with GPU programming, so I may have overlooked something obvious.
Here is the example code I have been trying to run:
module mytests
contains
  attributes(global) &
  subroutine test1( a )
    integer, device :: a(*)
    i = threadIdx%x
    a(i) = i
    return
  end subroutine test1
end module mytests

program t1
  use cudafor
  use mytests
  integer, parameter :: n = 100
  integer, allocatable, device :: iarr(:)
  integer h(n)
  istat = cudaSetDevice(0)
  allocate(iarr(n))
  h = 0; iarr = h
  call test1<<<1,n>>> (iarr)
  h = iarr
  print *,&
    "Errors: ", count(h.ne.(/ (i,i=1,n) /))
  deallocate(iarr)
end program t1
! set break point with
! break count.F90:6
I compile it with: nvfortran -cuda -g -gpu=debug -o count count.F90
When cuda-gdb stops at a breakpoint inside the kernel, I get an “internal error”. Below is the log of my commands and the resulting errors:
> cuda-gdb ./count
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.2 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
--Type <RET> for more, q to quit, c to continue without paging--
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Using python library libpython3.6m.so
--Type <RET> for more, q to quit, c to continue without paging--
Reading symbols from ./count...
(cuda-gdb) break count.F90:6
Breakpoint 1 at 0x4015ee: file count.F90, line 9.
(cuda-gdb) c
The program is not being run.
(cuda-gdb) run
Starting program: /vast_swbuild/swbuild3/janibal/LAVA_RESEARCH/src/curv/test_mat_mul/count
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x15554947e000 (LWP 88918)]
[Detaching after fork from child process 88919]
[New Thread 0x15554901c000 (LWP 88929)]
[New Thread 0x155548e1b000 (LWP 88930)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
Thread 1 "count" hit Breakpoint 1, mytests::test1<<<(1,1,1),(100,1,1)>>> (cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x62d087 ???
0x9e1344 ???
0x9e169c ???
0xb94ac1 ???
0x7a3036 ???
0x7a5967 ???
0x7a6469 ???
0x9f6bdb ???
0x9e56fa ???
0x71a479 ???
0x7303c2 ???
0x730a45 ???
0x78f849 ???
0x97071e ???
0x971239 ???
0x972b32 ???
0x973315 ???
0x7daacb ???
0x65ff9a ???
0x7e9813 ???
0x7dc283 ???
0x7e7030 ???
0xb9554c ???
0xb95736 ???
0x828fe4 ???
0x82a8a4 ???
0x5679e4 ???
0x153fc14b3d84 ???
0x56d8f4 ???
0xffffffffffffffff ???
---------------------
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n
This is a bug, please report it. For instructions, see:
<https://www.gnu.org/software/gdb/bugs/>.
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
a=<error reading variable: copy_type: Assertion `type->is_objfile_owned ()' failed.>) at count.F90:6
6 i = threadIdx%x
(cuda-gdb) p iarr
No symbol "iarr" in current context.
(cuda-gdb) n
7 a(i) = i
(cuda-gdb) p a(0)
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x62d087 ???
0x9e1344 ???
0x9e169c ???
0xb94ac1 ???
0x7a3036 ???
0x7a5967 ???
0x7a6469 ???
0x9f6bdb ???
0x9e56fa ???
0x71a479 ???
0x7303c2 ???
0x730a45 ???
0x78f849 ???
0x9e68ef ???
0x772882 ???
0x78937a ???
0x771ca5 ???
0x8a1b0e ???
0x8a248a ???
0x65c653 ???
0x9c4249 ???
0x777bdb ???
0x777efa ???
0x7784e8 ???
0xa1dd74 ???
0x776fe5 ???
0x7783cc ???
0x776dbf ???
0xb9554c ???
0xb9571a ???
0x828fe4 ???
0x82a8a4 ???
0x5679e4 ???
0x153fc14b3d84 ???
0x56d8f4 ???
0xffffffffffffffff ???
---------------------
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n
This is a bug, please report it. For instructions, see:
<https://www.gnu.org/software/gdb/bugs/>.
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
copy_type: Assertion `type->is_objfile_owned ()' failed.
(cuda-gdb) l
2 contains
3 attributes(global) &
4 subroutine test1( a )
5 integer, device :: a(*)
6 i = threadIdx%x
7 a(i) = i
8 return
9 end subroutine test1
10 end module mytests
11
(cuda-gdb)
I can answer “n” to every prompt and keep going, but as soon as I try to inspect a variable inside the kernel, the errors reappear.
I’m using NVHPC 23.7 (the latest version available on our cluster).
The output of nvidia-smi is:
Wed Dec 27 12:16:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB           Off| 00000000:4C:00.0 Off |                    0 |
| N/A   35C    P0               53W / 400W|      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
I’ve tried modifying the compiler flags, adding -gpu=cc80 and removing -gpu=debug, but it had no effect.