cuda-gdb deadlock in __pgi_managed_new

I am unable to debug a C++ code compiled with nvc++ for GPU using std::for_each algorithms:
There is a deadlock in memory allocation. I have tried to disable unnecessary use of CUDA Unified Memory, but some internal uses of operator new still call __pgi_managed_new in debug mode.
This problem appeared with the update of the GPU driver, which now uses CUDA 12.2. I did not have these problems when the driver was under CUDA 11, but the old HPC SDK 22 is no longer supported on this GPU. I have tried HPC SDK 23.7 and 23.9, as well as the standalone cuda-gdb 12.3, without success.
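For context, the offload pattern involved looks roughly like the minimal sketch below (hypothetical code: ComputeNodeAllPhysics and the node count only mirror the symbols visible in the backtraces, it is not the actual solver). Compiled with nvc++ -stdpar=gpu -g, the std::for_each call is offloaded to the GPU and, with managed memory enabled, heap allocations are routed through __pgi_managed_new:

// Minimal, hypothetical sketch of the offload pattern (not the actual solver code).
// Build: nvc++ -stdpar=gpu -g sketch.cpp -o sketch
#include <algorithm>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

struct ComputeNodeAllPhysics {
    double* field;
    // Called once per node; runs on the GPU under -stdpar=gpu.
    void operator()(std::size_t inode) const { field[inode] *= 2.0; }
};

int main() {
    // With -stdpar=gpu, these vector allocations go through CUDA Unified Memory.
    std::vector<double> field(9556, 1.0);
    std::vector<std::size_t> nodes(field.size());
    std::iota(nodes.begin(), nodes.end(), std::size_t{0});

    std::for_each(std::execution::par_unseq, nodes.begin(), nodes.end(),
                  ComputeNodeAllPhysics{field.data()});
    return 0;
}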
Here is a log

Using python library libpython3.6m.so
Reading symbols from build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver…
(cuda-gdb)
(cuda-gdb) run
Starting program: /odile/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
^C
Program received signal SIGINT, Interrupt.
0x00007ffff48d7e0f in nvompAcquireLock (lock=lock@entry=0x7ffff72e2bb8 <__pgi_uacc_data+72>) at nvomp_lock.h:38
38 nvomp_lock.h: Aucun fichier ou dossier de ce type. (No such file or directory.)
(cuda-gdb) bt
#0 0x00007ffff48d7e0f in nvompAcquireLock (lock=lock@entry=0x7ffff72e2bb8 <__pgi_uacc_data+72>) at nvomp_lock.h:38
#1 __nvomp_acquire_lock (lock=lock@entry=0x7ffff72e2bb8 <__pgi_uacc_data+72>) at nvomp_lock.c:17
#2 0x00007ffff70b9a9d in __pgi_uacc_enumerate () at …/…/src/init.c:579
#3 __pgi_uacc_initialize () at …/…/src/init.c:693
#4 0x00007ffff693116c in do_managed_new (n=23, func_name=0x7ffff6932b58 “new”) at …/…/src/cuda_managed_new.cpp:59
#5 0x00007ffff6930e44 in __pgi_managed_new (n=23) at …/…/src/cuda_managed_new.cpp:153
#6 0x0000000000d3f549 in allocate (this=0x7fffffff8b28, __n=23, _T567_39031=0x0)
at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/ext/new_allocator.h:111
#7 0x0000000000d10ee9 in allocate (__a=…, __n=23) at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/bits/alloc_traits.h:436
#8 0x0000000000d2bb6e in _M_create (this=0x7fffffff8b28, __capacity=0x7fffffff8a88, __old_capacity=0)
at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/bits/basic_string.tcc:153
#9 0x0000000000d2d704 in _M_construct (this=0x7fffffff8b28, __beg=0x5353140 “SystemCategories.proto”, __end=0x5353156 “”, _T13_38380=…)
at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/bits/basic_string.tcc:219
#10 0x00007ffff0fb6a90 in ?? () from /lib64/libcudadebugger.so.1
#11 0x00007ffff0fb261d in ?? () from /lib64/libcudadebugger.so.1
#12 0x00007ffff0f60160 in ?? () from /lib64/libcudadebugger.so.1
#13 0x00007ffff0fbc46f in ?? () from /lib64/libcudadebugger.so.1
#14 0x00007ffff0fbc45a in ?? () from /lib64/libcudadebugger.so.1
#15 0x00007ffff7de3e0a in call_init (l=, argc=argc@entry=5, argv=argv@entry=0x7fffffffcbd8, env=env@entry=0x5169f60) at dl-init.c:72
#16 0x00007ffff7de3f0a in call_init (env=0x5169f60, argv=0x7fffffffcbd8, argc=5, l=) at dl-init.c:118
#17 _dl_init (main_map=0x5366290, argc=5, argv=0x7fffffffcbd8, env=0x5169f60) at dl-init.c:119
#18 0x00007ffff3b9e1dc in _dl_catch_exception () from /lib64/libc.so.6
#19 0x00007ffff7de7b2e in dl_open_worker (a=0x7fffffff90e0) at dl-open.c:819
#20 dl_open_worker (a=0x7fffffff90e0) at dl-open.c:782
#21 0x00007ffff3b9e184 in _dl_catch_exception () from /lib64/libc.so.6
#22 0x00007ffff7de7d11 in _dl_open (file=0x7ffff1a4c7c0 “libcudadebugger.so.1”, mode=, caller_dlopen=0x7ffff1870e37, nsid=-2, argc=5,
argv=, env=0x5169f60) at dl-open.c:900
#23 0x00007ffff7bcc1ea in dlopen_doit () from /lib64/libdl.so.2
#24 0x00007ffff3b9e184 in _dl_catch_exception () from /lib64/libc.so.6
#25 0x00007ffff3b9e243 in _dl_catch_error () from /lib64/libc.so.6
#26 0x00007ffff7bcc969 in _dlerror_run () from /lib64/libdl.so.2
#27 0x00007ffff7bcc28a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#28 0x00007ffff1870e37 in ?? () from /lib64/libcuda.so
#29 0x00007ffff15a4c7c in ?? () from /lib64/libcuda.so
--Type <RET> for more, q to quit, c to continue without paging--
#30 0x00007ffff1649a86 in ?? () from /lib64/libcuda.so
#31 0x00007ffff16c0df8 in ?? () from /lib64/libcuda.so
#32 0x00007ffff6b56b98 in __pgi_uacc_cuda_init () at …/…/src/cuda_init.c:464
#33 0x00007ffff70ba1ff in __pgi_uacc_enumerate () at …/…/src/init.c:607
#34 __pgi_uacc_initialize () at …/…/src/init.c:693
#35 0x00007ffff693116c in do_managed_new (n=60, func_name=0x7ffff6932b58 “new”) at …/…/src/cuda_managed_new.cpp:59
#36 0x00007ffff6930e44 in __pgi_managed_new (n=60) at …/…/src/cuda_managed_new.cpp:153
#37 0x0000000000d3f549 in allocate (this=0x7fffffff98a0, __n=60, _T567_39031=0x0)
at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/ext/new_allocator.h:111
#38 0x0000000000d10ee9 in allocate (__a=…, __n=60) at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/bits/alloc_traits.h:436
#39 0x0000000000d2bb6e in _M_create (this=0x7fffffff98a0, __capacity=0x7fffffff9878, __old_capacity=0)
at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/bits/basic_string.tcc:153
#40 0x0000000003a8839f in void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*>(char const*, char const*, std::forward_iterator_tag) [clone .constprop.140] ()
#41 0x0000000003a905ce in vtksys::SystemTools::CollapseFullPath(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*) ()
#42 0x0000000003a90756 in vtksys::SystemTools::AddKeepPath(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#43 0x0000000003a90824 in vtksys::SystemTools::ClassInitialize() ()
#44 0x0000000000bc8b22 in _GLOBAL__sub_I_vtkGenericEnSightReader.cxx ()
#45 0x0000000003ccabfd in __libc_csu_init ()
#46 0x00007ffff3a6ec7e in __libc_start_main () from /lib64/libc.so.6

(cuda-gdb) q
A debugging session is active.

    Inferior 1 [process 584806] will be killed.

Quit anyway? (y or n) y
[rkuate@odile-gpu01 developLast]$ nvidia-smi
Tue Nov 7 15:28:51 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:17:00.0 Off |                    0 |
| N/A   32C    P0              43W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+

Note
cuda-gdb details:
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.2 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1

Hi @raphael.kuate
Thank you for the report! To help us identify the issue, please re-run the debugging scenario with additional logging enabled:

  • Set the NVLOG_CONFIG_FILE environment variable to point to the attached nvlog.config file, e.g. NVLOG_CONFIG_FILE=${HOME}/nvlog.config
    nvlog.config (539 Bytes)

  • Run the debugging session.

  • You should see a /tmp/debugger.log file created. Could you share it with us?

Thanks for your response.
I have tried, but no /tmp/debugger.log file was created; the only file updated in /tmp is /tmp/cuda-dbg/cuda-gdb.lock:
[rkuate@odile ~]$ ll -lt /tmp/|head
total 207300
drwxrwxrwx 2 rkuate prolb 27 7 nov. 19:30 cuda-dbg

Hi @raphael.kuate
Could you also try the following config: nvlog.stdout.config (549 Bytes)

Same as with the previous one, please:

  • Save it on your machine.
  • Set the NVLOG_CONFIG_FILE environment variable to point to this file:
export NVLOG_CONFIG_FILE=/path/to/nvlog.stdout.config 
  • Re-run the debugging session from the same terminal. You might also see additional log messages printed to the terminal. Can you share the debugger output with us as well?
  • Check for the /tmp/debugger.log file.

The problem remains the same.
Here is the terminal output:
[rkuate@odile-gpu01 developLast]$ echo $NVLOG_CONFIG_FILE
/odile/home/rkuate/nvlog.stdout.config
[rkuate@odile-gpu01 developLast]$ head $NVLOG_CONFIG_FILE
$ /tmp/debugger.log
UseStdout
ForceFlush
Format $time|$sev|$level|$tid|${name:-20}|${sfunc:-60}- $text
NeverIntrude

All errors and warnings are enabled.

  • 20iI 100wWefEF global

@ 50iI dbg
[rkuate@odile-gpu01 developLast]$ ll -lt /tmp|head
total 44
drwxrwxrwx 2 rkuate prolb 6 8 nov. 11:02 cuda-dbg
-rw------- 1 rkuate prolb 0 8 nov. 10:48 dbus-session-monitor.DkDtMd
drwx------ 3 rkuate prolb 22 7 nov. 20:38 595136
drwx------ 8 rkuate prolb 91 7 nov. 20:29 ompi.odile-gpu01.1116
drwx------ 3 rkuate prolb 22 7 nov. 20:29 594778
drwx------ 3 rkuate prolb 22 7 nov. 20:28 594649
drwx------ 3 rkuate prolb 22 7 nov. 20:27 594506
drwx------ 3 rkuate prolb 22 7 nov. 20:21 594317
drwx------ 3 rkuate prolb 22 7 nov. 20:16 594040
[rkuate@odile-gpu01 developLast]$ cuda-gdb --args build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.2 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type “show copying” and “show warranty” for details.
This GDB was configured as “x86_64-pc-linux-gnu”.
Type “show configuration” for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type “help”.
Type “apropos word” to search for commands related to “word”…
Using python library libpython3.6m.so.1.0
Reading symbols from build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver…
(cuda-gdb) run
Starting program: /local_scratch/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
^C
Program received signal SIGINT, Interrupt.
0x00007ffff4a03e0f in nvompAcquireLock (lock=lock@entry=0x7ffff740ebb8 <__pgi_uacc_data+72>) at nvomp_lock.h:38
38 nvomp_lock.h: Aucun fichier ou dossier de ce type. (No such file or directory.)
(cuda-gdb) q
A debugging session is active.

    Inferior 1 [process 607745] will be killed.

Quit anyway? (y or n) y
[rkuate@odile-gpu01 developLast]$ ll -lt /tmp|head
total 44
drwxrwxrwx 2 rkuate prolb 6 8 nov. 11:07 cuda-dbg
-rw------- 1 rkuate prolb 0 8 nov. 10:48 dbus-session-monitor.DkDtMd
drwx------ 3 rkuate prolb 22 7 nov. 20:38 595136
drwx------ 8 rkuate prolb 91 7 nov. 20:29 ompi.odile-gpu01.1116
drwx------ 3 rkuate prolb 22 7 nov. 20:29 594778
drwx------ 3 rkuate prolb 22 7 nov. 20:28 594649
drwx------ 3 rkuate prolb 22 7 nov. 20:27 594506
drwx------ 3 rkuate prolb 22 7 nov. 20:21 594317
drwx------ 3 rkuate prolb 22 7 nov. 20:16 594040
[rkuate@odile-gpu01 developLast]$

However, it seems the kernels could not be loaded: when I run the code with gdb and break on std::terminate, the backtrace does not appear to enter the kernel.
[rkuate@odile-gpu01 developLast]$ gdb --args build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-18.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type “show copying” and “show warranty” for details.
This GDB was configured as “x86_64-redhat-linux-gnu”.
Type “show configuration” for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type “help”.
Type “apropos word” to search for commands related to “word”…
Reading symbols from build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver…done.
(gdb) break std::terminate
Breakpoint 1 at 0x450f10
(gdb) run
Starting program: /local_scratch/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
Missing separate debuginfo for /odile/home/rkuate/libraries/system/usr/lib64/libatomic.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/8a/f1b6b137f44903c2abc2886f744789c3d1ab7e.debug
[New Thread 0x7fffebc69000 (LWP 607856)]
[New Thread 0x7fffeb468000 (LWP 607857)]
LBsolver 3.1.0 executed on: Wed Nov 08 11:13:55 2023 built on : Nov 7 13:27:27 2023

LBsolver build options | status | effect of build option

Charact. time of the sigma ramp [time step] = 8000
################# current time is 0/60 ################

Thread 1 “lbsolver” hit Breakpoint 1, 0x0000000000450f10 in std::terminate()@plt ()
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-225.el8.x86_64 libgcc-8.5.0-18.el8.x86_64 libstdc++-8.5.0-18.el8.x86_64 nvidia-driver-cuda-libs-535.104.05-1.el8.x86_64 zlib-1.2.11-21.el8_7.x86_64
(gdb) bt
#0 0x0000000000450f10 in std::terminate()@plt ()
#1 0x0000000000df9d80 in std::for_each (_T250_144510=0x49dff2b std::execution::par_unseq, __first=…, __last=…, __f=(unknown: 2196702720))
at /odile/home/rkuate/libraries/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/include/nvhpc/algorithm_execution.hpp:2344
#2 0x0000000000c434e1 in D3Q19HRRcore::DriverFunction (this=0x575d420, inodeBegin=1, inodeEnd=9556, tribal_status=TRIBAL_MONO, functions_to_launch=0x7fffffffba80, i_single_datas=0x7fffffffba68)
at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/hardlibrary/D3Q19HRRcore/D3Q19HRRcore_StdParallelTools.tpp:97
#3 0x0000000002a217da in Solver::delayExecuteRange (this=0x7fff820174c0, time_left=9.99999984e+17, delay=0, ielt_execute=8, range=…) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/solver/SolverDelay.cpp:467
#4 0x0000000002a22cdd in Solver::delayExecuteDelay (this=0x7fff820174c0, delay=0, time_left=@0x7fffffffbe68: 9.99999984e+17) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/solver/SolverDelay.cpp:709
#5 0x0000000002a22e10 in Solver::delayExecuteUntil (this=0x7fff820174c0, ielt_sync=6, time_limit=9.99999984e+17, max_execute_delay=7) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/solver/SolverDelay.cpp:749
#6 0x0000000002aeb929 in Solver::loop (this=0x7fff820174c0) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/solver/SolverLoop.cpp:212
#7 0x00000000024a1c8c in sequenceSolverLoop (process=0x7fff82013100) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/project/SolveProject.cpp:563
#8 0x0000000001f481e6 in ParallelEmulated::executeOnce (this=0x7fff82003f00, sequence=0x24a1c40 <sequenceSolverLoop(ProcessSolve*)>) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/parallel/ParallelEmulated.cpp:63
#9 0x00000000024a3578 in solveProject (parallel=0x7fff82003f00) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/project/SolveProject.cpp:923
#10 0x000000000190f5c8 in run (parallel=0x7fff82003f00) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/labs/run_LaBS.cpp:624
#11 0x000000000190e4e5 in tryLbsolver (argc=5, argv=0x7fffffffd5e8) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/labs/run_LaBS.cpp:503
#12 0x000000000190f6a9 in runLbsolver (argc=5, argv=0x7fffffffd5e8) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/labs/run_LaBS.cpp:644
#13 0x0000000001c4af3d in main (argc=5, argv=0x7fffffffd5e8) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/mains/LaBS.cpp:18
(gdb)

In GUI mode, the only message I can read is cudaErrorLaunchFailure (179).

Sincerely,

Hi @raphael.kuate
Having operator new overridden can cause such issues (the GPU debugger back-end calls this operator).

One possible solution might be to switch to the legacy back-end (which only uses malloc). Please try running the debugging session with the CUDBG_USE_LEGACY_DEBUGGER=1 environment variable set.

Hi,

Your last suggestion gives the same result.

In fact, overriding operator new has no effect, since these calls are internal to some std:: code. What we do instead is use another allocator where possible, so that CUDA Unified Memory is used only for data shared between host and device. The reason is that we had issues using CUDA Unified Memory for the whole code (we had several exchanges with your organization about it… with Gonzalo Brito, Niveditha Krishnamoorthy …), so the solution was to use another allocator for std containers. However, we had no problem launching cuda-gdb with the previous version of the GPU driver, which targeted CUDA 11. I therefore tried to avoid operator new wherever the current cuda-gdb fails to launch the code, but as you can see, that is not always possible for these internal std:: calls to operator new.
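For reference, the host-only allocator approach looks roughly like the sketch below (hypothetical code, not the project's actual allocator). It keeps std container storage out of CUDA Unified Memory, but, as noted, it cannot cover operator new calls made inside third-party static initializers such as the VTK one in the backtrace:

// Hypothetical sketch of a host-only allocator for std containers, so that only
// data explicitly shared between host and device goes through CUDA Unified Memory.
#include <cstdlib>
#include <new>
#include <string>
#include <vector>

template <class T>
struct HostAllocator {
    using value_type = T;
    HostAllocator() = default;
    template <class U> HostAllocator(const HostAllocator<U>&) {}

    T* allocate(std::size_t n) {
        // Plain malloc keeps this storage in ordinary host memory, bypassing the
        // managed-memory operator new installed by -stdpar / -gpu=managed.
        if (void* p = std::malloc(n * sizeof(T))) return static_cast<T*>(p);
        throw std::bad_alloc();
    }
    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

template <class T, class U>
bool operator==(const HostAllocator<T>&, const HostAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const HostAllocator<T>&, const HostAllocator<U>&) { return false; }

// Host-only container aliases used where the data never needs to be visible on the device.
template <class T>
using host_vector = std::vector<T, HostAllocator<T>>;
using host_string = std::basic_string<char, std::char_traits<char>, HostAllocator<char>>;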

On the other hand, the main problem I have (if it can be solved without using cuda-gdb) is that the GPU kernels do not seem to load; it may be related to the size of the GPU kernels being loaded. I have tried several options; printing PTX info did not show code exceeding a few KB. I also have the following warning, for which I found no appropriate solution (tracking recursion in the code, a custom stack limit, using -Mchkstk with nvc++ …):

<<

nvlink warning : Stack size for entry function ‘_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_10for_each_fI23ContiguousRangeIteratorNS_6detail16wrapped_functionI21ComputeNodeAllPhysicsILi0ELi0EEvEEEExEESC_xEEvT0_T1_’ cannot be statically determined

So is there a way of knowing exactly whether the GPU kernel limits are exceeded?

The profiler nvprof gave me this summary

<<

==3739841== Profiling application: build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases/

==3739841== Warning: 2 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.

However, the suggested --device-buffer-size option did not improve the profiler outcome.

Sincerely,

Raphaël.

Hi @raphael.kuate

In fact, overriding operator new has no effect, since these calls are internal to some std:: code.

In the first message (in the backtrace), the memory allocation calls from libcudadebugger.so (the GPU debugging library, which is loaded into the debugged application) are routed to __pgi_managed_new, which breaks the debugger library's initialization.

However, we had no problem launching cuda-gdb with the previous version of the GPU driver, which targeted CUDA 11.

The debugger back-end changed in CUDA 11.8. The new back-end uses C++ memory allocation, while the old one (pre-11.8) used malloc.

Your last suggestion gives the same result.

Can you share the backtrace of the deadlock (as in the first message) with the CUDBG_USE_LEGACY_DEBUGGER=1 environment variable set? This variable should force the old back-end (which doesn't use C++ memory allocations).

On the other hand, the main problem I have (if it can be solved without using cuda-gdb) is that the GPU kernels do not seem to load; it may be related to the size of the GPU kernels being loaded.

Can you provide more details here - do you see any error messages printed (or CUDA calls returning error codes)?

We can also try reproducing this issue on our end - would it be possible for you to share the repro steps with us?

Hi,
I'm sorry, but I had not checked that the variable was correctly exported, or something went wrong with my terminal session in the last tests. So you are right, the debugger now initializes; I was just saying that operator new is not overloaded.
The problem now is about stack limits. Here are the errors.
Debug version
CUDA Exception: Lane User Stack Overflow

Thread 1 “lbsolver” received signal CUDA_EXCEPTION_2, Lane User Stack Overflow.
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 1, lane 0]
0x00000000060d5bd0 in thrust::cuda_cub::for_each_f<ContiguousRangeIterator, thrust::detail::wrapped_function<ComputeNodeAllPhysics<0, 0>, void> >::operator() (this=, idx=) at /odile/home/rkuate/libraries/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/include-stdpar/thrust/system/cuda/detail/for_each.h:57
57 {

Release version

CUDA Exception: Lane User Stack Overflow

Thread 1 “lbsolver” received signal CUDA_EXCEPTION_2, Lane User Stack Overflow.
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (96,0,0), device 0, sm 0, warp 0, lane 0]
0x000000000468e170 in ComputeNodeAllPhysics<0, 0>::operator() () at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/hardlibrary/StdParallelTools.hpp:236
236 inline void operator()(size_t inode) {

The difference is that the kernel entry point is reported more precisely in the release version.

Sincerely

This error indicates that your kernel threads are exceeding the stack limit (see the “Introduction” section of the CUDA-GDB 12.3 documentation).

You can try using cudaDeviceSetLimit to increase the GPU stack size.
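For example, a minimal sketch from host code (assuming the CUDA runtime API is directly callable; the 32 KiB value is only an illustration, not a recommendation):

// Sketch: query and raise the per-thread GPU stack size before the first
// offloaded region. Error handling is kept minimal for brevity.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stack_size = 0;
    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);
    std::printf("default GPU stack size per thread: %zu bytes\n", stack_size);

    // Hypothetical value; tune it to what your kernels actually need.
    cudaDeviceSetLimit(cudaLimitStackSize, 32 * 1024);

    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);
    std::printf("new GPU stack size per thread: %zu bytes\n", stack_size);
    return 0;
}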

Hi,
Thanks a lot. The NVCOMPILER_ACC_CUDA_STACKSIZE environment variable works, since no CUDA code is written, only standard C++ with stdpar for the GPU. I will try later to mix CUDA with standard C++ and stdpar algorithms for the GPU.
Sincerely.

Hi!

Glad I was able to help! Is there anything else we can help with regarding this question?

Hi,
Once more, thanks a lot. I'll be back if further questions occur. But at this point, my development continues.

Thank you for the confirmation! Glad we were able to help!
