cuda-gdb deadlock in __pgi_managed_new

I am unable to debug a C++ code compiled with nvc++ for GPU using std::for_each algorithms:
There is a deadlock in memory allocation. I have tried to disable unnecessary use of CUDA Unified Memory, but some internal uses of operator new still call __pgi_managed_new in debug mode.
This problem appeared with the update of the GPU driver, which now uses CUDA 12.2. I did not have these problems when the driver was under CUDA 11, but the old HPC SDK 22 is no longer supported on this GPU. I have tried HPC SDK 23.7 and 23.9, as well as the standalone cuda-gdb 12.3, without success.
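For context, the offload pattern involved looks roughly like the minimal sketch below (hypothetical code: ComputeNodeAllPhysics and the node count only mirror the symbols visible in the backtraces, it is not the actual solver). Compiled with nvc++ -stdpar=gpu -g, the std::for_each call is offloaded to the GPU and, with managed memory enabled, heap allocations are routed through __pgi_managed_new:

// Minimal, hypothetical sketch of the offload pattern (not the actual solver code).
// Build: nvc++ -stdpar=gpu -g sketch.cpp -o sketch
#include <algorithm>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

struct ComputeNodeAllPhysics {
    double* field;
    // Called once per node; runs on the GPU under -stdpar=gpu.
    void operator()(std::size_t inode) const { field[inode] *= 2.0; }
};

int main() {
    // With -stdpar=gpu, these vector allocations go through CUDA Unified Memory.
    std::vector<double> field(9556, 1.0);
    std::vector<std::size_t> nodes(field.size());
    std::iota(nodes.begin(), nodes.end(), std::size_t{0});

    std::for_each(std::execution::par_unseq, nodes.begin(), nodes.end(),
                  ComputeNodeAllPhysics{field.data()});
    return 0;
}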
Here is a log

Using python library libpython3.6m.so
Reading symbols from build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver…
(cuda-gdb)
(cuda-gdb) run
Starting program: /odile/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
^C
Program received signal SIGINT, Interrupt.
0x00007ffff48d7e0f in nvompAcquireLock (lock=lock@entry=0x7ffff72e2bb8 <__pgi_uacc_data+72>) at nvomp_lock.h:38
38 nvomp_lock.h: Aucun fichier ou dossier de ce type. (No such file or directory.)
(cuda-gdb) bt
#0 0x00007ffff48d7e0f in nvompAcquireLock (lock=lock@entry=0x7ffff72e2bb8 <__pgi_uacc_data+72>) at nvomp_lock.h:38
#1 __nvomp_acquire_lock (lock=lock@entry=0x7ffff72e2bb8 <__pgi_uacc_data+72>) at nvomp_lock.c:17
#2 0x00007ffff70b9a9d in __pgi_uacc_enumerate () at …/…/src/init.c:579
#3 __pgi_uacc_initialize () at …/…/src/init.c:693
#4 0x00007ffff693116c in do_managed_new (n=23, func_name=0x7ffff6932b58 “new”) at …/…/src/cuda_managed_new.cpp:59
#5 0x00007ffff6930e44 in __pgi_managed_new (n=23) at …/…/src/cuda_managed_new.cpp:153
#6 0x0000000000d3f549 in allocate (this=0x7fffffff8b28, __n=23, _T567_39031=0x0)
at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/ext/new_allocator.h:111
#7 0x0000000000d10ee9 in allocate (__a=…, __n=23) at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/bits/alloc_traits.h:436
#8 0x0000000000d2bb6e in _M_create (this=0x7fffffff8b28, __capacity=0x7fffffff8a88, __old_capacity=0)
at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/bits/basic_string.tcc:153
#9 0x0000000000d2d704 in _M_construct (this=0x7fffffff8b28, __beg=0x5353140 “SystemCategories.proto”, __end=0x5353156 “”, _T13_38380=…)
at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/bits/basic_string.tcc:219
#10 0x00007ffff0fb6a90 in ?? () from /lib64/libcudadebugger.so.1
#11 0x00007ffff0fb261d in ?? () from /lib64/libcudadebugger.so.1
#12 0x00007ffff0f60160 in ?? () from /lib64/libcudadebugger.so.1
#13 0x00007ffff0fbc46f in ?? () from /lib64/libcudadebugger.so.1
#14 0x00007ffff0fbc45a in ?? () from /lib64/libcudadebugger.so.1
#15 0x00007ffff7de3e0a in call_init (l=, argc=argc@entry=5, argv=argv@entry=0x7fffffffcbd8, env=env@entry=0x5169f60) at dl-init.c:72
#16 0x00007ffff7de3f0a in call_init (env=0x5169f60, argv=0x7fffffffcbd8, argc=5, l=) at dl-init.c:118
#17 _dl_init (main_map=0x5366290, argc=5, argv=0x7fffffffcbd8, env=0x5169f60) at dl-init.c:119
#18 0x00007ffff3b9e1dc in _dl_catch_exception () from /lib64/libc.so.6
#19 0x00007ffff7de7b2e in dl_open_worker (a=0x7fffffff90e0) at dl-open.c:819
#20 dl_open_worker (a=0x7fffffff90e0) at dl-open.c:782
#21 0x00007ffff3b9e184 in _dl_catch_exception () from /lib64/libc.so.6
#22 0x00007ffff7de7d11 in _dl_open (file=0x7ffff1a4c7c0 “libcudadebugger.so.1”, mode=, caller_dlopen=0x7ffff1870e37, nsid=-2, argc=5,
argv=, env=0x5169f60) at dl-open.c:900
#23 0x00007ffff7bcc1ea in dlopen_doit () from /lib64/libdl.so.2
#24 0x00007ffff3b9e184 in _dl_catch_exception () from /lib64/libc.so.6
#25 0x00007ffff3b9e243 in _dl_catch_error () from /lib64/libc.so.6
#26 0x00007ffff7bcc969 in _dlerror_run () from /lib64/libdl.so.2
#27 0x00007ffff7bcc28a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#28 0x00007ffff1870e37 in ?? () from /lib64/libcuda.so
#29 0x00007ffff15a4c7c in ?? () from /lib64/libcuda.so
--Type <RET> for more, q to quit, c to continue without paging--
#30 0x00007ffff1649a86 in ?? () from /lib64/libcuda.so
#31 0x00007ffff16c0df8 in ?? () from /lib64/libcuda.so
#32 0x00007ffff6b56b98 in __pgi_uacc_cuda_init () at …/…/src/cuda_init.c:464
#33 0x00007ffff70ba1ff in __pgi_uacc_enumerate () at …/…/src/init.c:607
#34 __pgi_uacc_initialize () at …/…/src/init.c:693
#35 0x00007ffff693116c in do_managed_new (n=60, func_name=0x7ffff6932b58 “new”) at …/…/src/cuda_managed_new.cpp:59
#36 0x00007ffff6930e44 in __pgi_managed_new (n=60) at …/…/src/cuda_managed_new.cpp:153
#37 0x0000000000d3f549 in allocate (this=0x7fffffff98a0, __n=60, _T567_39031=0x0)
at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/ext/new_allocator.h:111
#38 0x0000000000d10ee9 in allocate (__a=…, __n=60) at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/bits/alloc_traits.h:436
#39 0x0000000000d2bb6e in _M_create (this=0x7fffffff98a0, __capacity=0x7fffffff9878, __old_capacity=0)
at /usr/lib/gcc/x86_64-redhat-linux/8/…/…/…/…/include/c++/8/bits/basic_string.tcc:153
#40 0x0000000003a8839f in void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*>(char const*, char const*, std::forward_iterator_tag) [clone .constprop.140] ()
#41 0x0000000003a905ce in vtksys::SystemTools::CollapseFullPath(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*) ()
#42 0x0000000003a90756 in vtksys::SystemTools::AddKeepPath(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#43 0x0000000003a90824 in vtksys::SystemTools::ClassInitialize() ()
#44 0x0000000000bc8b22 in _GLOBAL__sub_I_vtkGenericEnSightReader.cxx ()
#45 0x0000000003ccabfd in __libc_csu_init ()
#46 0x00007ffff3a6ec7e in __libc_start_main () from /lib64/libc.so.6

(cuda-gdb) q
A debugging session is active.

    Inferior 1 [process 584806] will be killed.

Quit anyway? (y or n) y
[rkuate@odile-gpu01 developLast]$ nvidia-smi
Tue Nov 7 15:28:51 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:17:00.0 Off |                    0 |
| N/A   32C    P0              43W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+

Note
cuda-gdb details:
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.2 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1

Hi @raphael.kuate
Thank you for the report! To help us identify the issue, please re-run the debugging scenario with additional logging enabled:

  • Set the NVLOG_CONFIG_FILE environment variable to point to the attached nvlog.config file, e.g. NVLOG_CONFIG_FILE=${HOME}/nvlog.config
    nvlog.config (539 Bytes)

  • Run the debugging session.

  • You should see a /tmp/debugger.log file created. Could you share it with us?

Thanks for your response.
I have tried, but no /tmp/debugger.log file was created; the only file updated in /tmp is /tmp/cuda-dbg/cuda-gdb.lock:
[rkuate@odile ~]$ ll -lt /tmp/|head
total 207300
drwxrwxrwx 2 rkuate prolb 27 7 nov. 19:30 cuda-dbg

Hi @raphael.kuate
Could you also try the following config: nvlog.stdout.config (549 Bytes)

Same as with the previous one, please:

  • Save it on your machine.
  • Set the NVLOG_CONFIG_FILE environment variable to point to this file:
export NVLOG_CONFIG_FILE=/path/to/nvlog.stdout.config 
  • Re-run the debugging session from the same terminal. You might also see additional log messages printed to the terminal. Can you share the debugger output with us as well?
  • Check for the /tmp/debugger.log file.

The problem remains the same.
Here is the terminal output:
[rkuate@odile-gpu01 developLast]$ echo $NVLOG_CONFIG_FILE
/odile/home/rkuate/nvlog.stdout.config
[rkuate@odile-gpu01 developLast]$ head $NVLOG_CONFIG_FILE
$ /tmp/debugger.log
UseStdout
ForceFlush
Format $time|$sev|$level|$tid|${name:-20}|${sfunc:-60}- $text
NeverIntrude

All errors and warnings are enabled.

  • 20iI 100wWefEF global

@ 50iI dbg
[rkuate@odile-gpu01 developLast]$ ll -lt /tmp|head
total 44
drwxrwxrwx 2 rkuate prolb 6 8 nov. 11:02 cuda-dbg
-rw------- 1 rkuate prolb 0 8 nov. 10:48 dbus-session-monitor.DkDtMd
drwx------ 3 rkuate prolb 22 7 nov. 20:38 595136
drwx------ 8 rkuate prolb 91 7 nov. 20:29 ompi.odile-gpu01.1116
drwx------ 3 rkuate prolb 22 7 nov. 20:29 594778
drwx------ 3 rkuate prolb 22 7 nov. 20:28 594649
drwx------ 3 rkuate prolb 22 7 nov. 20:27 594506
drwx------ 3 rkuate prolb 22 7 nov. 20:21 594317
drwx------ 3 rkuate prolb 22 7 nov. 20:16 594040
[rkuate@odile-gpu01 developLast]$ cuda-gdb --args build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.2 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type “show copying” and “show warranty” for details.
This GDB was configured as “x86_64-pc-linux-gnu”.
Type “show configuration” for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type “help”.
Type “apropos word” to search for commands related to “word”…
Using python library libpython3.6m.so.1.0
Reading symbols from build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver…
(cuda-gdb) run
Starting program: /local_scratch/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
^C
Program received signal SIGINT, Interrupt.
0x00007ffff4a03e0f in nvompAcquireLock (lock=lock@entry=0x7ffff740ebb8 <__pgi_uacc_data+72>) at nvomp_lock.h:38
38 nvomp_lock.h: Aucun fichier ou dossier de ce type. (No such file or directory.)
(cuda-gdb) q
A debugging session is active.

    Inferior 1 [process 607745] will be killed.

Quit anyway? (y or n) y
[rkuate@odile-gpu01 developLast]$ ll -lt /tmp|head
total 44
drwxrwxrwx 2 rkuate prolb 6 8 nov. 11:07 cuda-dbg
-rw------- 1 rkuate prolb 0 8 nov. 10:48 dbus-session-monitor.DkDtMd
drwx------ 3 rkuate prolb 22 7 nov. 20:38 595136
drwx------ 8 rkuate prolb 91 7 nov. 20:29 ompi.odile-gpu01.1116
drwx------ 3 rkuate prolb 22 7 nov. 20:29 594778
drwx------ 3 rkuate prolb 22 7 nov. 20:28 594649
drwx------ 3 rkuate prolb 22 7 nov. 20:27 594506
drwx------ 3 rkuate prolb 22 7 nov. 20:21 594317
drwx------ 3 rkuate prolb 22 7 nov. 20:16 594040
[rkuate@odile-gpu01 developLast]$

However, it seems the kernels could not be loaded: when I run the code with gdb and break on std::terminate, the backtrace does not appear to enter the kernel.
[rkuate@odile-gpu01 developLast]$ gdb --args build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-18.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type “show copying” and “show warranty” for details.
This GDB was configured as “x86_64-redhat-linux-gnu”.
Type “show configuration” for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type “help”.
Type “apropos word” to search for commands related to “word”…
Reading symbols from build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver…done.
(gdb) break std::terminate
Breakpoint 1 at 0x450f10
(gdb) run
Starting program: /local_scratch/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
Missing separate debuginfo for /odile/home/rkuate/libraries/system/usr/lib64/libatomic.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/8a/f1b6b137f44903c2abc2886f744789c3d1ab7e.debug
[New Thread 0x7fffebc69000 (LWP 607856)]
[New Thread 0x7fffeb468000 (LWP 607857)]
LBsolver 3.1.0 executed on: Wed Nov 08 11:13:55 2023 built on : Nov 7 13:27:27 2023

LBsolver build options | status | effect of build option

Charact. time of the sigma ramp [time step] = 8000
################# current time is 0/60 ################

Thread 1 “lbsolver” hit Breakpoint 1, 0x0000000000450f10 in std::terminate()@plt ()
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-225.el8.x86_64 libgcc-8.5.0-18.el8.x86_64 libstdc++-8.5.0-18.el8.x86_64 nvidia-driver-cuda-libs-535.104.05-1.el8.x86_64 zlib-1.2.11-21.el8_7.x86_64
(gdb) bt
#0 0x0000000000450f10 in std::terminate()@plt ()
#1 0x0000000000df9d80 in std::for_each (_T250_144510=0x49dff2b std::execution::par_unseq, __first=…, __last=…, __f=(unknown: 2196702720))
at /odile/home/rkuate/libraries/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/include/nvhpc/algorithm_execution.hpp:2344
#2 0x0000000000c434e1 in D3Q19HRRcore::DriverFunction (this=0x575d420, inodeBegin=1, inodeEnd=9556, tribal_status=TRIBAL_MONO, functions_to_launch=0x7fffffffba80, i_single_datas=0x7fffffffba68)
at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/hardlibrary/D3Q19HRRcore/D3Q19HRRcore_StdParallelTools.tpp:97
#3 0x0000000002a217da in Solver::delayExecuteRange (this=0x7fff820174c0, time_left=9.99999984e+17, delay=0, ielt_execute=8, range=…) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/solver/SolverDelay.cpp:467
#4 0x0000000002a22cdd in Solver::delayExecuteDelay (this=0x7fff820174c0, delay=0, time_left=@0x7fffffffbe68: 9.99999984e+17) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/solver/SolverDelay.cpp:709
#5 0x0000000002a22e10 in Solver::delayExecuteUntil (this=0x7fff820174c0, ielt_sync=6, time_limit=9.99999984e+17, max_execute_delay=7) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/solver/SolverDelay.cpp:749
#6 0x0000000002aeb929 in Solver::loop (this=0x7fff820174c0) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/solver/SolverLoop.cpp:212
#7 0x00000000024a1c8c in sequenceSolverLoop (process=0x7fff82013100) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/project/SolveProject.cpp:563
#8 0x0000000001f481e6 in ParallelEmulated::executeOnce (this=0x7fff82003f00, sequence=0x24a1c40 <sequenceSolverLoop(ProcessSolve*)>) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/parallel/ParallelEmulated.cpp:63
#9 0x00000000024a3578 in solveProject (parallel=0x7fff82003f00) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/project/SolveProject.cpp:923
#10 0x000000000190f5c8 in run (parallel=0x7fff82003f00) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/labs/run_LaBS.cpp:624
#11 0x000000000190e4e5 in tryLbsolver (argc=5, argv=0x7fffffffd5e8) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/labs/run_LaBS.cpp:503
#12 0x000000000190f6a9 in runLbsolver (argc=5, argv=0x7fffffffd5e8) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/labs/run_LaBS.cpp:644
#13 0x0000000001c4af3d in main (argc=5, argv=0x7fffffffd5e8) at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/mains/LaBS.cpp:18
(gdb)

In GUI mode, the only message I can read is cudaErrorLaunchFailure (179).

Sincerely,

Hi @raphael.kuate
Having operator new overridden can cause such issues (the GPU debugger back-end calls this operator).

One possible solution might be to switch to the legacy back-end (which only uses malloc). Please try running the debugging session with the CUDBG_USE_LEGACY_DEBUGGER=1 environment variable set.

Hi,

Your last suggestion gives the same result.

In fact, overriding operator new has no effect, since these calls are internal to some std:: code. What we do instead is use another allocator where possible, so that CUDA Unified Memory is used only for data shared between host and device. The reason is that we had issues using CUDA Unified Memory for the whole code (we had several exchanges with your organization about it… with Gonzalo Brito, Niveditha Krishnamoorthy …), so the solution was to use another allocator for std containers. However, we had no problem launching cuda-gdb with the previous version of the GPU driver, which targeted CUDA 11. I therefore tried to avoid operator new wherever the current cuda-gdb fails to launch the code, but as you can see, that is not always possible for these internal std:: calls to operator new.
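For reference, the host-only allocator approach looks roughly like the sketch below (hypothetical code, not the project's actual allocator). It keeps std container storage out of CUDA Unified Memory, but, as noted, it cannot cover operator new calls made inside third-party static initializers such as the VTK one in the backtrace:

// Hypothetical sketch of a host-only allocator for std containers, so that only
// data explicitly shared between host and device goes through CUDA Unified Memory.
#include <cstdlib>
#include <new>
#include <string>
#include <vector>

template <class T>
struct HostAllocator {
    using value_type = T;
    HostAllocator() = default;
    template <class U> HostAllocator(const HostAllocator<U>&) {}

    T* allocate(std::size_t n) {
        // Plain malloc keeps this storage in ordinary host memory, bypassing the
        // managed-memory operator new installed by -stdpar / -gpu=managed.
        if (void* p = std::malloc(n * sizeof(T))) return static_cast<T*>(p);
        throw std::bad_alloc();
    }
    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

template <class T, class U>
bool operator==(const HostAllocator<T>&, const HostAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const HostAllocator<T>&, const HostAllocator<U>&) { return false; }

// Host-only container aliases used where the data never needs to be visible on the device.
template <class T>
using host_vector = std::vector<T, HostAllocator<T>>;
using host_string = std::basic_string<char, std::char_traits<char>, HostAllocator<char>>;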

On the other hand, the main problem I have (if it can be solved without using cuda-gdb) is that the GPU kernels do not seem to load; it may be related to the size of the GPU kernels being loaded. I have tried several options; printing PTX info did not show code exceeding a few KB. I also have the following warning, for which I found no appropriate solution (tracking recursion in the code, a custom stack limit, using -Mchkstk with nvc++ …):

<<

nvlink warning : Stack size for entry function ‘_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_10for_each_fI23ContiguousRangeIteratorNS_6detail16wrapped_functionI21ComputeNodeAllPhysicsILi0ELi0EEvEEEExEESC_xEEvT0_T1_’ cannot be statically determined

So is there a way of knowing exactly whether the GPU kernel limits are exceeded?

The profiler nvprof gave me this summary

<<

==3739841== Profiling application: build/cmake/lbsolver-gpuDebug-noMPI/build/lbsolver -i test_cases/CT01_pulse_uniform/input.txt -d test_cases/

==3739841== Warning: 2 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.

However, the suggested --device-buffer-size option did not improve the profiler outcome.

Sincerely,

Raphaël.

Hi @raphael.kuate

In fact, overriding operator new has no effect, since these calls are internal to some std:: code.

In the first message (in the backtrace), the memory allocation calls from libcudadebugger.so (the GPU debugging library, which is loaded into the debugged application) are routed to __pgi_managed_new, which breaks the debugger library's initialization.

However, we had no problem launching cuda-gdb with the previous version of the GPU driver, which targeted CUDA 11.

The debugger back-end changed in CUDA 11.8. The new back-end uses C++ memory allocation, while the old one (pre-11.8) used malloc.

Your last suggestion gives the same result.

Can you share the backtrace of the deadlock (as in the first message) with the CUDBG_USE_LEGACY_DEBUGGER=1 environment variable set? This variable should force the old back-end (which doesn't use C++ memory allocations).

On the other hand, the main problem I have (if it can be solved without using cuda-gdb) is that the GPU kernels do not seem to load; it may be related to the size of the GPU kernels being loaded.

Can you provide more details here - do you see any error messages printed (or CUDA calls returning error codes)?

We can also try reproducing this issue on our end - would it be possible for you to share the repro steps with us?

Hi,
I'm sorry, but I had not checked that the variable was correctly exported, or something went wrong with my terminal session in the last tests. So you are right, the debugger now initializes; I was just saying that operator new is not overloaded.
The problem now is about stack limits. Here are the errors.
Debug version
CUDA Exception: Lane User Stack Overflow

Thread 1 “lbsolver” received signal CUDA_EXCEPTION_2, Lane User Stack Overflow.
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 1, lane 0]
0x00000000060d5bd0 in thrust::cuda_cub::for_each_f<ContiguousRangeIterator, thrust::detail::wrapped_function<ComputeNodeAllPhysics<0, 0>, void> >::operator() (this=, idx=) at /odile/home/rkuate/libraries/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/include-stdpar/thrust/system/cuda/detail/for_each.h:57
57 {

Release version

CUDA Exception: Lane User Stack Overflow

Thread 1 “lbsolver” received signal CUDA_EXCEPTION_2, Lane User Stack Overflow.
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (96,0,0), device 0, sm 0, warp 0, lane 0]
0x000000000468e170 in ComputeNodeAllPhysics<0, 0>::operator() () at /odile_local/home/rkuate/workspace/prolb/lbsolver/gpu/developLast/src/hardlibrary/StdParallelTools.hpp:236
236 inline void operator()(size_t inode) {

The difference is that the kernel entry point is reported more precisely in the release version.

Sincerely

This error indicates that your kernel threads are exceeding the stack limit (see the “Introduction” section of the CUDA-GDB 12.3 documentation).

You can try using cudaDeviceSetLimit to increase the GPU stack size.
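For example, a minimal sketch from host code (assuming the CUDA runtime API is directly callable; the 32 KiB value is only an illustration, not a recommendation):

// Sketch: query and raise the per-thread GPU stack size before the first
// offloaded region. Error handling is kept minimal for brevity.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stack_size = 0;
    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);
    std::printf("default GPU stack size per thread: %zu bytes\n", stack_size);

    // Hypothetical value; tune it to what your kernels actually need.
    cudaDeviceSetLimit(cudaLimitStackSize, 32 * 1024);

    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);
    std::printf("new GPU stack size per thread: %zu bytes\n", stack_size);
    return 0;
}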

Hi,
Thanks a lot. The NVCOMPILER_ACC_CUDA_STACKSIZE environment variable works, since no CUDA code is written, only standard C++ with stdpar for the GPU. I will try later to mix CUDA with standard C++ and stdpar algorithms for the GPU.
Sincerely.

Hi!

Glad I was able to help! Is there anything else we can help with regarding this question?

Hi,
Once more, thanks a lot. I'll be back if further questions occur. But at this point, my development continues.

Thank you for the confirmation! Glad we were able to help!
