cuDSS in a Fortran MPI program leading to segfault

I am trying to integrate cuDSS into a Fortran codebase; however, running the code leads to errors that differ from run to run:

[2025-07-15 19:51:27][CUDSS][2587975][Api][cudssCreate] start
[2025-07-15 19:51:27][CUSPARSE][2587975][Trace][cusparseCreate] cudaFree(0)
[2025-07-15 19:51:27][CUSPARSE][2587975][Trace][cusparseCreate] cudaGetDevice
[2025-07-15 19:51:27][CUSPARSE][2587975][Trace][cusparseCreate] cudaGetDeviceProperties(0)
[2025-07-15 19:51:27][CUSPARSE][2587975][Trace][cusparseCreate] cudaDriverGetVersion
[2025-07-15 19:51:27][CUSPARSE][2587975][Trace][cusparseCreate] cudaDeviceGetAttribute(115, 0)
[2025-07-15 19:51:27][CUSPARSE][2587975][Trace][cusparseCreate] cudaFuncGetAttributes
[2025-07-15 19:51:27][CUSPARSE][2587975][Api][cusparseCreate] handle[out]=0x563d37f0, version=12.5.9.5
[2025-07-15 19:51:27][CUDSS][2587975][Api][cudssConfigCreate] start
[2025-07-15 19:51:27][CUDSS][2587975][Api][cudssDataCreate] start
[2025-07-15 19:51:27][CUSPARSE][2587975][Api][cusparseSetStream] handle[in]=0x563d37f0, stream[in]=0x3c2d9450
[2025-07-15 19:51:27][CUDSS][2587975][Api][cudssSetStream] start
[2025-07-15 19:51:27][CUDSS][2587975][Api][cudssSetCommLayer] start
Using comm library: /home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/lib/libcudss_commlayer_openmpi.so
[2025-07-15 19:51:27][CUDSS][2587975][Api][cudssDataSet] start
[2025-07-15 19:51:27][CUDSS][2587975][Api][cudssSetThreadingLayer] start
[2025-07-15 19:51:27][CUDSS][2587975][Info][cudssSetThreadingLayer] Default number of threads for the set threading layer = 12
************************************************************
********** solving for frequency       1 /       2
********** complex frequency :       0.4000  Hz +   0.0000
************************************************************
--------------------------------------------------------------------------------
==========> SYMMETRIC Matrix processing 
- wavelength min/max (SI)    :  2.50000E-01 2.50000E-01
- min dof per wavelength     : 
-    order   5 cells         :  3.62317E+01
- matrix global size         :      74898
- matrix approx. global nnz  :    1402542
- global memory for matrix   : 21.400 MiB
- mem/proc for ref matrices  : 20.672 KiB
- matrix creation time       : 1.359 sec
- matrix exact global nnz    :    1402542
[2025-07-15 19:51:28][CUDSS][2587975][Api][cudssMatrixCreateCsr] start
[2025-07-15 19:51:28][CUDSS][2587975][Api][cudssExecute] start
[2025-07-15 19:51:28][CUDSS][2587975][Info][cudssExecute] CUDSS_CONFIG_REORDERING_ALG 0 requires = 80437596 bytes (0.080437596 GB) in host memory
[2025-07-15 19:51:28][CUDSS][2587975][Info][cudssExecute] Using 12 threads on host for the reordering
[eduard-Pro-I5-11F-3060Ti:2587975:0:2587985] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7eac300e937c)
[eduard-Pro-I5-11F-3060Ti:2587975] *** Process received signal ***
[eduard-Pro-I5-11F-3060Ti:2587975] Signal: Segmentation fault (11)
[eduard-Pro-I5-11F-3060Ti:2587975] Signal code: Invalid permissions (2)
[eduard-Pro-I5-11F-3060Ti:2587975] Failing at address: 0x7eac3278a118
[eduard-Pro-I5-11F-3060Ti:2587975] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45810) [0x7eacc4c45810]
[eduard-Pro-I5-11F-3060Ti:2587975] [ 1] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xdfcc04]
[eduard-Pro-I5-11F-3060Ti:2587975] [ 2] --------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node eduard-Pro-I5-11F-3060Ti exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Sometimes it fails instead with something like this:

[2025-07-15 19:55:42][CUDSS][2623400][Api][cudssMatrixCreateCsr] start
[2025-07-15 19:55:42][CUDSS][2623400][Api][cudssExecute] start
[2025-07-15 19:55:42][CUDSS][2623400][Info][cudssExecute] CUDSS_CONFIG_REORDERING_ALG 0 requires = 80241076 bytes (0.080241076 GB) in host memory
[2025-07-15 19:55:42][CUDSS][2623400][Info][cudssExecute] Using 12 threads on host for the reordering
[1752602142.440458] [eduard-Pro-I5-11F-3060Ti:2623400:0]           debug.c:1301 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1752602142.440475] [eduard-Pro-I5-11F-3060Ti:2623400:2]           debug.c:1301 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[eduard-Pro-I5-11F-3060Ti:2623400:2:2623408] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a940)
[eduard-Pro-I5-11F-3060Ti:2623400:4:2623410] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a94c)
[1752602142.440492] [eduard-Pro-I5-11F-3060Ti:2623400:4]           debug.c:1301 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1752602142.440501] [eduard-Pro-I5-11F-3060Ti:2623400:5]           debug.c:1301 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1752602142.440519] [eduard-Pro-I5-11F-3060Ti:2623400:5]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1752602142.440531] [eduard-Pro-I5-11F-3060Ti:2623400:5]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[eduard-Pro-I5-11F-3060Ti:2623400:5:2623407] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a948)
[1752602142.440457] [eduard-Pro-I5-11F-3060Ti:2623400:1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[eduard-Pro-I5-11F-3060Ti:2623400:1:2623404] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a938)
[1752602142.440508] [eduard-Pro-I5-11F-3060Ti:2623400:2]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1752602142.440520] [eduard-Pro-I5-11F-3060Ti:2623400:6]           debug.c:1301 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1752602142.440556] [eduard-Pro-I5-11F-3060Ti:2623400:6]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[eduard-Pro-I5-11F-3060Ti:2623400:6:2623403] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a93c)
[1752602142.440479] [eduard-Pro-I5-11F-3060Ti:2623400:3]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1752602142.440560] [eduard-Pro-I5-11F-3060Ti:2623400:7]           debug.c:1301 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1752602142.440508] [eduard-Pro-I5-11F-3060Ti:2623400:4]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[eduard-Pro-I5-11F-3060Ti:2623400:7:2623400] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a938)
[1752602142.440545] [eduard-Pro-I5-11F-3060Ti:2623400:1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1752602142.440560] [eduard-Pro-I5-11F-3060Ti:2623400:6]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[eduard-Pro-I5-11F-3060Ti:2623400:3:2623405] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7cdd57dc)
[eduard-Pro-I5-11F-3060Ti:2623400:0:2623409] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a950)
[1752602142.440661] [eduard-Pro-I5-11F-3060Ti:2623400:3]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1752602142.440573] [eduard-Pro-I5-11F-3060Ti:2623400:7]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1752602142.440570] [eduard-Pro-I5-11F-3060Ti:2623400:8]           debug.c:1301 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1752602142.440682] [eduard-Pro-I5-11F-3060Ti:2623400:9]           debug.c:1301 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[eduard-Pro-I5-11F-3060Ti:2623400:8:2623401] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a944)
[1752602142.440692] [eduard-Pro-I5-11F-3060Ti:2623400:9]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1752602142.440685] [eduard-Pro-I5-11F-3060Ti:2623400:8]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[eduard-Pro-I5-11F-3060Ti:2623400:9:2623406] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a93c)
[1752602142.440665] [eduard-Pro-I5-11F-3060Ti:2623400:0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1752602142.440692] [eduard-Pro-I5-11F-3060Ti:2623400:10]           debug.c:1301 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[eduard-Pro-I5-11F-3060Ti:2623400:10:2623411] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a938)
[eduard-Pro-I5-11F-3060Ti:2623400:11:2623402] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x77bd7ca8a93c)
[1752602142.440722] [eduard-Pro-I5-11F-3060Ti:2623400:11]           debug.c:1301 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1752602142.440723] [eduard-Pro-I5-11F-3060Ti:2623400:10]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1752602142.440700] [eduard-Pro-I5-11F-3060Ti:2623400:9]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
BFD: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
BFD: DWARF error: section .debug_info is larger than its filesize! (0x4e8e58 vs 0x44a6a0)
[the two BFD messages above repeat dozens of times, interleaved across the backtracing threads]
==== backtrace (tid:2623409) ====
 0 0x0000000000045810 __sigaction()  ???:0
 1 0x0000000000dfcc04 cuBucketSortKeysInc()  ???:0
 2 0x0000000000e04f54 cuMatch_SHEM()  ???:0
 3 0x0000000000e055dd cuCoarsenGraphNlevels()  ???:0
 4 0x0000000000e05637 cuMlevelNodeBisectionL2()  ???:0
 5 0x0000000000e05f66 cuMlevelNestedDissectionP_new()  ???:0
 6 0x000000000000808b cudssParallelFor._omp_fn.0()  tmpxft_0000012b_00000000-6_cudss_mtlayer_omp.cudafe1.cpp:0
 7 0x000000000010122b GOMP_ordered_end()  ???:0
 8 0x0000000000120fc9 __kmp_invoke_microtask()  ???:0
 9 0x0000000000085315 __kmp_fork_call()  ???:0
10 0x00000000000837f6 __kmp_fork_call()  ???:0
11 0x00000000000f8eee __kmpc_for_collapsed_init()  ???:0
12 0x00000000000a2ef1 pthread_condattr_setpshared()  ???:0
13 0x000000000013445c __clone()  ???:0
=================================
[eduard-Pro-I5-11F-3060Ti:2623400] *** Process received signal ***
[eduard-Pro-I5-11F-3060Ti:2623400] Signal: Segmentation fault (11)
[eduard-Pro-I5-11F-3060Ti:2623400] Signal code:  (-6)
[eduard-Pro-I5-11F-3060Ti:2623400] Failing at address: 0x3e8002807a8
[eduard-Pro-I5-11F-3060Ti:2623400] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45810) [0x77be11645810]
[eduard-Pro-I5-11F-3060Ti:2623400] [ 1] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xdfcc04]
[eduard-Pro-I5-11F-3060Ti:2623400] [ 2] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xe04f54]
[eduard-Pro-I5-11F-3060Ti:2623400] [ 3] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xe055dd]
[eduard-Pro-I5-11F-3060Ti:2623400] [ 4] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xe05637]
[eduard-Pro-I5-11F-3060Ti:2623400] [ 5] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xe05f66]
[eduard-Pro-I5-11F-3060Ti:2623400] [ 6] /home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/lib/libcudss_mtlayer_gomp.so(+0x808b) [0x77bda240808b]
[eduard-Pro-I5-11F-3060Ti:2623400] [ 7] /lib/x86_64-linux-gnu/libomp.so.5(+0x10122b) [0x77be777cb22b]
[eduard-Pro-I5-11F-3060Ti:2623400] [ 8] /lib/x86_64-linux-gnu/libomp.so.5(__kmp_invoke_microtask+0x99) [0x77be777eafc9]
[eduard-Pro-I5-11F-3060Ti:2623400] [ 9] /lib/x86_64-linux-gnu/libomp.so.5(+0x85315) [0x77be7774f315]
[eduard-Pro-I5-11F-3060Ti:2623400] [10] /lib/x86_64-linux-gnu/libomp.so.5(+0x837f6) [0x77be7774d7f6]
[eduard-Pro-I5-11F-3060Ti:2623400] [11] /lib/x86_64-linux-gnu/libomp.so.5(+0xf8eee) [0x77be777c2eee]
[eduard-Pro-I5-11F-3060Ti:2623400] [12] /lib/x86_64-linux-gnu/libc.so.6(+0xa2ef1) [0x77be116a2ef1]
[eduard-Pro-I5-11F-3060Ti:2623400] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x13445c) [0x77be1173445c]
[eduard-Pro-I5-11F-3060Ti:2623400] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node eduard-Pro-I5-11F-3060Ti exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

And sometimes it fails with this:

0: DEALLOCATE: memory at (nil) not allocated
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[28935,1],0]
  Exit code:    127
--------------------------------------------------------------------------

The last one in particular happens every time I run the program through GDB, which makes it difficult to pin down the problem. I am launching the executable with:

mpirun --bind-to none -np 1  ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out parameter=par.modeling_acoustic

The .cpp file that wraps cuDSS is compiled with the following command:

/opt/nvidia/hpc_sdk/Linux_x86_64/25.5/compilers/bin/nvc++ -DCUDSS_STATIC_LIBRARY -DHAWEN_CUDSS_COMM_LIB_PATH=\\\"/home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/lib/libcudss_commlayer_openmpi.so\\\" -DHAWEN_CUDSS_GOMP_LIB_PATH=\\\"/home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/lib/libcudss_mtlayer_gomp.so\\\" -DHAWEN_ENABLE_ASSERTIONS -DHAWEN_FORTRAN_IKIND_MAT=i4 -DHAWEN_FORTRAN_IKIND_MESH=i4 -DHAWEN_FORTRAN_IKIND_METIS=i4 -DHAWEN_FORTRAN_RKIND_MAT=sp -DHAWEN_FORTRAN_RKIND_MESH=dp -DHAWEN_FORTRAN_RKIND_METIS=sp -DHAWEN_FORTRAN_RKIND_POL=dp -DHAWEN_USE_CUDSS -I/home/eduard/Github/hawen_worktree/cudss/code/src/macros -I/home/eduard/Github/hawen_worktree/cudss/.cache/CPM/metis_fc/8987/libmetis -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/lib -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi/opal/mca/event/libevent2022/libevent -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi/opal/mca/event/libevent2022/libevent/include -isystem /home/eduard/Github/hawen_worktree/cudss/.cache/CPM/metis_fc/8987/include -isystem /home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/cuda/12.9/targets/x86_64-linux/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/math_libs/12.9/include -g -O0 -std=gnu++20 -Wall -Wextra -pthread -o code/src/CMakeFiles/hawen_lib.dir/linear-algebra/solvers/cuDSS/solver.cpp.o -c /home/eduard/Github/hawen_worktree/cudss/code/src/linear-algebra/solvers/cuDSS/solver.cpp

while the Fortran files, for example the one that wraps the C++ calls, are compiled with these flags:

/opt/nvidia/hpc_sdk/Linux_x86_64/25.5/compilers/bin/nvfortran -DCUDSS_STATIC_LIBRARY -DHAWEN_CUDSS_COMM_LIB_PATH=\\\"/home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/lib/libcudss_commlayer_openmpi.so\\\" -DHAWEN_CUDSS_GOMP_LIB_PATH=\\\"/home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/lib/libcudss_mtlayer_gomp.so\\\" -DHAWEN_ENABLE_ASSERTIONS -DHAWEN_FORTRAN_IKIND_MAT=i4 -DHAWEN_FORTRAN_IKIND_MESH=i4 -DHAWEN_FORTRAN_IKIND_METIS=i4 -DHAWEN_FORTRAN_RKIND_MAT=sp -DHAWEN_FORTRAN_RKIND_MESH=dp -DHAWEN_FORTRAN_RKIND_METIS=sp -DHAWEN_FORTRAN_RKIND_POL=dp -DHAWEN_USE_CUDSS -I/home/eduard/Github/hawen_worktree/cudss/code/src/macros -I/home/eduard/Github/hawen_worktree/cudss/.cache/CPM/metis_fc/8987/libmetis -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/lib -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi/opal/mca/event/libevent2022/libevent -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi/opal/mca/event/libevent2022/libevent/include -isystem /home/eduard/Github/hawen_worktree/cudss/.cache/CPM/metis_fc/8987/include -isystem /home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/cuda/12.9/targets/x86_64-linux/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/math_libs/12.9/include -g -O0 -Mbounds -module include -Wall -Wextra -mp -pthread -c /home/eduard/Github/hawen_worktree/cudss/code/src/linear-algebra/solvers/cuDSS/m_cudss_solver.f90 -o code/src/CMakeFiles/hawen_lib.dir/linear-algebra/solvers/cuDSS/m_cudss_solver.f90.o

If needed I can provide the source of the C++ file; it’s not particularly big. Thank you in advance!

I’ve not used cuDSS myself, so I’m not sure how much help I’ll be here. I might need to send you over to the CUDA accelerated libraries forum.

Though, the one thing I see is that the GNU OpenMP runtime is being used, likely coming from “libcudss_mtlayer_gomp.so”, while the NVHPC OpenMP runtime is also being pulled in (via the compiler flag “-mp”).

OpenMP runtimes don’t mix well, as one may intercept calls meant for the other, leading to undefined behavior. No idea if that’s what’s happening here, but does cuDSS have a non-OpenMP version? Or can you change “-mp” to “-nomp” so the NVHPC runtime isn’t linked in?
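
If it helps to confirm, here is a small diagnostic (a sketch, not cuDSS-specific; dl_iterate_phdr is a glibc API) that prints every shared object loaded into the process whose name mentions "omp", so you can see whether two OpenMP runtimes coexist. The substring match is rough and may also catch e.g. OpenMPI libraries.

#include <link.h>    // dl_iterate_phdr
#include <cstdio>
#include <cstring>

// Print every loaded shared object that looks like an OpenMP runtime
// (e.g. libgomp, libomp, libnvomp). Run this after the libraries in
// question have been loaded (the name is empty for the main executable).
static int print_omp_libs(struct dl_phdr_info *info, size_t, void *) {
    if (info->dlpi_name[0] != '\0' && std::strstr(info->dlpi_name, "omp"))
        std::printf("OpenMP-looking library loaded: %s\n", info->dlpi_name);
    return 0;  // continue iterating over loaded objects
}

int main() {
    dl_iterate_phdr(print_omp_libs, nullptr);
    return 0;
}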


I don’t know if it’s the same error, because the messages vary a bit, but it’s very similar:

[2025-07-16 00:28:37][CUDSS][636126][Api][cudssMatrixCreateCsr] start
[2025-07-16 00:28:37][CUDSS][636126][Api][cudssExecute] start
[2025-07-16 00:28:37][CUDSS][636126][Info][cudssExecute] CUDSS_CONFIG_REORDERING_ALG 0 requires = 97819960 bytes (0.09781996 GB) in host memory
[2025-07-16 00:28:37][CUDSS][636126][Info][cudssExecute] Using 12 threads on host for the reordering
[eduard-Pro-I5-11F-3060Ti:636126:0:638867] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x793358000000)
BFD: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
BFD: DWARF error: section .debug_info is larger than its filesize! (0x4e8e58 vs 0x44a6a0)
BFD: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
[eduard-Pro-I5-11F-3060Ti:636126] *** Process received signal ***
[eduard-Pro-I5-11F-3060Ti:636126] Signal: Segmentation fault (11)
[eduard-Pro-I5-11F-3060Ti:636126] Signal code:  (128)
[eduard-Pro-I5-11F-3060Ti:636126] Failing at address: (nil)
[eduard-Pro-I5-11F-3060Ti:636126] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45810) [0x7933ae645810]
[eduard-Pro-I5-11F-3060Ti:636126] [ 1] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xdefe83]
[eduard-Pro-I5-11F-3060Ti:636126] [ 2] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xdf0cae]
[eduard-Pro-I5-11F-3060Ti:636126] [ 3] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xdf0d73]
[eduard-Pro-I5-11F-3060Ti:636126] [ 4] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xdf0f73]
BFD: [eduard-Pro-I5-11F-3060Ti:636126] DWARF error: section .debug_info is larger than its filesize! (0x4e8e58 vs 0x44a6a0)
[ 5] ../../build/cudss-dev/code/app/forward_waveform_acoustic_isotropic_hdg.out() [0xdf1866]
[eduard-Pro-I5-11F-3060Ti:636126] [ 6] /home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/lib/libcudss_mtlayer_gomp.so(+0x808b) [0x79337640808b]
[eduard-Pro-I5-11F-3060Ti:636126] [ 7] /lib/x86_64-linux-gnu/libomp.so.5(+0x10122b) [0x7934147cb22b]
[eduard-Pro-I5-11F-3060Ti:636126] [ 8] /lib/x86_64-linux-gnu/libomp.so.5(__kmp_invoke_microtask+0x99) [0x7934147eafc9]
[eduard-Pro-I5-11F-3060Ti:636126] [ 9] /lib/x86_64-linux-gnu/libomp.so.5(+0x85315) [0x79341474f315]
[eduard-Pro-I5-11F-3060Ti:636126] [10] /lib/x86_64-linux-gnu/libomp.so.5(+0x837f6) [0x79341474d7f6]
[eduard-Pro-I5-11F-3060Ti:636126] [11] /lib/x86_64-linux-gnu/libomp.so.5(+0xf8eee) [0x7934147c2eee]
[eduard-Pro-I5-11F-3060Ti:636126] [12] /lib/x86_64-linux-gnu/libc.so.6(+0xa2ef1) [0x7933ae6a2ef1]
[eduard-Pro-I5-11F-3060Ti:636126] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x13445c) [0x7933ae73445c]
[eduard-Pro-I5-11F-3060Ti:636126] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node eduard-Pro-I5-11F-3060Ti exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

when compiling with

/opt/nvidia/hpc_sdk/Linux_x86_64/25.5/compilers/bin/nvfortran -DCUDSS_STATIC_LIBRARY -DHAWEN_CUDSS_COMM_LIB_PATH=\\\"/home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/lib/libcudss_commlayer_openmpi.so\\\" -DHAWEN_CUDSS_GOMP_LIB_PATH=\\\"/home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/lib/libcudss_mtlayer_gomp.so\\\" -DHAWEN_ENABLE_ASSERTIONS -DHAWEN_FORTRAN_IKIND_MAT=i4 -DHAWEN_FORTRAN_IKIND_MESH=i4 -DHAWEN_FORTRAN_IKIND_METIS=i4 -DHAWEN_FORTRAN_RKIND_MAT=sp -DHAWEN_FORTRAN_RKIND_MESH=dp -DHAWEN_FORTRAN_RKIND_METIS=sp -DHAWEN_FORTRAN_RKIND_POL=dp -DHAWEN_USE_CUDSS -I/home/eduard/Github/hawen_worktree/cudss/code/src/macros -I/home/eduard/Github/hawen_worktree/cudss/.cache/CPM/metis_fc/8987/libmetis -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/lib -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi/opal/mca/event/libevent2022/libevent -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/include/openmpi/opal/mca/event/libevent2022/libevent/include -isystem /home/eduard/Github/hawen_worktree/cudss/.cache/CPM/metis_fc/8987/include -isystem /home/eduard/Github/hawen_worktree/cudss/.cache/CPM/cudss/570e/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/cuda/12.9/targets/x86_64-linux/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/math_libs/12.9/include -g -O0 -Mbounds -module include -Wall -Wextra -nomp -pthread -c /home/eduard/Github/hawen_worktree/cudss/code/src/linear-algebra/solvers/cuDSS/m_cudss_solver.f90 -o code/src/CMakeFiles/hawen_lib.dir/linear-algebra/solvers/cuDSS/m_cudss_solver.f90.o

Hi @occhipinti.eduard!

The errors you see can indeed be caused by an OpenMP runtime clash if you have several runtimes loaded.

First suggestion: confirm that the issues are related to OpenMP by running cuDSS without MT mode (= without OpenMP; this can be achieved by just commenting out the cudssSetThreadingLayer() call in your code).
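
In code the experiment is just this (a sketch; the handle variable and the library path are placeholders, not your actual code):

// Sketch: create the cuDSS handle as before, but skip the threading layer,
// so no OpenMP-dependent code gets loaded by cuDSS.
cudssHandle_t handle;
cudssCreate(&handle);
// cudssSetThreadingLayer(handle, "/path/to/libcudss_mtlayer_gomp.so");  // disabled for the experiment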

I assume that, without MT mode, cuDSS will succeed (or at least fail with a very different type of error).

Second suggestion: to rule out a problem in the multi-threaded reordering implementation within cuDSS itself, check whether the issue is still there when you run with a single thread (see the sketch below).
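
One way to do that (a sketch; it assumes the CUDSS_CONFIG_HOST_NTHREADS config parameter, and the config variable is a placeholder — setting OMP_NUM_THREADS=1 in the environment should have a similar effect):

// Sketch: limit cuDSS to a single host thread via the solver config.
// Assumes the CUDSS_CONFIG_HOST_NTHREADS parameter exists in your cuDSS
// version; OMP_NUM_THREADS=1 in the environment is an alternative.
int nthreads = 1;
cudssConfigSet(config, CUDSS_CONFIG_HOST_NTHREADS, &nthreads, sizeof(nthreads));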

I suspect execution will still fail (that “Segmentation fault: invalid permissions for mapped object” error looks too suspicious to me).

Third suggestion: I suspect @MatColgrove gave you a good place to dig. What you can do is build your own threading layer for cuDSS against the NVHPC OpenMP runtime. Then you avoid the clash with GNU OpenMP by simply not having a dependency on it.

Fourth suggestion (the most technical one, for debugging aficionados): you can replace cuDSS with dummy code which dlopen()s the threading layer library, uses dlsym() to get the symbols out of it, and then just calls cudssGetMaxThreads(). Ideally you build this code into a dummy *.so, link it against GNU OpenMP, and use it as a proxy for cuDSS.

I mean something like this:

// small library
#include <dlfcn.h>
#include <cstdio>

// some API which exposes the result, like:
int proxy_get_max_threads() {

    // ...

    cudssThreadingInterface_t *thrIface;
    void *thrIfaceLib;

    // path to the threading layer library (here the gomp one from the cuDSS package)
    const char *libname = "libcudss_mtlayer_gomp.so";
    thrIfaceLib = dlopen(libname, RTLD_NOW);
    if (thrIfaceLib == NULL) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }
    thrIface = (cudssThreadingInterface_t*)dlsym(thrIfaceLib, "cudssThreadingInterface");
    if (thrIface == NULL) {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        return -1;
    }
    return thrIface->GetMaxThreads();
}

// in the app:

proxy_get_max_threads();

I hope one of the first three suggestions helps =) I’d say the third one (getting rid of the GNU OpenMP dependency and keeping only one threading runtime, which cuDSS then uses as well) seems to be the most correct solution to me.

Thanks,
Kirill


I think you are right! It seems to work with 1 thread and 1 MPI rank. I am now trying to integrate the shared libraries for the threading and MPI backends into the CMake build; the problem I am facing now is that the provided script gives me errors about some math functions:

➜  src git:(anotinv) module load nvhpc/25.5
➜  src git:(anotinv) echo $NVHPC_ROOT
/opt/nvidia/hpc_sdk/Linux_x86_64/25.5
➜  src git:(anotinv) ls $NVHPC_ROOT
cmake  comm_libs  compilers  cuda  examples  math_libs  profilers  REDIST
➜  src git:(anotinv) find -L $NVHPC_ROOT -name "omp.h"
/opt/nvidia/hpc_sdk/Linux_x86_64/25.5/compilers/include/omp.h
➜  src git:(anotinv) CUDA_PATH=${NVHPC_ROOT}/cuda OPENMP_PATH=${NVHPC_ROOT}/compilers ./cudss_build_mtlayer.sh gomp
Building communication layer with gomp backend
nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/usr/include/x86_64-linux-gnu/bits/mathcalls.h(79): error: exception specification is incompatible with that of previous function "cospi" (declared at line 2601 of /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/cuda/include/crt/math_functions.h)
   extern double cospi (double __x) noexcept (true); extern double __cospi (double __x) noexcept (true);
                                    ^

/usr/include/x86_64-linux-gnu/bits/mathcalls.h(81): error: exception specification is incompatible with that of previous function "sinpi" (declared at line 2556 of /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/cuda/include/crt/math_functions.h)
   extern double sinpi (double __x) noexcept (true); extern double __sinpi (double __x) noexcept (true);
                                    ^

/usr/include/x86_64-linux-gnu/bits/mathcalls.h(79): error: exception specification is incompatible with that of previous function "cospif" (declared at line 2623 of /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/cuda/include/crt/math_functions.h)
   extern float cospif (float __x) noexcept (true); extern float __cospif (float __x) noexcept (true);
                                   ^

/usr/include/x86_64-linux-gnu/bits/mathcalls.h(81): error: exception specification is incompatible with that of previous function "sinpif" (declared at line 2579 of /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/cuda/include/crt/math_functions.h)
   extern float sinpif (float __x) noexcept (true); extern float __sinpif (float __x) noexcept (true);
                                   ^

4 errors detected in the compilation of "cudss_mtlayer_omp.cu".
ls: cannot access 'libcudss_mtlayer_gomp.so': No such file or directory
➜  src git:(anotinv)

Edit: to add a bit more context, the lines referenced in the errors in mathcalls.h are the following (the ones with cospi and sinpi):

#if __GLIBC_USE (IEC_60559_FUNCS_EXT_C23)
/* Arc cosine of X, divided by pi.  */
__MATHCALL (acospi,, (_Mdouble_ __x));
/* Arc sine of X, divided by pi.  */
__MATHCALL (asinpi,, (_Mdouble_ __x));
/* Arc tangent of X, divided by pi.  */
__MATHCALL (atanpi,, (_Mdouble_ __x));
/* Arc tangent of Y/X, divided by pi.  */
__MATHCALL (atan2pi,, (_Mdouble_ __y, _Mdouble_ __x));

/* Cosine of pi * X.  */
__MATHCALL_VEC (cospi,, (_Mdouble_ __x));
/* Sine of pi * X.  */
__MATHCALL_VEC (sinpi,, (_Mdouble_ __x));
/* Tangent of pi * X.  */
__MATHCALL_VEC (tanpi,, (_Mdouble_ __x));
#endif

I am on Ubuntu 25.04

Edit 2: apparently it’s a known bug: "error: exception specification is incompatible" for cospi/sinpi/cospif/sinpif with glibc-2.41 - #3 by stefantalpalaru

It does seem to work now, thank you! It is strangely slow though (exactly 10 times slower, in fact) compared to CPU-only MUMPS, even though I can see that cuDSS is fully using both the CPU and the GPU.

Great that you have resolved the issues!

About performance: this is not expected.

There are many details that might give us some hints:

How large are your matrices?
What is the GPU?
Could you share the output from having CUDSS_LOG_LEVEL=5 in the environment?
If you use MGMN mode, what do you use for communication backend, OpenMPI or NCCL?
Which non-default settings do you use for cuDSS?
Do you use any special features from MUMPS?
What is the time comparison for the individual phases (analysis, factorization, solve)? (A way to measure them separately is sketched below.)
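
For the last question, something like this around the individual cudssExecute calls works (a sketch; the handle/config/data/matrix/stream names are placeholders for your already-created objects):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <cudss.h>

// Sketch: time one cuDSS phase. cudssExecute is asynchronous with respect
// to the host, so synchronize the stream before stopping the clock.
static void timePhase(const char *name, cudssHandle_t handle, cudssPhase_t phase,
                      cudssConfig_t config, cudssData_t data,
                      cudssMatrix_t A, cudssMatrix_t x, cudssMatrix_t b,
                      cudaStream_t stream) {
    auto t0 = std::chrono::steady_clock::now();
    cudssExecute(handle, phase, config, data, A, x, b);
    cudaStreamSynchronize(stream);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    std::printf("%s: %.3f s\n", name, dt.count());
}

// usage:
//   timePhase("analysis",      handle, CUDSS_PHASE_ANALYSIS,      config, data, A, x, b, stream);
//   timePhase("factorization", handle, CUDSS_PHASE_FACTORIZATION, config, data, A, x, b, stream);
//   timePhase("solve",         handle, CUDSS_PHASE_SOLVE,         config, data, A, x, b, stream);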

Or, you can share your matrix data (in matrix market format, possibly archived) and we can check performance on our end.

Thanks,
Kirill

Thank you for your availability! I have been able to narrow the error down to the conversion I was performing from the original one-indexed, unordered COO matrix with duplicates to the CSR format cuDSS expects. I will soon set up a benchmark suite for our software and run some more accurate tests.
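
For reference, a minimal sketch of such a conversion (placeholder types and names, not the actual wrapper code; single-precision complex values chosen to match the build flags, with duplicates merged by summation):

#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

struct Csr {
    std::vector<int> rowPtr;                  // size n + 1, 0-based offsets
    std::vector<int> colInd;                  // 0-based column indices
    std::vector<std::complex<float>> val;     // one value per merged entry
};

// Convert a one-indexed, unordered COO matrix that may contain duplicate
// entries into the zero-indexed, row-sorted, duplicate-free CSR layout.
Csr cooToCsr(int n,
             const std::vector<int>& row,     // 1-based row indices
             const std::vector<int>& col,     // 1-based column indices
             const std::vector<std::complex<float>>& val) {
    // Sort entry indices by (row, col) so duplicates become adjacent.
    std::vector<std::size_t> perm(row.size());
    for (std::size_t k = 0; k < perm.size(); ++k) perm[k] = k;
    std::sort(perm.begin(), perm.end(), [&](std::size_t a, std::size_t b) {
        return row[a] != row[b] ? row[a] < row[b] : col[a] < col[b];
    });

    Csr csr;
    csr.rowPtr.assign(n + 1, 0);
    int prevRow = -1, prevCol = -1;
    for (std::size_t p : perm) {
        const int i = row[p] - 1;             // shift 1-based -> 0-based
        const int j = col[p] - 1;
        if (i == prevRow && j == prevCol) {
            csr.val.back() += val[p];         // merge duplicate entry
        } else {
            csr.colInd.push_back(j);
            csr.val.push_back(val[p]);
            ++csr.rowPtr[i + 1];              // per-row entry count
            prevRow = i;
            prevCol = j;
        }
    }
    // Prefix sum turns per-row counts into row offsets.
    for (int i = 0; i < n; ++i) csr.rowPtr[i + 1] += csr.rowPtr[i];
    return csr;
}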

