CUBLAS 5.5 shutdown segfaults on Ubuntu 12.04 LTS


I am running into a ton of weird issues with 5.5 that do not show up with 5.0, even when using the 5.5 driver together with the 5.0 toolkit. I have a bunch of internal tests that compile and link against the CUDA runtime and CUBLAS (legacy or V2, it doesn't matter which) that all use pinned memory but never initialise the device and, in particular, never use CUBLAS at all: my code decides at runtime whether a given array is malloc'ed on the CPU, allocated in pinned CPU memory, or allocated on the device.

Basically, I'm testing whether I can run all my CPU kernels on arrays allocated through cudaHostAlloc(…,default), which should be perfectly legitimate.
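
To make this concrete, the pattern under test is roughly the following; this is a minimal sketch with made-up names (scale_and_shift, n), not the actual test code:

/* Sketch of the pattern described above: allocate a host array as pinned
 * memory via cudaHostAlloc() with default flags and hand it to an ordinary
 * CPU routine. Link against libcudart (e.g. -lcudart). */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Plain CPU kernel; in the real code the allocation kind (malloc, pinned,
 * device) is decided at runtime, here we only exercise the pinned case. */
static void scale_and_shift(double *x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        x[i] = 2.0 * x[i] + 1.0;
}

int main(void)
{
    const size_t n = 1 << 20;
    double *x = NULL;

    /* Pinned (page-locked) host allocation, default flags. */
    if (cudaHostAlloc((void **)&x, n * sizeof(double), cudaHostAllocDefault) != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed\n");
        return EXIT_FAILURE;
    }

    for (size_t i = 0; i < n; ++i)
        x[i] = (double)i;

    scale_and_shift(x, n);            /* CPU-only work on the pinned buffer */
    printf("x[42] = %g\n", x[42]);

    cudaFreeHost(x);
    return EXIT_SUCCESS;              /* the crash, if any, happens after this */
}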

The problem is: all these tests die horribly at shutdown with stack traces originating from within CUBLAS, even though I never even initialise a CUDA/CUBLAS context! Example (this is a full trace; it happens after my valgrind-safe application calls 'exit'):

  • /lib/x86_64-linux-gnu/ [0x2b3d7efd44a0]
  • /lib64/ [0x2b3d7a316181]
  • /lib64/ [0x2b3d7a310176]
  • /lib/x86_64-linux-gnu/ [0x2b3d7f65f52f]
  • /lib/x86_64-linux-gnu/ [0x2b3d7f65f00f]
  • /sfw/cuda/5.5/lib64/ [0x2b3d7a933e6c]
  • /sfw/cuda/5.5/lib64/ [0x2b3d7a933f75]
  • /lib/x86_64-linux-gnu/ [0x2b3d7efd9d1d]
  • /sfw/cuda/5.5/lib64/ [0x2b3d7a790ec6]

Unfortunately, all my attempts to boil this down to a simple reproducer have failed. So my question is twofold: Has anyone ever seen this before? And is there a way to statically link against CUDA/CUBLAS, since the issue apparently lies in the .so unloading mechanism of the dynamic loader (I'm afraid static linking is unsupported)?

To increase the fun, this only happens in debug builds; it does not happen in optimised builds, and it especially does not happen when running through gdb or DDT. Sigh. None of the CUDA/CUBLAS-related functionality built into my library uses global variables, which are occasionally "invisible" in gdb.

Thanks for any shots in the dark; I'm honestly lost…


PS: a few more traces obtained by connecting to the running app with gdb:

masterslave () at
89        call par_postexit
(gdb) cont
91      end program masterslave
(gdb) cont
__libc_start_main (main=0x6b1b84 <main>, argc=2, ubp_av=0x7fffba6785f8,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffba6785e8) at libc-start.c:258
258     libc-start.c: No such file or directory.
(gdb) cont

Program received signal SIGSEGV, Segmentation fault.
_dl_close (_map=0x600000002) at dl-close.c:757
757     dl-close.c: No such file or directory.
(gdb) bt
#0  _dl_close (_map=0x600000002) at dl-close.c:757
#1  0x00002b3226c77176 in _dl_catch_error (objname=0xf0d3020,
errstring=0xf0d3028, mallocedp=0xf0d3018, operate=0x2b322bbccfe0
<dlclose_doit>, args=0x600000002) at dl-error.c:178
#2  0x00002b322bbcd52f in _dlerror_run (operate=0x2b322bbccfe0
<dlclose_doit>, args=0x600000002) at dlerror.c:164
#3  0x00002b322bbcd00f in __dlclose (handle=0x600000002) at dlclose.c:48
#4  0x00002b322729be6c in ?? () from /sfw/cuda/5.5/lib64/
#5  0x00002b322729bf75 in ?? () from /sfw/cuda/5.5/lib64/
#6  0x00002b322b631d1d in __cxa_finalize (d=0x2b322a57f7d0) at
#7  0x00002b32270f8ec6 in ?? () from /sfw/cuda/5.5/lib64/
#8  0x0000000000000018 in ?? ()
#9  0x00007fffb540e880 in ?? ()
#10 0x00007fffb540e9b0 in ?? ()
#11 0x00002b32272c3ca1 in ?? () from /sfw/cuda/5.5/lib64/
#12 0x00007fffb540e9b0 in ?? ()
#13 0x00002b3226c7792d in _dl_fini () at dl-fini.c:259
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Could you please file a bug with a small repro?
The issue may already be fixed for 5.5, but we would like to be sure.

Sorry, I've been a bit imprecise: the regression only occurs with CUDA 5.5; everything is fine with 5.0. It even works well if I use the 5.5 driver with the 5.0 toolkit.

What exactly are you saying? That the issue might have been fixed in 5.5 final, or in 6.0?

I am still unable to boil this down to a small reproducer (none of my attempts crash), and the full app is probably too big to serve as a repro case. I have a big Fortran code that needs some BLAS for its host portions and that links against a static library containing C glue wrappers for CUDA kernels, CUBLAS calls, and some pinned-memory handling. So all potentially needed CUBLAS symbols are there (according to nm), but I am 100% sure that I never create a CUDA context. Is there a magic routine to check this? So far I have been under the impression that contexts are created upon the first CUDA call, not simply because some CUDA code is linked in but never used, and there is positively no such call in this particular regression.
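
For what it's worth, the closest thing to such a "magic routine" that I can think of is asking the driver API directly; here is a minimal sketch (illustrative only, not part of the actual app, and assuming it is acceptable to link the check against libcuda):

/* Sketch: query whether a CUDA context exists on the calling thread.
 * cuCtxGetCurrent() does not create a context; it returns NULL in *pctx
 * when no context is current. Link with -lcuda. */
#include <cuda.h>
#include <stdio.h>

void report_cuda_context(const char *where)
{
    CUcontext ctx = NULL;
    CUresult rc = cuCtxGetCurrent(&ctx);

    if (rc == CUDA_ERROR_NOT_INITIALIZED || rc == CUDA_ERROR_DEINITIALIZED)
        printf("[%s] driver API not initialised on this thread\n", where);
    else if (rc == CUDA_SUCCESS && ctx == NULL)
        printf("[%s] driver initialised, but no context is current\n", where);
    else if (rc == CUDA_SUCCESS)
        printf("[%s] a context exists: %p\n", where, (void *)ctx);
    else
        printf("[%s] cuCtxGetCurrent returned error %d\n", where, (int)rc);
}

Calling something like this right before exit would at least tell me whether a context has silently appeared behind my back.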

To me, it simply does not make sense that something should segfault in CUBLAS when the application has already shut down, as I definitely never call cublasInit(), cublasCreate(), or any other CUBLAS function. The traces above point to stack corruption.

I also tried swapping the underlying host BLAS: the behaviour is the same with MKL, OpenBLAS, and Goto2, and the same whether I link to them dynamically or statically.

If you really want the full repro case, please contact me via email. I can share the full code, but I suspect that if I submit a bug with a lengthy description of how to actually reproduce it, it will go unnoticed.


I wasn't clear: the bug is present in 5.5 RC, but it should be fixed in 5.5 final.
I said "might" because we think your issue is similar to another one that we have already fixed, but
without a repro we can't be 100% sure.
Filing an official bug is the best way to get a bug fixed and to provide repro cases; it will not go unnoticed.
A forum post, on the other hand, may get attention from the right people, or it may go completely unnoticed.