cublas SEGFAULT in cublasInit() but locally compiled examples run.

When I try to make any call in cublas including cublasInit(), I get a segfault.
I’m using gcc/g++ 4.4.0 on RedHat 5.4.
I can compile and run all the SDK examples with g++44. I can even run this code (the code that segfaults) when I call it from a different executable. The nefarious code is just a simple matrix multiply, but it doesn’t even matter what it is, because I never make it past cublasInit(). It’s located in a C++ class which is compiled into a shared library (so) along with hundreds of others. The executable links in this library (so).

When I run it linked with the emulator version it works fine. When I run it linked into a different executable (our unit tests) it works fine. Runtime API calls work. It runs in one executable but not the other. Has anybody else seen this happen? Are there any compile/link flags that interfere with cublas at runtime?

Here is the core back trace:
(gdb) bt
#0 0x00002aaaab957980 in ?? () from /usr/lib64/
#1 0x00002aaaab95d3c4 in ?? () from /usr/lib64/
#2 0x00002aaaab92d557 in ?? () from /usr/lib64/
#3 0x00002aaaab8d8cf7 in ?? () from /usr/lib64/
#4 0x00002aaaab8ea52b in ?? () from /usr/lib64/
#5 0x00002aaaab8cf940 in ?? () from /usr/lib64/
#6 0x00002aaaab8c8a8a in ?? () from /usr/lib64/
#7 0x00002aaaab923187 in ?? () from /usr/lib64/
#8 0x00002b63ea71beb2 in ?? () from /opt/brs/lib/
#9 0x00002b63ea71c69c in ?? () from /opt/brs/lib/
#10 0x00002b63ea70081d in cudaFree () from /opt/brs/lib/
#11 0x00002b63ea93d110 in cublasInitCtx () from /opt/brs/lib/
#12 0x00002b63ea9871f7 in ?? () from /opt/brs/lib/
#13 0x00002b63ea93d2b0 in cublasInit () from /opt/brs/lib/
#14 0x00002b63e452bd63 in brs::util::CudaTestClass::simpleSinglePrecisionMatrixMutlitply (this=)
    at …/…/brsVAE/src/vaelib/util/CudaTestClass.cpp:92
#15 0x00002b63e553314b in thread_proxy () from /opt/brs/lib/
#16 0x00000032bd6064a7 in start_thread () from /lib64/
#17 0x00000032bcad3c2d in clone () from /lib64/

gcc 4.4 isn’t supported. That looks suspiciously like a runtime code incompatibility. Try using gcc-4.3 or earlier instead (I think the gcc-4.4 install is officially a “preview” version in Red Hat; there is still a gcc-4.1 version available as the mainline compiler).

Yeah 4.4.0 is a preview. I rolled back to gcc/g++ 4.1.2 and I still get the segfault.

Can you think of anything else?

Is there source available for cublas 2.0 or at least a shared lib with debug symbols?

There is no source available for the modern cublas, I am afraid. Am I right in thinking you have wrapped up cublas in some sort of C++ class or wrapper function? This is a wild guess, but it might well be that you need to use plain C malloc rather than the C++ new operator for allocating host-side storage. I personally have never seen anything like this, and I have used cublas pretty extensively in my own codes.

I don’t make it far enough to pass a variable. I can remove everything but cublasInit() and it still fails.

So this would segfault when I call test() inside our exe.

#include <cublas.h>

class CudaTest
{
public:
    // Nothing but the init call, and it still segfaults in our exe.
    void test() { cublasInit(); }
};


What driver and toolkit versions are you using?

Toolkit 2.3

Driver 190.53

Are there any known conflicts with other libraries like Intel Performance Primitives (IPP), Intel Math Kernel Library (MKL), or OMP?

I gave up on cublas for a while and I am writing my own matrix multiply. Now I am running into the same kind of problem with cudaMalloc(). The first time I ever call it in our main exe I get a segfault. If I call the same routine from our unit test exe it works. If I run the main exe in cuda-gdb I get cudaErrorNoDevice as a return from the malloc but no segfault.

For a while I was getting a backtrace that showed the segfault was in libiomp, which has something to do with OMP in MKL. Now I am back to the segfault in cudaMalloc().

Still Frustrated.

The toolkit and SDK versions you have are OK. Sometimes the sort of symptoms you are seeing can be caused by newer toolkit versions on very old drivers. CUDA coexists with MKL to the best of my knowledge.

Are you sure that the actual driver installation is OK? Can you build and run the deviceQuery example from the SDK, for example? It might be that the driver install you have is either missing a piece or is hosed somehow. Is the driver from the NVIDIA installer or a third-party rpm repackage?

On Friday I bought a newer card (GeForce GTS 250) and reinstalled the latest NVIDIA driver, 190.53, with no effect. I can build and run all the SDK examples (much faster now) even with g++ 4.4.0. The SDK and tools are not repackaged. They are from the NVIDIA CUDA download area (2.3).

I did find one interesting thing. When I change the env var OMP_NUM_THREADS (used with MKL) I can move the segfault location around. If OMP_NUM_THREADS=1 we segfault in the cuda lib trying to do a cudaMalloc; if it’s > 1 we segfault in libiomp5 (part of MKL). I am working on removing the MKL calls from our exe to see if that makes a difference.

Removing MKL didn’t help, but I did manage to get past this problem. If I call cublasInit() very early in my main() function, then everything works fine. Our stuff is heavily multi-threaded, so I had been making all my calls to cublas or cuda from inside one of these threads, which don’t launch until a few seconds into a run. It always segfaulted there. I only have to call init early in main(); I can call other cublas functions from anywhere.

I am not sure why this happened. Maybe it’s got something to do with how the device gets mapped into memory by the exe. Maybe it was in the docs and I missed it. I have no idea. But it’s working now.

The device context that the runtime API/CUBLAS establishes on the GPU is thread specific. If you want many threads to be able to use the same GPU you will have to use the context thread migration API (I don’t know how it works, only that it exists). The alternative is to have a specific, persistent worker thread hold the GPU context and send it CUBLAS work. This is how I have implemented it in one of my apps.
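For what it's worth, the persistent worker thread idea can be sketched roughly like this. This is only a minimal illustration, not the code from my app: GpuWorker, submit() and demo_sum() are made-up names, and the actual GPU calls are replaced by plain arithmetic so that it compiles without the toolkit. The comments mark where cublasInit() and cublasShutdown() would presumably go, so that the context is created and used on one and the same thread.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical helper: one persistent thread owns the GPU context; all other
// threads hand it jobs instead of calling CUDA/CUBLAS themselves.
class GpuWorker {
public:
    GpuWorker() : thread_([this] { run(); }) {}

    ~GpuWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    GpuWorker(const GpuWorker&) = delete;
    GpuWorker& operator=(const GpuWorker&) = delete;

    // Queue a job for the worker thread and return a future for its result.
    template <typename F>
    auto submit(F job) -> std::future<decltype(job())> {
        using R = decltype(job());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(job));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

    std::thread::id owner() const { return thread_.get_id(); }

private:
    void run() {
        // Assumption: a real program would call cublasInit() here, so the
        // context belongs to this thread.
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) break;  // stopping, and the queue is drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // runs on the worker thread, outside the lock
        }
        // Assumption: cublasShutdown() would go here.
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::thread thread_;  // declared last so the members above exist before it starts
};

// Demo: four caller threads submit work; every job executes on the worker.
int demo_sum() {
    GpuWorker worker;
    const std::thread::id owner = worker.owner();
    std::atomic<int> sum{0};
    std::vector<std::thread> callers;
    for (int i = 1; i <= 4; ++i) {
        callers.emplace_back([&worker, &sum, owner, i] {
            int r = worker.submit([owner, i] {
                assert(std::this_thread::get_id() == owner);
                return i * i;  // stand-in for a CUBLAS call
            }).get();
            sum += r;
        });
    }
    for (auto& t : callers) t.join();
    return sum;
}
```

Callers block on the returned future, so from their side it still looks like an ordinary function call; only the thread that actually touches the GPU changes.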

It might have been worth mentioning that your app was multithreaded at the beginning of all this; it would have made pinning down your problem a lot faster…