Segmentation fault in __pthread_getspecific called from libcuda.so.1

Problem: Segmentation fault (SIGSEGV, signal 11)

Brief program description:

  • high-performance GPU (CUDA) server handling requests from remote
    clients
  • each incoming request spawns a thread that performs calculations on
    multiple GPUs (serially, not in parallel) and sends a result back to
    the client; this usually takes anywhere between 10 and 200 ms, since
    each request consists of tens or hundreds of kernel calls
  • request handler threads have exclusive access to the GPUs, meaning that
    if one thread is running something on GPU1, all others have to wait
    until it is done (see the sketch after this list)
  • compiled with -arch=sm_35 -code=compute_35
  • using CUDA 5.0
  • I’m not using any CUDA atomics explicitly or any in-kernel
    synchronization barriers, though I am using Thrust (various functions)
    and, of course, cudaDeviceSynchronize()
  • Nvidia driver: NVIDIA dlloader X Driver 313.30 Wed Mar 27 15:33:21 PDT 2013
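
For reference, here is a minimal sketch of the locking pattern described above (the names gpu_locks and handle_request_on_gpu are invented for illustration; the real code is different):

    #include <mutex>
    #include <cuda_runtime.h>

    // One lock per physical card; a handler thread holds the lock for the whole
    // time it works on that GPU, so access to each card is exclusive.
    static std::mutex gpu_locks[4];

    // Hypothetical per-request worker: runs the kernels/Thrust calls for one
    // request on the given card, then synchronizes before releasing the lock.
    void handle_request_on_gpu(int device_id /*, request data ... */)
    {
        std::lock_guard<std::mutex> guard(gpu_locks[device_id]); // exclusive access to this card
        cudaSetDevice(device_id);    // bind this thread to the card
        // ... tens or hundreds of kernel launches / Thrust calls ...
        cudaDeviceSynchronize();     // wait for all work on this card to finish
    }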

OS and HW info:

  • Linux lub1 3.5.0-23-generic #35~precise1-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
  • GPUs: 4x GeForce GTX TITAN
  • 32 GB RAM
  • MB: ASUS MAXIMUS V EXTREME
  • CPU: i7-3770K

Crash information:

The crash occurs “randomly” after a couple of thousand requests have been handled (sometimes sooner, sometimes later). Stack traces from some of the crashes look like this:

#0  0x00007f8a5b18fd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1  0x00007f8a5a0c0cf3 in ?? () from /usr/lib/libcuda.so.1
#2  0x00007f8a59ff7b30 in ?? () from /usr/lib/libcuda.so.1
#3  0x00007f8a59fcc34a in ?? () from /usr/lib/libcuda.so.1
#4  0x00007f8a5ab253e7 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#5  0x00007f8a5ab484fa in cudaGetDevice () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#6  0x000000000046c2a6 in thrust::detail::backend::cuda::arch::device_properties() ()


#0  0x00007ff03ba35d91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1  0x00007ff03a966cf3 in ?? () from /usr/lib/libcuda.so.1
#2  0x00007ff03aa24f8b in ?? () from /usr/lib/libcuda.so.1
#3  0x00007ff03b3e411c in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#4  0x00007ff03b3dd4b3 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#5  0x00007ff03b3d18e0 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#6  0x00007ff03b3fc4d9 in cudaMemset () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#7  0x0000000000448177 in libgbase::cudaGenericDatabase::cudaCountIndividual(unsigned int, ...


#0  0x00007f01db6d6153 in ?? () from /usr/lib/libcuda.so.1
#1  0x00007f01db6db7e4 in ?? () from /usr/lib/libcuda.so.1
#2  0x00007f01db6dbc30 in ?? () from /usr/lib/libcuda.so.1
#3  0x00007f01db6dbec2 in ?? () from /usr/lib/libcuda.so.1
#4  0x00007f01db6c6c58 in ?? () from /usr/lib/libcuda.so.1
#5  0x00007f01db6c7b49 in ?? () from /usr/lib/libcuda.so.1
#6  0x00007f01db6bdc22 in ?? () from /usr/lib/libcuda.so.1
#7  0x00007f01db5f0df7 in ?? () from /usr/lib/libcuda.so.1
#8  0x00007f01db5f4e0d in ?? () from /usr/lib/libcuda.so.1
#9  0x00007f01db5dbcea in ?? () from /usr/lib/libcuda.so.1
#10 0x00007f01dc11e0aa in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#11 0x00007f01dc1466dd in cudaMemcpy () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#12 0x0000000000472373 in thrust::detail::backend::cuda::detail::b40c_thrust::BaseRadixSortingEnactor


#0  0x00007f397533dd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1  0x00007f397426ecf3 in ?? () from /usr/lib/libcuda.so.1
#2  0x00007f397427baec in ?? () from /usr/lib/libcuda.so.1
#3  0x00007f39741a9840 in ?? () from /usr/lib/libcuda.so.1
#4  0x00007f39741add08 in ?? () from /usr/lib/libcuda.so.1
#5  0x00007f3974194cea in ?? () from /usr/lib/libcuda.so.1
#6  0x00007f3974cd70aa in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#7  0x00007f3974cff6dd in cudaMemcpy () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#8  0x000000000046bf26 in thrust::detail::backend::cuda::detail::checked_cudaMemcpy(void*

As you can see, it usually ends up in __pthread_getspecific called from libcuda.so.1, or somewhere inside the library itself. As far as I remember, there has been just one case where it did not crash but instead hung in a strange way: the program could still respond to my requests if they did not involve any GPU computation (statistics etc.), but otherwise I never got a reply. Also, running nvidia-smi -L did not work; it just hung until I rebooted the computer. It looked to me like some sort of GPU deadlock. This might be a completely different issue than this one, though.

Does anyone have a clue where the problem might be or what could cause this?

More information:

I have tried running on fewer cards (3, as that is the minimum needed for the program) and the crash still occurs.

The above is not true: I had misconfigured the application and it was still using all four cards. Re-running the experiments with really just 3 cards seems to resolve the problem; it has now been running for several hours under heavy load without crashes. I will let it run a bit longer, and then perhaps try a different subset of 3 cards to verify this and, at the same time, test whether the problem is related to one particular card or not.

I monitored GPU temperature during the test runs and there does not seem to be anything wrong there. The cards get up to about 78-80 °C under the highest load, with the fans running at about 56%, and stay there until the crash happens (several minutes in); that does not seem too high to me.

One thing I have been thinking about is the way the requests are handled: there are quite a lot of cudaSetDevice calls, since each request spawns a new thread (I’m using the mongoose library) and this thread then switches between cards by calling cudaSetDevice(id) with the appropriate device ID. The switching can happen multiple times during one request, and I am not using any streams (so everything goes to the default (0) stream, IIRC). Can this somehow be related to the crashes occurring in __pthread_getspecific?
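
To make that concrete, one handler thread does roughly the following when it hops between cards (illustration only; process_request_across_gpus and the omitted per-card work are placeholders, not the actual functions):

    #include <cuda_runtime.h>

    // Sketch of the per-thread device switching described above.
    void process_request_across_gpus(int num_gpus /*, request data ... */)
    {
        for (int dev = 0; dev < num_gpus; ++dev) {
            cudaSetDevice(dev);      // device selection is per-thread in the runtime API
            // ... kernels / cudaMemcpy / Thrust calls go to this card's default stream ...
            cudaDeviceSynchronize(); // finish on this card before switching to the next
        }
    }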

I have also tried upgrading to the latest beta drivers (319.12), but that didn’t help.