Problem: Segmentation fault (SIGSEGV, signal 11)
Brief program description:
- high-performance GPU (CUDA) server handling requests from remote clients
- each incoming request spawns a thread that performs calculations on multiple GPUs (serially, not in parallel) and sends the result back to the client; this usually takes anywhere between 10 and 200 ms, as each request consists of tens or hundreds of kernel calls
- request handler threads have exclusive access to the GPUs, meaning that if one thread is running something on GPU1 all others have to wait until it is done (see the rough sketch after this list)
- compiled with -arch=sm_35 -code=compute_35
- using CUDA 5.0
- I'm not using any CUDA atomics explicitly or any in-kernel synchronization barriers, though I am using Thrust (various functions) and, obviously, cudaDeviceSynchronize()
- Nvidia driver: NVIDIA dlloader X Driver 313.30 Wed Mar 27 15:33:21 PDT 2013
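To make the threading model clearer, here is a rough sketch of what a request handler does; the identifiers and the mutex-based locking shown here are simplified illustrations of the description above, not the actual code:

#include <cuda_runtime.h>
#include <pthread.h>

#define NUM_GPUS 4

// One mutex per GPU: a handler thread gets exclusive access to a GPU,
// so other threads block until it is done (illustrative, not the real code).
static pthread_mutex_t g_gpu_locks[NUM_GPUS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

// One request's work on one GPU.
static void run_on_gpu(int gpu /*, request data ... */)
{
    pthread_mutex_lock(&g_gpu_locks[gpu]);
    cudaSetDevice(gpu);          // bind this host thread to the GPU
    // ... tens to hundreds of kernel launches / Thrust calls ...
    cudaDeviceSynchronize();     // wait for all work on this GPU to finish
    pthread_mutex_unlock(&g_gpu_locks[gpu]);
}

// Each incoming request gets its own thread, which uses the GPUs serially.
static void *handle_request(void *arg)
{
    for (int gpu = 0; gpu < NUM_GPUS; ++gpu)
        run_on_gpu(gpu /*, ... */);
    // ... send the result back to the client ...
    return NULL;
}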
OS and HW info:
- Linux lub1 3.5.0-23-generic #35~precise1-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
- GPUs: 4x GeForce GTX TITAN
- 32 GB RAM
- MB: ASUS MAXIMUS V EXTREME
- CPU: i7-3770K
Crash information:
Crash occurs “randomly” after a couple of thousand requests are handled (sometimes sooner, sometimes later). Stack traces from some of the crashes look like this:
#0 0x00007f8a5b18fd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1 0x00007f8a5a0c0cf3 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007f8a59ff7b30 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007f8a59fcc34a in ?? () from /usr/lib/libcuda.so.1
#4 0x00007f8a5ab253e7 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#5 0x00007f8a5ab484fa in cudaGetDevice () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#6 0x000000000046c2a6 in thrust::detail::backend::cuda::arch::device_properties() ()
#0 0x00007ff03ba35d91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1 0x00007ff03a966cf3 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007ff03aa24f8b in ?? () from /usr/lib/libcuda.so.1
#3 0x00007ff03b3e411c in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#4 0x00007ff03b3dd4b3 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#5 0x00007ff03b3d18e0 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#6 0x00007ff03b3fc4d9 in cudaMemset () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#7 0x0000000000448177 in libgbase::cudaGenericDatabase::cudaCountIndividual(unsigned int, ...
#0 0x00007f01db6d6153 in ?? () from /usr/lib/libcuda.so.1
#1 0x00007f01db6db7e4 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007f01db6dbc30 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007f01db6dbec2 in ?? () from /usr/lib/libcuda.so.1
#4 0x00007f01db6c6c58 in ?? () from /usr/lib/libcuda.so.1
#5 0x00007f01db6c7b49 in ?? () from /usr/lib/libcuda.so.1
#6 0x00007f01db6bdc22 in ?? () from /usr/lib/libcuda.so.1
#7 0x00007f01db5f0df7 in ?? () from /usr/lib/libcuda.so.1
#8 0x00007f01db5f4e0d in ?? () from /usr/lib/libcuda.so.1
#9 0x00007f01db5dbcea in ?? () from /usr/lib/libcuda.so.1
#10 0x00007f01dc11e0aa in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#11 0x00007f01dc1466dd in cudaMemcpy () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#12 0x0000000000472373 in thrust::detail::backend::cuda::detail::b40c_thrust::BaseRadixSortingEnactor
#0 0x00007f397533dd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1 0x00007f397426ecf3 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007f397427baec in ?? () from /usr/lib/libcuda.so.1
#3 0x00007f39741a9840 in ?? () from /usr/lib/libcuda.so.1
#4 0x00007f39741add08 in ?? () from /usr/lib/libcuda.so.1
#5 0x00007f3974194cea in ?? () from /usr/lib/libcuda.so.1
#6 0x00007f3974cd70aa in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#7 0x00007f3974cff6dd in cudaMemcpy () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#8 0x000000000046bf26 in thrust::detail::backend::cuda::detail::checked_cudaMemcpy(void*
As you can see, it usually ends up in __pthread_getspecific called from libcuda.so, or somewhere in the library itself. As far as I remember there has been just one case where it did not crash but instead hung in a strange way: the program was able to respond to my requests as long as they did not involve any GPU computation (statistics etc.), but otherwise I never got a reply. Also, nvidia-smi -L did not work; it just hung there until I rebooted the computer. It looked to me like some sort of GPU deadlock. That might be a completely different issue than this one, though.
Does anyone have a clue where the problem might be or what could cause this?