Hi,
I have a system in production that has been running for a long time without any problems on a Tesla S1070.
Recently we built another environment with Fermi-based S2050 units, and from time to time the code hangs after processing for a few hours.
The CUDA toolkit is 3.1, and the other environment parameters should be the same.
If I attach gdb to the process I get the output below. There are a lot of threads: 2 per GPU, and I have 2 S2050 units per machine, so that is 16 threads
just to manage the GPUs/CPUs. Thread 16 seems to be the problematic one:
(gdb) info threads
19 Thread 0x41a54940 (LWP 32431) 0x00000036d260ae00 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
18 Thread 0x429cd940 (LWP 32432) 0x00000036d1e9a0b1 in nanosleep () from /lib64/libc.so.6
17 Thread 0x46fd4940 (LWP 16549) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
16 Thread 0x40b75940 (LWP 16550) 0x00000036d1eba937 in sched_yield () from /lib64/libc.so.6
15 Thread 0x465d3940 (LWP 16551) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
14 Thread 0x433ce940 (LWP 16552) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
13 Thread 0x43dcf940 (LWP 16553) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
12 Thread 0x447d0940 (LWP 16554) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
11 Thread 0x451d1940 (LWP 16555) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
10 Thread 0x45bd2940 (LWP 16556) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
9 Thread 0x479d5940 (LWP 16567) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
8 Thread 0x483d6940 (LWP 16568) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
7 Thread 0x48dd7940 (LWP 16569) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
6 Thread 0x497d8940 (LWP 16570) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
5 Thread 0x4a1d9940 (LWP 16571) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 0x4abda940 (LWP 16572) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
3 Thread 0x4b5db940 (LWP 16573) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2 Thread 0x4bfdc940 (LWP 16574) 0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) thread 16
[Switching to thread 16 (Thread 0x40b75940 (LWP 16550))]#0 0x00000036d1eba937 in sched_yield () from /lib64/libc.so.6
(gdb) backtrace
#0 0x00000036d1eba937 in sched_yield () from /lib64/libc.so.6
#1 0x00002afae0bfc6e5 in ?? () from /usr/lib64/libcuda.so.1
#2 0x00002afae0bfbf12 in ?? () from /usr/lib64/libcuda.so.1
#3 0x00002afae0bfc536 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00002afae0bd6680 in ?? () from /usr/lib64/libcuda.so.1
#5 0x00002afae0c607d7 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00002afae07156e6 in cudaThreadSynchronize () from /usr/local/cuda/lib64/libcudart.so.3
#7 0x00002afadfe8f561 in CalculateSearchOnGPU () from /home/run/lib64/libMyCodeGNU64.so //-> my method
It seems that cudaThreadSynchronize hangs, or never returns, after the kernel run, even though the kernel itself seemed to have finished fine.
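One way to narrow this down might be to replace the blocking synchronize with a timed poll on an event, so that a stuck GPU gets logged instead of spinning in sched_yield forever. The following is only a minimal sketch of that idea, not the production code: searchKernel, waitWithTimeout, the 32-element buffer and the 60 second limit are all hypothetical placeholders, and it assumes a single device and the default stream.

```cuda
#include <cstdio>
#include <ctime>
#include <unistd.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real search kernel (not the original code).
__global__ void searchKernel(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

// Poll the event until it completes or timeoutSec seconds have passed.
// Returns cudaSuccess when the work finished, cudaErrorNotReady on timeout,
// or whatever error the runtime reports for the preceding work.
static cudaError_t waitWithTimeout(cudaEvent_t done, int timeoutSec)
{
    time_t start = time(NULL);
    for (;;) {
        cudaError_t st = cudaEventQuery(done);
        if (st != cudaErrorNotReady)
            return st;                      // finished, or failed with an error
        if (difftime(time(NULL), start) > timeoutSec)
            return cudaErrorNotReady;       // still pending after the limit
        usleep(1000);                       // sleep 1 ms between polls
    }
}

int main()
{
    int *d_out = NULL;
    cudaEvent_t done;

    if (cudaMalloc((void **)&d_out, 32 * sizeof(int)) != cudaSuccess ||
        cudaEventCreate(&done) != cudaSuccess) {
        fprintf(stderr, "setup failed: %s\n",
                cudaGetErrorString(cudaGetLastError()));
        return 1;
    }

    searchKernel<<<1, 32>>>(d_out);
    cudaError_t launch = cudaGetLastError();   // catches launch-time errors
    if (launch != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launch));
        return 1;
    }

    // Record an event behind the kernel in the default stream, then poll it.
    cudaEventRecord(done, 0);
    cudaError_t st = waitWithTimeout(done, 60);  // 60 s limit is an assumption

    if (st == cudaErrorNotReady)
        fprintf(stderr, "GPU work still pending after timeout\n");
    else if (st != cudaSuccess)
        fprintf(stderr, "GPU work failed: %s\n", cudaGetErrorString(st));
    else
        printf("GPU work completed\n");

    cudaEventDestroy(done);
    cudaFree(d_out);
    return st == cudaSuccess ? 0 : 1;
}
```

If the timeout path fires on the S2050 machines but never on the S1070 ones, that would at least show the GPU work really is still pending from the runtime's point of view, rather than the host thread missing a completion.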
Any thoughts/ideas?
thanks
eyal