Hi,
Today I encountered a very strange error. Here is the description of my program:
Hardware: Three GTX 280 cards, quad-core Xeon, 8 GB RAM.
Software: Three threads, one running on each card. The program launches 1000 jobs on each thread. Each job copies an image to its assigned GPU with a cudaMemcpy.
Observations: The first CPU thread works fine. But when the second thread reaches its cudaMemcpy, the application crashes with a segmentation fault most of the time. Here is the gdb trace:
[codebox]Starting program: /data/home/smekens/workspace/sandbox/bin/mx4_prepare2 test.xconf desc
[Thread debugging using libthread_db enabled]
[New Thread 0x7fadd1ade6f0 (LWP 9729)]
[New Thread 0x41925950 (LWP 9730)]
[New Thread 0x42126950 (LWP 9731)]
[New Thread 0x42927950 (LWP 9732)]
[New Thread 0x43128950 (LWP 9733)]
[New Thread 0x43929950 (LWP 9734)]
[New Thread 0x4412a950 (LWP 9735)]
[New Thread 0x4492b950 (LWP 9736)]
Device = 0
Device = 1
Notice : (Kernel2::search) kernel_0 run task = IMG2DESC (0)
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x42927950 (LWP 9732)]
0x00007fadcfd7c6f5 in ?? () from /usr/lib/libcuda.so.1
(gdb) bt
#0 0x00007fadcfd7c6f5 in ?? () from /usr/lib/libcuda.so.1
#1 0x00007fadcfd69a8c in ?? () from /usr/lib/libcuda.so.1
#2 0x00007fadcfd615b8 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007fadcfb288d4 in ?? () from /data/home/smekens/local/cuda/lib/libcudart.so.2
#4 0x00007fadcfb161a7 in cudaMemcpy () from /data/home/smekens/local/cuda/lib/libcudart.so.2
#5 0x000000000046f059 in pirix_cuda_img2dsc (obsidian=0x5e5e6c0, src=0x7082bf0, det_thresh=200, desc_T=0.00200000009) at cuda/pirix_cuda.cu:88
#6 0x000000000042f03a in pirix_img2dsc (pirix=0x2f6d520, src=0x7082bf0) at pirix.c:188
#7 0x000000000040b21b in mx4::Kernel2::runTask (this=0x2f68c20, type=, ptr=) at core/kernel2.cpp:183
#8 0x000000000040e1e0 in kernelMainThread (t=0x2f68c20) at core/dispatcher.cpp:24
#9 0x00007fadd01e4fc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007fadcec2c5ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb)
[/codebox]
Should I put a mutex around the cudaMemcpy to prevent this crash from happening?
Have I forgotten to do something important for a multithreaded application?
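For reference, the mutex workaround I have in mind would look roughly like the sketch below. It is only an illustration, not my real code: `g_copy_mutex`, `copy_image`, `worker` and `run_workers` are made-up names, and a plain `std::memcpy` stands in for the real `cudaMemcpy` call so that the sketch compiles without the CUDA toolkit. I do not know whether such a lock is actually required or whether it would only hide the real problem.

```cpp
#include <cassert>
#include <cstring>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical global lock shared by all worker threads.
static std::mutex g_copy_mutex;

// Stand-in for the real host-to-device copy. In the actual program this
// would be cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice).
static void copy_image(unsigned char* dst, const unsigned char* src,
                       std::size_t bytes) {
    std::lock_guard<std::mutex> lock(g_copy_mutex);  // one copy at a time
    std::memcpy(dst, src, bytes);
}

// One worker per card: runs `jobs` copies into its own destination buffer.
static void worker(int jobs, const std::vector<unsigned char>& image,
                   std::vector<unsigned char>& device_buf) {
    assert(image.size() == device_buf.size());
    for (int j = 0; j < jobs; ++j) {
        copy_image(device_buf.data(), image.data(), image.size());
    }
}

// Starts `num_threads` workers, waits for them, and returns how many
// destination buffers match the source image afterwards.
int run_workers(int num_threads, int jobs, std::size_t image_bytes) {
    std::vector<unsigned char> image(image_bytes, 0xAB);
    std::vector<std::vector<unsigned char>> bufs(
        num_threads, std::vector<unsigned char>(image_bytes, 0));

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker, jobs, std::cref(image),
                             std::ref(bufs[t]));
    }
    for (auto& th : threads) {
        th.join();
    }

    int ok = 0;
    for (const auto& b : bufs) {
        if (b == image) ++ok;
    }
    return ok;
}
```

Calling `run_workers(3, 1000, 4096)` would mirror my setup of three threads with 1000 jobs each. The obvious drawback is that the lock serializes the copies, so the three cards could no longer receive data at the same time.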
I hope this problem will be fixed soon.