Code hangs...

Hi,

I have a production system that has been running for a long time without any problems on Tesla S1070s.

Recently we built another environment with Fermi-based S2050s, and from time to time the code hangs after processing for a few hours.

The toolkit is 3.1, and the other environment parameters should be the same.

If I attach gdb to the process I get the following (there are a lot of threads - two per GPU, and with two S2050s per machine that's 16 threads just to manage the GPUs/CPUs). Thread 16 seems to be the problematic one:

(gdb) info threads

  19 Thread 0x41a54940 (LWP 32431)  0x00000036d260ae00 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  18 Thread 0x429cd940 (LWP 32432)  0x00000036d1e9a0b1 in nanosleep () from /lib64/libc.so.6

  17 Thread 0x46fd4940 (LWP 16549)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  16 Thread 0x40b75940 (LWP 16550)  0x00000036d1eba937 in sched_yield () from /lib64/libc.so.6

  15 Thread 0x465d3940 (LWP 16551)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  14 Thread 0x433ce940 (LWP 16552)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  13 Thread 0x43dcf940 (LWP 16553)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  12 Thread 0x447d0940 (LWP 16554)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  11 Thread 0x451d1940 (LWP 16555)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  10 Thread 0x45bd2940 (LWP 16556)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  9 Thread 0x479d5940 (LWP 16567)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  8 Thread 0x483d6940 (LWP 16568)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  7 Thread 0x48dd7940 (LWP 16569)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  6 Thread 0x497d8940 (LWP 16570)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  5 Thread 0x4a1d9940 (LWP 16571)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  4 Thread 0x4abda940 (LWP 16572)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  3 Thread 0x4b5db940 (LWP 16573)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  2 Thread 0x4bfdc940 (LWP 16574)  0x00000036d260ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

(gdb) thread 16

[Switching to thread 16 (Thread 0x40b75940 (LWP 16550))]#0  0x00000036d1eba937 in sched_yield () from /lib64/libc.so.6

(gdb) backtrace

#0  0x00000036d1eba937 in sched_yield () from /lib64/libc.so.6

#1  0x00002afae0bfc6e5 in ?? () from /usr/lib64/libcuda.so.1

#2  0x00002afae0bfbf12 in ?? () from /usr/lib64/libcuda.so.1

#3  0x00002afae0bfc536 in ?? () from /usr/lib64/libcuda.so.1

#4  0x00002afae0bd6680 in ?? () from /usr/lib64/libcuda.so.1

#5  0x00002afae0c607d7 in ?? () from /usr/lib64/libcuda.so.1

#6  0x00002afae07156e6 in cudaThreadSynchronize () from /usr/local/cuda/lib64/libcudart.so.3

#7  0x00002afadfe8f561 in CalculateSearchOnGPU () from /home/run/lib64/libMyCodeGNU64.so  //-> my method

It seems like cudaThreadSynchronize() hangs, or doesn't return after the kernel run, which appears to have finished fine.

Any thoughts/ideas?

thanks

eyal

What makes you think the kernel finished fine? Maybe it’s still computing (or hung in a loop because of a kernel bug). That’s a more likely explanation than a host-side issue (though that certainly is still possible).

You could change your cudaThreadSynchronize() to a polling method to help you debug this… you could actually print if and when you receive the “kernel finished” event and not depend on the explicit but black box cudaThreadSynchronize().

You say you have 16 threads, so the event/polling method could also burn a lot less CPU if your kernels are any slower than a few milliseconds.

I often run with 3x GTX295s, therefore 6 threads… a polling loop changed my hexacore CPU load from 600% to 5%.

That’s obviously a possibility… however, it works fine on the Tesla cluster, and the same data/work runs well if I kill the app and restart it.

I have a CUDA event to time the kernels and they complete fine… I see the lock happens after the cudaEvent function has returned and cudaThreadSynchronize() is called.

Our machines always use 8 GPUs, and it doesn’t happen on the Tesla cluster, only on the Fermi.

The CPU load doesn’t concern me too much, since the kernels can take anywhere from a few milliseconds to a few seconds.

How would you implement an efficient poll loop so that you don’t waste time while the GPU is idle?

thanks

eyal

The fact that it works on other GPUs, or sometimes works after re-running, doesn’t mean the kernel is fine… it could just be a race inside the kernel that works most of the time but not always. Have you run it in Ocelot to look for races? (That’s no guarantee, but it’s an easy check!)

Calling cudaEventRecord() just queues up the event asynchronously. Its success doesn’t mean the kernel has completed. You need to use the results of cudaEventQuery() to test the queue status.

It’s so simple you may as well try it. Here’s roughly all I do, which works great.

myKernel<<<a,b>>>(myArgs);

cudaEventRecord(event, 0);

do {
    msleep(50);   /* some number of milliseconds significantly shorter than your expected kernel run time */
} while (cudaErrorNotReady == cudaEventQuery(event));

If you’re in Windows, then Sleep() is the right way to hibernate a thread for a given number of milliseconds.
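To make that concrete, a self-contained version of the pattern could look roughly like this (just a sketch: msleep is a hypothetical usleep-based helper, and myKernel/a/b/myArgs are the placeholders from the snippet above, not a real API):

#include <stdio.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* hypothetical millisecond-sleep helper */
static void msleep(unsigned int ms) { usleep(ms * 1000); }

void launchAndWait(void)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    myKernel<<<a, b>>>(myArgs);        /* same launch as above */
    cudaEventRecord(done, 0);          /* queued in stream 0 right after the kernel */

    /* Poll instead of blocking in cudaThreadSynchronize(); the thread sleeps
       between polls, so it burns almost no CPU while the GPU is busy. */
    cudaError_t status;
    while ((status = cudaEventQuery(done)) == cudaErrorNotReady)
        msleep(50);                    /* well below the expected kernel run time */

    if (status != cudaSuccess)
        printf("Kernel failed: %s\n", cudaGetErrorString(status));

    cudaEventDestroy(done);
}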

First of all - you’re 1000% correct :) I looked again at the code and here is a more detailed description of what the code looks like.

Before each kernel call (and mostly before any cudaMemcpy et al…) I do this:

cudaEventRecord( iTimer, 0 );

... // some regular host preparation code....

CallKernelXXX<<< >>>( ... );

cudaError_t err = cudaGetLastError();

if( cudaSuccess != err)  { printf ( "Error...%d", err ); exit( -1 ); }

err = cudaThreadSynchronize();

if( cudaSuccess != err)  { printf ( "Error...%d", err ); exit( -1 ); }

cudaEventRecord( iTimerStop, 0 );

cudaEventSynchronize( iTimerStop );

cudaEventElapsedTime( &fKernelTimer, iTimer, iTimerStop );

printf( "KernelXXX ended after:[%.3f ms]", fKernelTimer );

...

The thing is that whenever the code hangs I see the last timer message in the log file, so it seems the kernel completed fine.

Then the code gets another data chunk to be processed (regular C++ CPU code) and maybe then it gets stuck before the next kernel or memcpy…

I still need to look at it in more depth… but it seems that the kernel has run fine.

I didn’t try Ocelot yet - one of the top 5 things to do on my list :)

That’s nice :). What about very short kernels, say 5-20 ms? Won’t an msleep(5) cause too much overhead?

Any chance implementing a poll like this can save power (for kernels running 2-5 seconds)?

One downside of such a thing is that the CPU utilization will indeed drop to a few percent… my management will cut down my bonuses since the production environment is “under-utilized and idle most of the time” … ;)

Many thanks,

eyal

EDIT:

Here’s another stack trace from another stuck session:

(gdb) thread 6

[Switching to thread 6 (Thread 0x4502f940 (LWP 4512))]#0  0xffffffffff600096 in ?? ()

(gdb) backtrace

#0  0xffffffffff600096 in ?? ()

#1  0x000000004502d770 in ?? ()

#2  0x00000036d1e8bf9d in gettimeofday () from /lib64/libc.so.6

#3  0x00002b81dbe90932 in ?? () from /usr/lib64/libcuda.so.1

#4  0x00002b81db96f681 in ?? () from /usr/lib64/libcuda.so.1

#5  0x00002b81db96ef12 in ?? () from /usr/lib64/libcuda.so.1

#6  0x00002b81db96f536 in ?? () from /usr/lib64/libcuda.so.1

#7  0x00002b81db949680 in ?? () from /usr/lib64/libcuda.so.1

#8  0x00002b81db9d37d7 in ?? () from /usr/lib64/libcuda.so.1

#9  0x00002b81db4886e6 in cudaThreadSynchronize () from /usr/local/cuda/lib64/libcudart.so.3

#10 0x00002b81dac02561 in CalculateSearchOnGPU () from /home/run/lib64/libGGNU64.so

This is just tangential to your problem, but be very careful about relating the outputs from your printfs to actual asynchronous event sequences. There is a lot of buffering in the Linux IO subsystems and unless you are really pedantic about flushing, what you see and the order in which you see it in the output file can be misleading (I have gotten into the habit of time stamping diagnostic messages for this reason).
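For example, a small helper along these lines (just a sketch; the name and timestamp format are arbitrary) both timestamps and flushes every diagnostic message, so the log order reflects when the call was actually made:

#include <stdio.h>
#include <stdarg.h>
#include <sys/time.h>

/* Timestamped, immediately flushed diagnostic message. */
static void logmsg(const char *fmt, ...)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    fprintf(stderr, "[%ld.%06ld] ", (long)tv.tv_sec, (long)tv.tv_usec);

    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);

    fputc('\n', stderr);
    fflush(stderr);
}

/* e.g. logmsg("KernelXXX ended after:[%.3f ms]", fKernelTimer); */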

As far as I understand my code ( :) ), the timing of the kernel and then the printf of this time/message couldn’t have happened before the kernel actually ended correctly. So I still don’t see how this helps me identify whether it’s a software/toolkit/hardware problem…

thanks

eyal

Obviously not (I did say it was tangential to your problem). But if I understand you correctly, you are trying to work out when in the sequence of events the host thread is getting stuck. It obviously happens after the cudaEventSynchronize() call has returned (I note you are not checking the return status, though), but the question is how long after the call returns. Assuming these host threads are persistent, it could be considerably later than you imagine, because of buffering. There might also be messages you never see in the output because they are stuck in a dirty buffer somewhere.
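For the event calls that could be, as a rough sketch (CHECK here is a hypothetical helper macro, not something from your code or the toolkit):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e = (call);                                        \
        if (e != cudaSuccess) {                                        \
            printf("%s failed: %s\n", #call, cudaGetErrorString(e));   \
            exit(-1);                                                  \
        }                                                              \
    } while (0)

/* wrapping the timing calls from your snippet */
CHECK( cudaEventRecord( iTimerStop, 0 ) );
CHECK( cudaEventSynchronize( iTimerStop ) );
CHECK( cudaEventElapsedTime( &fKernelTimer, iTimer, iTimerStop ) );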

Yes this is exactly what I was trying to describe :)

OK, I see. The code is iterative, so after the “last” CUDA call the code should process the next chunk of data… I have some more cudaThreadSynchronize() calls there and more prints, but I don’t see them.

Can I map the addresses GDB gives me to actual code lines? (This is not CUDA related, obviously :) )

Also, because it shows that it is stuck in cudaThreadSynchronize() and the stack is a bit weird, I thought maybe the problem is in the call itself…

#0  0xffffffffff600096 in ?? ()

#1  0x000000004502d770 in ?? ()

....

#9  0x00002b81db4886e6 in cudaThreadSynchronize () from /usr/local/cuda/lib64/libcudart.so.3

#10 0x00002b81dac02561 in CalculateSearchOnGPU () from /home/run/lib64/libGGNU64.so

You can do some pointer arithmetic and work out an offset into the code. But you can also just recompile your code with debugging symbols on and you will get exact line numbers. Going further into the library calls requires disassembly of the cuda runtime library, which can also be done (whether it may be done depends on local copyright laws and how literally you are willing to interpret the EULA you may have agreed to. Or not…).
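For example, once the library is built with -g, something along these lines maps the frame #10 address from your backtrace back to a source line (with addr2line you first subtract the library’s load base, which you can read from /proc/<pid>/maps; <offset> below is just a placeholder for that difference):

(gdb) info sharedlibrary libGGNU64.so

(gdb) info line *0x00002b81dac02561

(gdb) list *0x00002b81dac02561

or, outside gdb:

$ addr2line -f -C -e /home/run/lib64/libGGNU64.so 0x<offset>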

I’ll try to re-compile with debug when I can - this is supposedly a production system.

From past experience, doing this makes the problem go away most of the time :)

I’ll also add time stamps and more debug print outs as you’ve suggested.

Many thanks Avid.

Eyal