Why does this kernel hang?

Hi everybody,
can anyone tell me why the computer hangs with this code?
Be careful when testing: I need a hard reset on my GeForce 8400M GS + Ubuntu 8.04.
I compile simply with nvcc -arch=sm_11 green.cu.
What feature have I run into?

Regards,
Sergiy

__device__ int volatile green=0;
__global__ void kernel(void){
	while(green != threadIdx.x);
	//critical section
	++green; // green light for the next thread
}

int main(void){
kernel<<<1,2>>>();
}

Not sure why it hangs, but I am fairly sure the second thread will never execute.

The first thread goes through the code, the global variable “green” becomes one, and then the second thread should execute.

Where is my misunderstanding?

Regards,

 Sergiy

First let me say I’m just a newbie. Now, I see a major problem when all the threads try to access the same variable at once (since it’s volatile). Say you have 1000 threads all trying to read the same variable. There’s no guarantee that thread 0 will even get a chance to read it, because every other thread will try to read it, and once a thread successfully reads the variable it will just try to read it again and again and again… I think I’ve made my point.

Not sure why it hangs, but this seems to fix it:

__device__ int green=0;

__global__ void hangkernel(void){
	while (green != threadIdx.x);
	//critical section
	atomicAdd(&green, 1);
}

It works… But what does it mean?

  1. Volatile doesn’t work - is the read access buffered?

  2. Is the write blocked while a read is active?

  3. Does the atomic have a side effect on warp recalculation?

  4. …?

It’s an architectural question; the code example is derived from a very complex algorithm.

I cannot find any document about the GPU memory organization.

CUDA IS THE MOST CLOSED ARCHITECTURE I HAVE EVER SEEN.

SUITABLE ONLY FOR VECTOR SUM, NOT MORE.

Regards,

Sergiy

One should know what he does and how things work before blaming third-party tools.

The atomic add probably solved the problem because you caused some kind of deadlock/race condition etc…

I come from the online transaction processing field… it’s so common to hear young developers, inexperienced ones, or simply some that love to rant, blame the Oracle Database for every bug they have… come on… you must be more serious.

As another comment, just take a look at the CUDA home page to see success stories in all sorts of fields; surely most of them are more than “ONLY FOR VECTOR SUM, NOT MORE”.

my 1 cent

eyal

For sure, with some patience and by reading the programming guide you can solve many problems and understand problem-specific behaviours, but Goo is right when he describes CUDA as very closed. Sometimes it is really frustrating to lose time on simple problems which could be avoided with better documentation.

Please give an example, because as far as I am concerned, the documentation is quite good.

What is different is that parallel programming is not always easy, and people then quickly blame CUDA for being difficult. And then when you complain about it, it is easier to say that CUDA documentation is bad than to say that you find parallel programming difficult ;) (been there, done that :P)

From your point of view - more than a thousand posts in the CUDA community - these GPU topics are self-explanatory, but ask a new developer with some experience in parallel programming how to copy global device memory to constant memory directly. With better documentation that question could be answered in seconds.
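
(For what it’s worth, here is a minimal sketch of that copy - the names constData and d_src are made up for illustration. cudaMemcpyToSymbol with cudaMemcpyDeviceToDevice copies straight from global device memory into a constant symbol:)

__constant__ float constData[256];

// d_src points to global device memory that already holds 256 floats
void copyGlobalToConstant(const float *d_src){
	// device-to-device copy into the __constant__ symbol, starting at offset 0
	cudaMemcpyToSymbol(constData, d_src, 256 * sizeof(float), 0,
	                   cudaMemcpyDeviceToDevice);
}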

edit:
Well, am I right when I say that nobody has really answered the question asked in this topic?
IMHO the best answer contains words like “probably”…
We are trying to discuss a simple question and should give a simple answer.
Even if you know how the atomic functions work, you do not know what really happens without them.

Please believe me, I want to find an answer, not just to start a flame war.
I ran a lot of tests around initialization and optimization - the same effect.
Changing global memory to shared - the same effect.
The C compiler generates correct volatile instructions in the PTX, but it’s hard to check ptxas, because it knows the machine code while third-party disassemblers actually do not.
I suppose the write to memory is always deferred, and only __syncthreads() commits the changes.
The volatile modifier (which came with 1.1) just doesn’t work.
Is that true?
Nvidia staff, only you may know ;)

Regards,
Sergiy

I have a theory as to why the deadlock happens, and it has to do with divergent warps.

But first, I need to mention that I made a mistake in my previous example with atomicAdd(). While investigating, I discovered that the reason it doesn’t lock up is that I had to make green non-volatile, because atomicAdd() does not accept volatile arguments. Making it non-volatile totally messed up the functionality, so it would not actually protect the critical section at all. Oops!

Ok. Now for the lock-up problem.

Consider:

if (condition) {
	a();
} else {
	b();
}

When branching, the processor has a choice between executing the true branch first, the false branch first, or executing both and time-slicing between the divergent branches. My thinking, based on the behavior, is that it does not time-slice between them but instead executes them sequentially. Looking at the decuda output, it has “join” points which, from the looks of it, cause one branch to end, transferring control to the other branch so it can catch up.

So for example, the assembly code looks like this:

  1. calculate predicate for condition

  2. predicated branch to TrueLabel (begin divergence)

  3. do stuff for b()

  4. goto JoinLabel

  5. TrueLabel:

  6. do stuff for a()

  7. JoinLabel:

  8. join

When executing divergent code, the processor goes through steps 1 and 2, then starts at 5, knowing that it has to come back to 3 for the threads for which the condition was false. It executes 6, 7, and 8, and when it reaches the join, it backtracks to 3, then 4, then 8 again. Since all the threads have now reached 8, execution is no longer divergent and processing continues.

But what if “do stuff for a()” is in an infinite loop waiting for “do stuff for b()”? It’s a deadlock type of situation!

Now consider this:

  1. load “green” from global memory

  2. calculate predicate for green != threadIdx.x

  3. predicated branch to step 1

  4. add 1 to value of green

  5. store to “green” in global memory

  6. join

My theory is that execution goes step 1, 2, 3, and then the “true” fork gets executed, and the “false” fork gets deferred until the true fork finishes (reaches the join). The true fork loops back to 1, and gets stuck in the loop 1, 2, 3 forever. It’s deadlocked waiting for the false fork, and the false fork is waiting for the true fork to reach a join.

Hypothesizing that the true fork gets executed first, I tried this, and it does not hang:

__device__ int volatile green=0;

__global__ void hangkernel(void){
	int executed = 0;
	while (!executed) {
		if (green == threadIdx.x) {
			//critical section
			++green;
			executed = 1;
		}
	}
}
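
(A minimal host-side harness to try this out - my own addition, written against the current runtime API rather than the 2008-era toolkit - just launches the kernel and reads green back with cudaMemcpyFromSymbol:)

#include <cstdio>

int main(void){
	hangkernel<<<1,2>>>();
	cudaDeviceSynchronize();    // returns promptly here; the original kernel never did

	int result = 0;
	// copy the device symbol back; expect 2 once both threads have passed the critical section
	cudaMemcpyFromSymbol(&result, green, sizeof(int));
	printf("green = %d\n", result);
	return 0;
}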

The lesson to learn from this is that mutexes within a warp are very dangerous and can deadlock easily.

I always understood why multi-block barriers are hazardous, and now I understand why tmurray said that mutexes were dangerous too! It is certainly not a good idea to depend on the order of the divergence because it can change easily depending on how the compiler decides to implement the logic. It can lead to silent problems that lurk until mutual exclusion is needed, and then hangs the program with deadlock. Ouch.

Thank you very much, Jamie.

Looks like it’s true.

(It’s a pleasure to see a professional message.)

I have a lot of hanging kernels; all of them use branching.

It’s hard to imagine that it’s always so dangerous.

Deviceemu knows nothing about it.

I’ll keep testing.

There’s no mechanism for dynamic thread invocation, and it’s not so simple to keep a “thread pool” of running threads in hot reserve.

Thanks again.

Regards,

Sergiy

p.s. … the “true” fork should finish in all threads! Oh, God…

p.p.s. Nvidia, this must be documented. At least the “join” mechanism.

Btw, Jamie, why don’t we get a deadlock with the atomic?..

Maybe a deferred memory write can’t be discarded so quickly?

edit: sorry, the following does hang. All right )

__device__ int volatile green = 0;

__global__ void hangkernel(void){
	while (green != threadIdx.x);
	atomicAdd((int*)&green, 1);
}

Cool, man,

it helps a lot :rolleyes:

Have you tried to invoke P/V, i.e. wait and signal operations?
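
(Not from the original thread - here is a minimal sketch of what a P/V-style lock usually looks like in CUDA, built on atomicCAS/atomicExch; the names sem, P and V are my own. Note that, per Jamie’s analysis above, spinning on the lock from divergent threads of the same warp can still deadlock, so the lock is normally taken by one thread per block:)

__device__ int sem = 0;                    // 0 = free, 1 = taken

__device__ void P(int *s){                 // wait / acquire
	while (atomicCAS(s, 0, 1) != 0);       // spin until we flip 0 -> 1
}

__device__ void V(int *s){                 // signal / release
	__threadfence();                       // make critical-section writes visible first
	atomicExch(s, 0);
}

__global__ void lockedKernel(void){
	if (threadIdx.x == 0) {                // one thread per block to avoid intra-warp deadlock
		P(&sem);
		// critical section
		V(&sem);
	}
}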