Why does this kernel hang?

Hi everybody,
can anyone tell me why the computer hangs with this code?
Be careful when testing: I need a hard reset on my GeForce 8400M GS + Ubuntu 8.04.
I compile simply with nvcc -arch=sm_11 green.cu.
What feature have I run into?

Regards,
Sergiy

__device__ int volatile green=0;
__global__ void kernel(void){
	while(green != threadIdx.x);
	//critical section
	++green; // green light for the next thread
}

int main(void){
kernel<<<1,2>>>();
}

Not sure why it hangs, but I am fairly sure the second thread will never execute.

The first thread goes through the code, the global variable “green” becomes one, and then the second thread should execute.

Where is my misunderstanding?

Regards,

 Sergiy

First let me say I’m just a newbie. Now, I see a major problem when all the threads try to access the same variable at once (since it’s volatile). Say you have 1000 threads all trying to read the same variable. There’s no guarantee that thread 0 will even get a chance to read it, because every other thread will try to read it, and once a thread successfully reads the variable it will just try to read it again and again and again… I think I’ve made my point.

Not sure why it hangs, but this seems to fix it:

__device__ int green=0;

__global__ void hangkernel(void){
	while (green != threadIdx.x);
	//critical section
	atomicAdd(&green, 1);
}

It works… But what does it mean?

  1. Volatile doesn’t work - is the read access buffered?

  2. Is the write blocked while a read is active?

  3. Does the atomic have a side effect on warp recalculation?

  4. …?

It’s an architectural question; the code example is derived from a very complex algorithm.

I cannot find any document about the GPU memory organization.

CUDA IS THE MOST CLOSED ARCHITECTURE I HAVE EVER SEEN.

SUITABLE ONLY FOR VECTOR SUM, NOT MORE.

Regards,

Sergiy

One should know what he does and how things work before blaming third-party tools.

The atomic add probably solved the problem because you caused some kind of deadlock/race condition etc…

I come from the online transaction processing field… it’s so common to hear young developers, inexperienced ones, or simply some that love to rant, blame the Oracle Database for every bug they have… come on… you must be more serious.

As another comment, just take a look at the CUDA home page to see success stories in all sorts of fields; surely most of them are more than “ONLY FOR VECTOR SUM, NOT MORE”.

my 1 cent

eyal

For sure, with some patience and by reading the programming guide you can solve many problems and understand problem-specific behaviours, but Goo is right when he describes CUDA as very closed. Sometimes it is really frustrating to lose time on simple problems which could be avoided with better documentation.

Please give an example, because as far as I am concerned, the documentation is quite good.

What is different is that parallel programming is not always easy, and people then quickly blame CUDA for being difficult. And then when you complain about it, it is easier to say that CUDA documentation is bad than to say that you find parallel programming difficult ;) (been there, done that :P)

From your point of view - more than a thousand posts in the CUDA community - these GPU topics are self-explanatory, but ask a new developer with some experience in parallel programming how to copy global device memory to constant memory directly. With better documentation that question could be answered in seconds.
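
(For what it’s worth, here is a minimal sketch of that copy - the names constData and d_src are made up for illustration. cudaMemcpyToSymbol with cudaMemcpyDeviceToDevice copies straight from global device memory into a constant symbol:)

__constant__ float constData[256];

// d_src points to global device memory that already holds 256 floats
void copyGlobalToConstant(const float *d_src){
	// device-to-device copy into the __constant__ symbol, starting at offset 0
	cudaMemcpyToSymbol(constData, d_src, 256 * sizeof(float), 0,
	                   cudaMemcpyDeviceToDevice);
}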

edit:
Well, am I right when I say that nobody has really answered the question asked in this topic?
IMHO the best answer contains words like “probably”…
We are trying to discuss a simple question and should give a simple answer.
Even if you know how the atomic functions work, you do not know what really happens without them.

Please believe me, I want to find an answer, not just to start a flame war.
I ran a lot of tests around initialization and optimization - the same effect.
Changing global memory to shared - the same effect.
The C compiler generates correct volatile instructions in the PTX, but it’s hard to check ptxas, because it knows the machine code while third-party disassemblers actually do not.
I suppose the write to memory is always deferred, and only __syncthreads() commits the changes.
The volatile modifier (which came with 1.1) just doesn’t work.
Is that true?
Nvidia staff, only you may know ;)

Regards,
Sergiy

I have a theory as to why the deadlock happens, and it has to do with divergent warps.

But first, I need to mention that I made a mistake in my previous example with atomicAdd(). While investigating, I discovered that the reason it doesn’t lock up is that I had to make green non-volatile, because atomicAdd() does not accept volatile arguments. Making it non-volatile totally messed up the functionality, so it would not actually protect the critical section at all. Oops!

Ok. Now for the lock-up problem.

Consider:

if (condition) {
	a();
} else {
	b();
}

When branching, the processor has a choice between executing the true branch first, the false branch first, or executing both and time-slicing between the divergent branches. My thinking, based on the behavior, is that it does not time-slice between them but instead executes them sequentially. Looking at the decuda output, it has “join” points which, from the looks of it, cause one branch to end, transferring control to the other branch so it can catch up.

So for example, the assembly code looks like this:

  1. calculate predicate for condition

  2. predicated branch to TrueLabel (begin divergence)

  3. do stuff for b()

  4. goto JoinLabel

  5. TrueLabel:

  6. do stuff for a()

  7. JoinLabel:

  8. join

When executing divergent code, the processor goes through steps 1 and 2, then starts at 5, knowing that it has to come back to 3 for the threads for which the condition was false. It executes 6, 7, and 8, and when it reaches the join, it backtracks to 3, then 4, then 8 again. Since all the threads have now reached 8, execution is no longer divergent and processing continues.

But what if “do stuff for a()” is in an infinite loop waiting for “do stuff for b()”? It’s a deadlock type of situation!

Now consider this:

  1. load “green” from global memory

  2. calculate predicate for green != threadIdx.x

  3. predicated branch to step 1

  4. add 1 to value of green

  5. store to “green” in global memory

  6. join

My theory is that execution goes step 1, 2, 3, and then the “true” fork gets executed, and the “false” fork gets deferred until the true fork finishes (reaches the join). The true fork loops back to 1, and gets stuck in the loop 1, 2, 3 forever. It’s deadlocked waiting for the false fork, and the false fork is waiting for the true fork to reach a join.

Hypothesizing that the true fork gets executed first, I tried this, and it does not hang:

__device__ int volatile green=0;

__global__ void hangkernel(void){
	int executed = 0;
	while (!executed) {
		if (green == threadIdx.x) {
			//critical section
			++green;
			executed = 1;
		}
	}
}
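
(A minimal host-side harness to try this out - my own addition, written against the current runtime API rather than the 2008-era toolkit - just launches the kernel and reads green back with cudaMemcpyFromSymbol:)

#include <cstdio>

int main(void){
	hangkernel<<<1,2>>>();
	cudaDeviceSynchronize();    // returns promptly here; the original kernel never did

	int result = 0;
	// copy the device symbol back; expect 2 once both threads have passed the critical section
	cudaMemcpyFromSymbol(&result, green, sizeof(int));
	printf("green = %d\n", result);
	return 0;
}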

The lesson to learn from this is that mutexes within a warp are very dangerous and can deadlock easily.

I always understood why multi-block barriers are hazardous, and now I understand why tmurray said that mutexes were dangerous too! It is certainly not a good idea to depend on the order of the divergence because it can change easily depending on how the compiler decides to implement the logic. It can lead to silent problems that lurk until mutual exclusion is needed, and then hangs the program with deadlock. Ouch.

Thank you very much, Jamie.

Looks like it’s true.

(It’s a pleasure to see a professional message.)

I have a lot of hanging kernels; all of them use branching.

It’s hard to imagine that it’s always so dangerous.

Deviceemu knows nothing about it.

I’ll keep testing.

There’s no mechanism for dynamic thread invocation, and it’s not so simple to keep a “thread pool” of running threads in hot reserve.

Thanks again.

Regards,

Sergiy

p.s. … the “true” fork should finish in all threads! Oh, God…

p.p.s. Nvidia, this must be documented. At least the “join” mechanism.

Btw, Jamie, why don’t we get a deadlock with the atomic?..

Maybe a deferred memory write can’t be discarded so quickly?

edit: sorry, the following does hang. All right )

__device__ int volatile green = 0;

__global__ void hangkernel(void){
	while (green != threadIdx.x);
	atomicAdd((int*)&green, 1);
}

Cool, man,

it helps a lot :rolleyes:

Have you tried to invoke P/V, i.e. wait and signal operations?
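
(Not from the original thread - here is a minimal sketch of what a P/V-style lock usually looks like in CUDA, built on atomicCAS/atomicExch; the names sem, P and V are my own. Note that, per Jamie’s analysis above, spinning on the lock from divergent threads of the same warp can still deadlock, so the lock is normally taken by one thread per block:)

__device__ int sem = 0;                    // 0 = free, 1 = taken

__device__ void P(int *s){                 // wait / acquire
	while (atomicCAS(s, 0, 1) != 0);       // spin until we flip 0 -> 1
}

__device__ void V(int *s){                 // signal / release
	__threadfence();                       // make critical-section writes visible first
	atomicExch(s, 0);
}

__global__ void lockedKernel(void){
	if (threadIdx.x == 0) {                // one thread per block to avoid intra-warp deadlock
		P(&sem);
		// critical section
		V(&sem);
	}
}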