Bakery algorithm leads to device crash (thread synchronization)

Hi, I implemented the bakery algorithm (see below) to synchronize shared-memory access among threads:

http://nob.cs.ucdavis.edu/classes/ecs150-1…ync-bakery.html

But it crashes the device.

Do you have any suggestions? Thanks!

//tx is the thread id
	s_choosing(tx) = 0;
	s_number(tx) = 0;
	int rId;
	int tmpMax;
	int i;
	...
	if (f(tx))
	{
		// choose a ticket: one greater than the current maximum
		s_choosing(tx) = 1;
		tmpMax = s_number(0);
		for (i = 0; i < blockDim.x; ++i)
		{
			if (tmpMax < s_number(i))
				tmpMax = s_number(i);
		}
		++tmpMax;
		s_number(tx) = tmpMax;
		s_choosing(tx) = 0;
		for (i = 0; i < blockDim.x; ++i)
		{
			// wait while thread i is choosing its ticket
			while (s_choosing(i) == 1);
			// wait while thread i has priority: lower ticket, ties broken by id
			while ((s_number(i) != 0) &&
			       ((s_number(i) < s_number(tx)) ||
			        ((s_number(i) == s_number(tx)) && (i < tx))));
		}
		// enter critical section
		func(tx, s_cnt);
		++s_cnt;
		s_number(tx) = 0;
		// leave critical section
	}

:))

Hi yk_cadcg,
You have only supplied an excerpt. Can you supply a complete test module to allow others to help you?
Paul

You appear to be attempting to implement semaphores or other synchronization methods in shared memory (I assume “s_” means shared), but this won’t work unless only one thread accesses the semaphores at a time (since there are no atomic ops on shared memory), which you don’t seem to be enforcing with if() statements. So my guess is some of those while loops are causing the GPU to hang.

Mark

Thank you very much, but I’m sorry, it’s inconvenient to upload the whole code. I’m trying other ways now.

Yes, my “s_” means shared memory, “d_” means global memory, and tx means thread id. Thanks.

Here is a link to a project implementing a mutex based on Lamport’s bakery algorithm in global memory:

http://24.98.104.125:60000/apache2-default/files/mutex.zip

I only have configurations for EmuRelease and Release for windows.

It works in device emulation mode. I won’t be able to test it on an actual card until Friday evening, so if anyone wants to try it before then, I would appreciate it. If this works, it should be trivial to extend to shared memory.

The relevant section implementing the bakery algorithm is as follows:

struct mutex
{
	BOOL *entering;     // entering[i]: thread i is currently choosing a ticket
	UINT *indexNumber;  // indexNumber[i]: thread i's ticket (0 = not waiting)
	UINT N;             // number of threads
};

HOST DEVICE UINT maxIndex(UINT N, UINT * indexNumber)
{
	// return the largest ticket currently held
	UINT max = 0;
	for(UINT i = 0; i < N; i++)
	{
		if(max < indexNumber[i])
		{
			max = indexNumber[i];
		}
	}
	return max;
}

// lock a mutex
// using Lamport's bakery algorithm
HOST DEVICE UINT lockMutex(UINT threadId, mutex* m)
{
	if((m->entering == 0) || (m->indexNumber == 0) || (m->N == 0))
	{
		return ERROR(MUTEX_NOT_INITIALIZED);
	}
	m->entering[threadId] = true;
	m->indexNumber[threadId] = 1 + maxIndex(m->N, m->indexNumber);
	m->entering[threadId] = false;
	UINT j;
	for(j = 0; j < m->N; j++)  // note: must start at 0, not 1
	{
		// wait while thread j is choosing its ticket
		while(m->entering[j])
		{
			//wait
		}
		// wait while thread j has priority: lower ticket, ties broken by id
		while((m->indexNumber[j] != 0) &&
		      ((m->indexNumber[j] < m->indexNumber[threadId]) ||
		       ((m->indexNumber[j] == m->indexNumber[threadId]) && (j < threadId))))
		{
			//wait
		}
	}
	return SUCCESS;
}

//unlock a mutex
HOST DEVICE UINT unlockMutex(UINT threadId, mutex* m)
{
	if((m->entering == 0) || (m->indexNumber == 0) || (m->N == 0))
	{
		return ERROR(MUTEX_NOT_INITIALIZED);
	}
	m->indexNumber[threadId] = 0;  // release the ticket
	return SUCCESS;
}

My version got correct results in Emu too, but it hangs the system in Device mode ;)

I hope to try yours in a few days.

Although the excellent bakery algorithm doesn’t require atomic operations, it does have some memory-consistency requirements (see the Wikipedia article on memory consistency). I don’t know CUDA’s global-memory consistency model, though.
Paul

IMHO this algorithm will work neither in shared nor in global memory, because neither guarantees coherent write access. Writing to global memory is “fire and forget”. The bakery algorithm below does not need atomic operations, but it assumes coherency between lines 4 and 5. When writing to global memory, however, you cannot be sure that the “true” in line 4 has actually arrived in memory before the thread chooses its number in line 5. Yet this is the crucial part of the locking: the wait in line 8 must prevent entering line 9 too early; see the website for an explanation.

 1 var choosing: shared array[0..n-1] of boolean;
 2     number: shared array[0..n-1] of integer;
          ...
 3 repeat
 4     choosing[i] := true;
 5     number[i] := max(number[0],number[1],...,number[n-1]) + 1;
 6     choosing[i] := false;
 7     for j := 0 to n-1 do begin
 8         while choosing[j] do (* nothing *);
 9         while number[j] <> 0 and
10                (number[j], j) < (number[i], i) do
11             (* nothing *);
12     end;
13     (* critical section *)
14     number[i] := 0;
15     (* remainder section *)
16 until false;

Bakery algorithm from the website mentioned above.
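To make the required ordering concrete, here is a minimal CPU-side sketch (my own C++11 illustration, not CUDA, with hypothetical names) of the same lock using sequentially consistent atomics. The atomic stores provide exactly the line-4-before-line-5 visibility that plain global-memory writes on the device do not:

```cpp
#include <algorithm>
#include <atomic>

// CPU sketch of Lamport's bakery lock. The seq_cst stores guarantee that
// choosing[i] = true (pseudocode line 4) is visible to all threads before
// number[i] is written (line 5) -- the coherency assumption under discussion.
constexpr int kThreads = 4;
std::atomic<bool> choosing[kThreads];  // zero-initialized (static storage)
std::atomic<int>  number[kThreads];

void bakeryLock(int i) {
    choosing[i].store(true);                        // line 4
    int maxTicket = 0;
    for (int j = 0; j < kThreads; ++j)
        maxTicket = std::max(maxTicket, number[j].load());
    number[i].store(maxTicket + 1);                 // line 5
    choosing[i].store(false);                       // line 6
    for (int j = 0; j < kThreads; ++j) {
        while (choosing[j].load()) {}               // line 8
        while (number[j].load() != 0 &&             // lines 9-11
               (number[j].load() < number[i].load() ||
                (number[j].load() == number[i].load() && j < i))) {}
    }
}

void bakeryUnlock(int i) {
    number[i].store(0);                             // line 14
}
```

With relaxed stores instead of seq_cst, the same interleaving bug Peter describes for global memory reappears.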

Peter

Can you not check the value after you have written it, before continuing? As per the changes in lines 5 and 8?

EDIT


I can actually think of a situation where this would not work: if the multiprocessors use buffers for memory writes, then the reads in lines 5 and 8 might check the buffer before it reaches DRAM, in which case the actual operations in lines 4 and 6 might still be reordered. But this should still guarantee coherence for threads on a single multiprocessor, since all of them would read from the buffer rather than from DRAM.

 1 var choosing: shared array[0..n-1] of boolean;
 2     number: shared array[0..n-1] of integer;
          ...
 3 repeat
 4     choosing[i] := true;
 5     while(!choosing[i]);
 6     number[i] := max(number[0],number[1],...,number[n-1]) + 1;
 7     choosing[i] := false;
 8     while(choosing[i]);
 9     for j := 0 to n-1 do begin
10         while choosing[j] do (* nothing *);
11         while number[j] <> 0 and
12                (number[j], j) < (number[i], i) do
13             (* nothing *);
14     end;
15     (* critical section *)
16     number[i] := 0;
17     (* remainder section *)
18 until false;

Yes, I think that is exactly the problem. When I look at the produced .ptx, the compiler always does something like

st.global.f32 [adr], reg;

...

add.f32 value, reg, something;

That is, it keeps the value written to global memory in a register, and when the value is needed again it consults the register rather than reading it back from memory. This is good performance-wise, of course, but I fear the compiler also relies on it for coherence, as global memory reads/writes are not cached.
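If the register caching is indeed the culprit, one conventional C-side workaround (an assumption on my part, not verified against this compiler) is a volatile-qualified pointer, which forbids keeping the value in a register:

```cpp
// Hypothetical sketch: "volatile" tells the compiler that every access to
// *flag in the source must be an actual memory load or store; it may not
// cache the value in a register across loop iterations. Whether nvcc
// honors this for global memory is an assumption here, not verified.
int spinUntilNonzero(volatile int* flag) {
    while (*flag == 0) {
        // each test of the condition re-reads memory
    }
    return *flag;
}
```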

Peter

I don’t want to dampen any adventurous spirits, but it seems to me the goal of implementing critical sections like this goes against the basic ideas of CUDA’s data-parallel programming model.

Rather than trying to share data via critical sections, it’s probably better to use data-parallel algorithms such as reductions, prefix-sums and the like.
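As a flavor of that style, here is a sequential C++ sketch (my own illustration, not from the papers linked below) of the Hillis-Steele inclusive scan. Each pass of the outer loop corresponds to one parallel step on the GPU, in which every thread combines its element with the one `stride` positions away, so log2(n) steps replace the serial dependency chain:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential simulation of a Hillis-Steele inclusive prefix sum. The inner
// loop body is what every thread i (with i >= stride) does simultaneously
// in one step; the copy acts as the double buffer a GPU version would use.
std::vector<int> inclusiveScan(std::vector<int> a) {
    for (std::size_t stride = 1; stride < a.size(); stride *= 2) {
        std::vector<int> next = a;   // double buffer, as on the GPU
        for (std::size_t i = stride; i < a.size(); ++i)
            next[i] = a[i] + a[i - stride];
        a = std::move(next);
    }
    return a;                        // a[i] = in[0] + ... + in[i]
}
```

For example, {1, 2, 3, 4} scans to {1, 3, 6, 10} in two steps.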

If you haven’t programmed in this fashion before, I can recommend some great papers. One of my favorites is this one:
http://www.cs.cmu.edu/afs/cs.cmu.edu/proje…-CS-90-190.html
http://citeseer.ist.psu.edu/blelloch90prefix.html

This is also a good introduction from bygone days:
http://cva.stanford.edu/classes/cs99s/pape…-algorithms.pdf

Even though these papers are getting old, with the advent of GPU Computing and other parallel architectures they are becoming more important than ever.

Mark

Thanks for the links, Mark. I think you are right in believing that the problem many of us are having is actually converting the operations we want to do into data-parallel algorithms. The example I gave for testing Lamport’s bakery algorithm used a prefix sum.

I’ll admit right away that I am not a CS major and have never had any formal education on algorithms or programming in general. So many of these seemingly basic ideas are not immediately obvious to me.

I think that one of the main hurdles that CUDA will have to overcome is to make people realize that you aren’t just providing a C compiler for graphics cards (that is a very popular belief among people that I work with and report to), you are giving us a completely different programming model, where only the syntax is C. I guess time will tell whether or not the general community is married to the sequential programming model used by C, or just its syntax.

prkipfer: Although I think Mark is right about not approaching the problem in the right way, you can always get around compiler optimizations by writing a small section of the assembly by hand, and then linking it in as a precompiled library. I’m not going to mess around with it anymore unless I later realize that it is absolutely necessary to have a mutex, but if someone wanted to, all they would have to do is add a line that does a load from memory.

Inline assembly isn’t supported by the CUDA compiler in the beta version. It will be added in a future release.

As for developer education: yes, we know this will be a challenge, and that’s why we have a small army of developer technology engineers like myself tasked with supporting developers. :)

Also, we are supporting faculty who want to teach parallel programming and algorithms classes using CUDA. There is currently one being taught at U. of Illinois (UIUC) by Wen-Mei Hwu and NVIDIA’s Chief Scientist David Kirk:
https://agora.cs.uiuc.edu/display/graphics/…ing+UIUC+Course

You might find the lecture notes and MP3 lecture recordings helpful:
http://courses.ece.uiuc.edu/ece498/al/

I believe one or more of the people posting on these forums is actually taking the course, too.

Parallelism at multiple levels is the future of computing, whether it’s with GPUs, multi- and many-core processors (heterogeneous and homogeneous), fpgas, clusters, or combinations of all of these.

Mark

I totally agree. Just tried to be helpful. Hey, I didn’t bring up that topic :wave:

@GregD: I tried modifying the generated .ptx file manually (in a different context). However, my success was limited, as the cubin assembler again seems to do some code movement.

For algorithms: I found it quite instructive to study (once more) the old Sedgewick book in the light of CUDA’s programming model.

Peter

Just to second what Mark has said: programming is shifting to parallel programming. This is due to the wall that single-CPU performance has run into (just think how long we’ve been stuck at 3 GHz). The only solution, according to many experts from industry and academia, is to go parallel and put many CPU cores on a chip. In fact, Moore’s law is being restated: the number of cores will double every 18 months or so. The drawback is that parallel programming requires a somewhat different approach to thinking and problem solving, but it is unavoidable.

For more on this, I’d recommend checking out David Patterson’s talk at Stanford (he also gave a similar talk at PARC, video also available):
http://stanford-online.stanford.edu/course…1-ee380-300.asx
and the View from Berkeley whitepaper (not too technical, quite an easy read):
http://www.eecs.berkeley.edu/Pubs/TechRpts…S-2006-183.html

When writing parallel programs, you have to shift away from the sequential-programming idea that a single “authority” (the CPU) oversees how all data is processed. Instead, think about what happens to each data element. Effectively, each element has to “take care” of itself by knowing which operations it must go through, and that gives you your kernel.
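A toy illustration of that shift (my own sketch, not from the talks above): a SAXPY written so that the loop body is exactly the “kernel” one thread would run for its element.

```cpp
#include <cstddef>
#include <vector>

// Per-element thinking: the loop body is the kernel, and the index i plays
// the role of the thread id. No element depends on any other, so every
// iteration could run in parallel -- each y[i] "takes care" of itself.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)  // each i = one GPU thread
        y[i] = a * x[i] + y[i];
}
```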

Paulius