atomic locks

Hi all,

The following program uses the implementation of atomic locks from ‘CUDA by Example’, but running it freezes my machine.

Can someone tell me what’s wrong with my program? Thanks a lot

Yifei

#include <stdio.h>

__global__ void test()
{
    __shared__ int i, mutex;

    if (threadIdx.x == 0) {
        i = 0;
        mutex = 0;
    }
    __syncthreads();

    while( atomicCAS(&mutex, 0, 1) != 0);
    i++;
    printf("thread %d: %d\n", threadIdx.x, i);
    atomicExch(&mutex, 0);
}

I’m surprised. CUDA by Example really has code that fails in such an obvious way?

while( atomicCAS(&mutex, 0, 1) != 0);

is a straight deadlock in CUDA. At most one thread can grab the lock; all others have to spin in the loop. However, since all threads of a warp execute in lockstep, the thread that owns the lock cannot proceed to release it until the other threads of its warp leave the spin loop as well, which never happens.

Thanks for the explanation. The code is from Page 253 of the book.

What’s the correct way to guard a critical section then?

I looked at the book’s code more closely and noticed that the synchronization only happens at block level.

Loop over all threads in a warp, and grab and release the lock inside the loop body (see the sketch below).

If you are sure there is no divergence, you can grab the lock from one thread per warp, loop over all threads of this warp, then release the lock.
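For what it’s worth, here is a rough, untested sketch of the first suggestion applied to the OP’s kernel (per-block __shared__ counter and mutex; it assumes there is no other divergence around the critical section):

#include <stdio.h>

__global__ void test()
{
    __shared__ int i, mutex;

    if (threadIdx.x == 0) {
        i = 0;
        mutex = 0;
    }
    __syncthreads();

    // Serialize the threads of each warp so that the lock owner is never
    // stuck behind spinning threads of its own warp.
    for (int k = 0; k < warpSize; ++k) {
        if (threadIdx.x % warpSize == k) {
            while (atomicCAS(&mutex, 0, 1) != 0);    // acquire
            i++;                                     // critical section
            printf("thread %d: %d\n", threadIdx.x, i);
            atomicExch(&mutex, 0);                   // release
        }
    }
}

Only one thread per warp is ever inside the locking branch, so the owner’s warp-mates sit idle instead of spinning, and the lock can always be released.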

Better yet, use some lockless algorithm instead.

That makes a lot more sense indeed.

“The ORDER of execution of sub-warps after a WARP Divergence is UNDEFINED”

This is what NVIDIA said when this issue first cropped up.

I have examined some simple cases for Fermi. It is actually pretty well defined: the first BRA always takes immediate effect (whenever there is a BRA, the branch indicated by that BRA is executed first). The problem with the current compiler (EDIT: not current, it’s from the 4.0 toolkit) is that the first BRA leads to the branch that attempts to lock again, not the branch that unlocks. Inserting a BRA gets things fixed. Of course, that extra BRA may not be desirable in all cases.
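For reference, this is the sort of source-level pattern that the branch-ordering question applies to. A rough sketch only (reusing the OP’s per-block __shared__ mutex and counter); whether it avoids the hang on pre-Volta hardware depends on exactly which side of the divergence the hardware schedules first, which is the point above:

bool done = false;
while (!done) {
    if (atomicCAS(&mutex, 0, 1) == 0) {   // this thread won the lock
        i++;                              // critical section
        atomicExch(&mutex, 0);            // release
        done = true;
    }
    // threads that lost retry on the next iteration of the while loop
}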

Sorry for the intrusion, but I also have a similar problem:

if (massa > M0 - del && massa < M0 + del) {
    while( (atomicCAS(&flag,0,1))!=1 );
    printf("IND=%d\n", Ind);
    Pi0[Ind].x = Candidato.x;
    Pi0[Ind].y = Candidato.y;
    Pi0[Ind].z = Candidato.z;
    Pi0[Ind].Ene = Candidato.Ene;
    Pi0[Ind].g1 = i;
    Pi0[Ind].g2 = j;
    Ind++;
    atomicExch(&flag, 0);
}

Basically, each thread must write to a different position of the array Pi0. Is there a way to do this without freezing the system?

You could just use Ind2 = atomicAdd(&Ind, 1) instead of atomicCAS with atomicExch; then every thread will write to a unique Ind2.
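To spell that out, here is a minimal sketch of the snippet above rewritten with atomicAdd (assuming Ind and flag are ints in global memory and that Pi0 and Candidato have the fields shown); there is no lock and no spinning, and the race on Ind++ disappears as well:

if (massa > M0 - del && massa < M0 + del) {
    int Ind2 = atomicAdd(&Ind, 1);   // reserve a unique slot in Pi0
    Pi0[Ind2].x   = Candidato.x;
    Pi0[Ind2].y   = Candidato.y;
    Pi0[Ind2].z   = Candidato.z;
    Pi0[Ind2].Ene = Candidato.Ene;
    Pi0[Ind2].g1  = i;
    Pi0[Ind2].g2  = j;
}

atomicAdd returns the old value of Ind, so each thread gets its own index into Pi0.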

Oh, I understand you are all too familiar with BRAs. But my point is that programmers should not make assumptions about which BRA the hardware will select at run time. The hardware may change its choice in the future, and you should not speculate on it as a programmer…

I’m sorry if I said something stupid, but let’s not start a war over this. I was only trying to give some information :) You’re totally right that programmers shouldn’t be concerned with this; the compiler guys should have got it right in the first place. However, I’m not sure this can be considered a bug. Perhaps the compiler guys made a conscious choice for other reasons.

Hi Hyg…, I was not really picking a fight either… I was just looking at the other meaning of the 3 letter word… :)

Only got that just now!

I like most when BRAs are not used at all.

Even CPU pipelines don’t seem to like them…