atomic locks

yifli · January 18, 2012, 8:00pm

Hi all,

The following program uses the atomic lock implementation from ‘CUDA by Example’, but running it freezes my machine.

Can someone tell me what’s wrong with my program? Thanks a lot.

Yifei

#include <stdio.h>

__global__ void test()
{
    __shared__ int i, mutex;

    if (threadIdx.x == 0) {
        i = 0;
        mutex = 0;
    }
    __syncthreads();

    while( atomicCAS(&mutex, 0, 1) != 0);
    i++;
    printf("thread %d: %d\n", threadIdx.x, i);
    atomicExch(&mutex,0);
}

tera · January 19, 2012, 12:45am

I’m surprised. Cuda by Example really has code that fails in such an obvious way?

while( atomicCAS(&mutex, 0, 1) != 0);

is a straight deadlock in CUDA. At most one thread can grab the lock, all others have to spin in the loop. However, since all threads of a warp execute in lockstep, the thread that owns the lock cannot proceed to release the lock until all other threads do as well, which never happens.

yifli · January 19, 2012, 2:21am


Thanks for the explanation. The code is from Page 253 of the book.

What’s the correct way to guard a critical section then?

yifli · January 19, 2012, 2:50am

I looked at the book’s code more closely and noticed that the synchronization only happens at block level.

tera · January 19, 2012, 2:55am

Loop over the threads of a warp and have each thread acquire and release the lock within its own iteration, so that only one thread per warp contends for the lock at a time.

If you are sure there is no divergence, you can grab the lock from one thread per warp, loop over all threads of this warp, then release the lock.

Better yet, use some lockless algorithm instead.
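
A minimal sketch of the first suggestion, assuming a single block and a counter kept in shared memory as in the original program. The kernel name `serialized_lock` and the host helper `run_serialized_lock` are made up for illustration. Because only the lane whose turn it is enters the spin loop, no thread ever waits on a lock held by another thread of its own warp:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread increments a shared counter under a lock,
// but the lanes of a warp take turns, so at most one lane per warp spins.
__global__ void serialized_lock(int *result)
{
    __shared__ int counter, mutex;

    if (threadIdx.x == 0) {
        counter = 0;
        mutex = 0;
    }
    __syncthreads();

    volatile int *c = &counter;              // force real shared-memory accesses
    const int lane = threadIdx.x % warpSize;

    for (int turn = 0; turn < warpSize; ++turn) {
        if (lane == turn) {
            while (atomicCAS(&mutex, 0, 1) != 0) { }  // competes only with other warps
            *c = *c + 1;                              // critical section
            __threadfence_block();                    // publish the update before unlocking
            atomicExch(&mutex, 0);                    // release the lock
        }
    }
    __syncthreads();

    if (threadIdx.x == 0) *result = *c;
}

// Hypothetical host helper: launches one block and returns the final counter.
int run_serialized_lock(int threads)
{
    int *d_result = nullptr;
    int h_result = -1;
    cudaMalloc(&d_result, sizeof(int));
    serialized_lock<<<1, threads>>>(d_result);
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}
```

Called from host code, `run_serialized_lock(64)` should return 64, since each of the 64 threads increments the counter exactly once. The price is that the critical section is fully serialized within each warp.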

tera · January 19, 2012, 2:57am

That makes a lot more sense indeed.

Sarnath · January 19, 2012, 1:56pm

“The ORDER of execution of sub-warps after a WARP Divergence is UNDEFINED”

This is what NVIDIA said when this issue first cropped up.

hyqneuron · January 20, 2012, 8:18am

I have examined some simple cases on Fermi. The order is actually pretty well defined: the first BRA always takes immediate effect (whenever there is a BRA, the branch it targets is executed first). The problem with the compiler (EDIT: not the current one, the one from the 4.0 toolkit) is that the first BRA leads to the branch that attempts to lock again, not the branch that unlocks. Inserting an extra BRA fixes this. Of course, that extra BRA may not be desirable in all cases.
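
At the source level, a commonly used restructuring (not the SASS-level change described above) is to put the critical section and the unlock inside the branch taken when the lock is acquired, so the owning thread does not have to leave the spin loop before it can release the lock. The sketch below assumes the same shared-memory counter as the original program, and the names `lock_in_branch` and `run_lock_in_branch` are hypothetical. On older GPUs that execute warps in lockstep, whether this avoids the deadlock still depends on the code the compiler generates, so treat it as an illustration rather than a guarantee:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: lock, critical section and unlock all live in the
// branch that is taken when atomicCAS succeeds.
__global__ void lock_in_branch(int *result)
{
    __shared__ int counter, mutex;

    if (threadIdx.x == 0) {
        counter = 0;
        mutex = 0;
    }
    __syncthreads();

    volatile int *c = &counter;   // force real shared-memory accesses
    bool done = false;

    while (!done) {
        if (atomicCAS(&mutex, 0, 1) == 0) {   // lock acquired
            *c = *c + 1;                      // critical section
            __threadfence_block();            // publish the update before unlocking
            atomicExch(&mutex, 0);            // release inside the same branch
            done = true;
        }
    }
    __syncthreads();

    if (threadIdx.x == 0) *result = *c;
}

// Hypothetical host helper: launches one block and returns the final counter.
int run_lock_in_branch(int threads)
{
    int *d_result = nullptr;
    int h_result = -1;
    cudaMalloc(&d_result, sizeof(int));
    lock_in_branch<<<1, threads>>>(d_result);
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}
```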

local_hero · January 20, 2012, 9:44am

Sorry for the intrusion, but I have a similar problem:

if(massa>M0-del && massa<M0+del){
        while( (atomicCAS(&flag,0,1))!=1 );
        printf("IND=%d\n",Ind);
        Pi0[Ind].x=Candidato.x;
        Pi0[Ind].y=Candidato.y;
        Pi0[Ind].z=Candidato.z;
        Pi0[Ind].Ene=Candidato.Ene;
        Pi0[Ind].g1=i;
        Pi0[Ind].g2=j;
        Ind++;
        atomicExch(&flag,0);
}

In practice, each thread must write to a different position of the array Pi0. Is there a way to do this without freezing the system?

L_F · January 20, 2012, 4:23pm

You can simply use Ind2 = atomicAdd(&Ind, 1) instead of the atomicCAS/atomicExch pair; each thread then gets a unique Ind2 to write to.
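
A minimal sketch of this lock-free approach. The `Candidate` struct, the kernel `collect`, the helper `run_collect`, and the even-index selection test are all stand-ins invented for illustration; the real code would use the poster's own types and mass-window condition. `atomicAdd` returns the value the counter held before the addition, so every selected thread receives a distinct slot:

```cuda
#include <cuda_runtime.h>

// Hypothetical record type standing in for the elements of the Pi0 array.
struct Candidate {
    float x, y, z, Ene;
    int g1, g2;
};

__global__ void collect(Candidate *out, int *count, int n, int capacity)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Placeholder for the mass-window test in the snippet above (assumption).
    bool selected = (tid % 2 == 0);
    if (selected) {
        int slot = atomicAdd(count, 1);   // old value: a unique index per thread
        if (slot < capacity) {            // guard against overflowing the output
            out[slot].x = (float)tid;
            out[slot].y = 0.0f;
            out[slot].z = 0.0f;
            out[slot].Ene = 0.0f;
            out[slot].g1 = tid;
            out[slot].g2 = tid;
        }
    }
}

// Hypothetical host helper: runs the kernel over n threads and returns
// how many slots were reserved.
int run_collect(int n)
{
    Candidate *d_out = nullptr;
    int *d_count = nullptr;
    int h_count = -1;
    cudaMalloc(&d_out, n * sizeof(Candidate));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemset(d_count, 0, sizeof(int));

    const int block = 128;
    collect<<<(n + block - 1) / block, block>>>(d_out, d_count, n, n);

    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_count);
    return h_count;
}
```

With the placeholder test, `run_collect(256)` should return 128. The order in which threads land in the output array is not deterministic, only the uniqueness of each slot is.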

Sarnath · January 21, 2012, 8:43am

Oh, I understand you are too familiar with BRAs. But my point is that “Programmers should not make assumptions about which BRA the hardware will select at run time”. The hardware may change its choice in the future, and you should not speculate on it as a programmer…

hyqneuron · January 21, 2012, 9:28am

I’m sorry if I said something stupid, but let’s not start a war on this. I was only trying to give some information :) You’re totally right that programmers shouldn’t be concerned with this. The compiler guys should have got it right in the first place. However I’m not sure if this can be considered as a bug. Perhaps the compiler guys made a conscious choice for other reasons.

Sarnath · January 24, 2012, 8:49am

Hi Hyg…, I was not really picking a fight either… I was just looking at the other meaning of the 3 letter word… :)

hyqneuron · January 25, 2012, 7:51am

Only got that just now!

cbuchner1 · January 25, 2012, 5:08pm

I like most when BRAs are not used at all.

Sarnath · January 27, 2012, 4:53pm

Even the CPU pipes don’t seem to like them…
