Deadlock in this piece of code. I need your help.

:mellow:

Hello, guys

As the title says, I get a deadlock in the following piece of code.

It is an attempt to implement a mutex using atomicExch() on a lock in global memory, to guarantee exclusive access to a piece of code.

What is wrong with it?

#include <stdio.h>

typedef struct Bus
{
	int  counter1;
	int  counter2;
} Bus;

__device__ int g_lock;

//#define SHARE_LOCK
#ifdef SHARE_LOCK
	#define LOCK shareLock
#else
	#define LOCK g_lock
#endif

__global__ void checkKernel(Bus *bus)
{
#ifdef SHARE_LOCK
	__shared__ int shareLock;
#endif
	if (threadIdx.x == 0)
	{
		LOCK = 0;
	}
	__syncthreads();

	while (atomicExch(&LOCK, 1));	// spin until the lock is acquired
	bus->counter1++;		// critical section
	bus->counter2--;
	atomicExch(&LOCK, 0);		// release the lock
}

int main()
{
	Bus bus;
	bus.counter1 = 0;
	bus.counter2 = 0;

	Bus *d_bus;
	cudaMalloc((void **)&d_bus, sizeof(Bus));
	cudaMemcpy(d_bus, &bus, sizeof(Bus), cudaMemcpyHostToDevice);

	cudaError_t err;
	checkKernel<<<1,2>>>(d_bus);
	err = cudaThreadSynchronize();
	if (err != 0)
	{
		printf("Error : %s\n", cudaGetErrorString(cudaGetLastError()));
		return -1;
	}

	cudaMemcpy(&bus, d_bus, sizeof(Bus), cudaMemcpyDeviceToHost);
	printf("counter1 = %d\ncounter2 = %d\n", bus.counter1, bus.counter2);
}

My environment is a GTX 280 + Linux (Ubuntu) + CUDA 2.3, compiled with nvcc -g -keep -arch=sm_13 shareMemLock.cu -o shareMemLock.

I have also included the PTX code; I know a little about PTX, so maybe it will be helpful for you guys.

.global .s32 g_lock;

	.entry _Z11checkKernelP3Bus (
		.param .u32 __cudaparm__Z11checkKernelP3Bus_bus)
	{
	.reg .u32 %rv1;
	.reg .u32 %r<21>;
	.reg .pred %p<5>;
	.loc	2	20	0
$LBB1__Z11checkKernelP3Bus:
	mov.s32 	%r1, 0;
	ld.global.s32 	%r2, [g_lock];
	cvt.u32.u16 	%r3, %tid.x;
	mov.u32 	%r4, 0;
	setp.eq.u32 	%p1, %r3, %r4;
	selp.s32 	%r5, %r1, %r2, %p1;
	st.global.s32 	[g_lock], %r5;
	.loc	2	31	0
	bar.sync 	0;
	.loc	2	33	0
	mov.u32 	%r6, g_lock;
	mov.s32 	%r7, 1;
	atom.global.exch.b32 	%rv1, [%r6], %r7;
	mov.s32 	%r8, %rv1;
	mov.u32 	%r9, 0;
	setp.eq.s32 	%p2, %r8, %r9;
	@%p2 bra 	$Lt_0_2050;
$Lt_0_2562:
 //<loop> Loop body line 33
	mov.u32 	%r10, g_lock;
	mov.s32 	%r11, 1;
	atom.global.exch.b32 	%rv1, [%r10], %r11;
	mov.s32 	%r8, %rv1;
	mov.u32 	%r12, 0;
	setp.ne.s32 	%p3, %r8, %r12;
	@%p3 bra 	$Lt_0_2562;
$Lt_0_2050:
	.loc	2	34	0
	ld.param.u32 	%r13, [__cudaparm__Z11checkKernelP3Bus_bus];
	ld.global.s32 	%r14, [%r13+0];
	add.s32 	%r15, %r14, 1;
	st.global.s32 	[%r13+0], %r15;
	.loc	2	35	0
	ld.global.s32 	%r16, [%r13+4];
	sub.s32 	%r17, %r16, 1;
	st.global.s32 	[%r13+4], %r17;
	.loc	15	123	0
	mov.u32 	%r18, g_lock;
	mov.s32 	%r19, 0;
	atom.global.exch.b32 	%rv1, [%r18], %r19;
	.loc	2	37	0
	exit;
$LDWend__Z11checkKernelP3Bus:
	} // _Z11checkKernelP3Bus

Thanks for your time and reply!

It works if you do

checkKernel<<<2,1>>>(d_bus);

instead of

checkKernel<<<1,2>>>(d_bus);

I don’t know whether that helps you.

You have a race condition. The two threads are trying to acquire the same lock at the same time:

while(atomicExch(&LOCK, 1));

One of them has to wait (suppose it is the second). The second will get the value 1 and will continue to get 1 forever, while the first thread waits for the warp branches to converge again (which will never happen).

I think CUDA was not meant to do locks (although you are not the first one to try), but maybe I’m wrong.

Intra-warp barriers are not supported, but that’s not the problem here. The obvious problem here is that you’re trying to initialize the lock from all blocks simultaneously, and badness will result.

Thanks for your reply, tim.

In the code I use just one block with two threads in it, and the lock is declared in device memory.

So

is not the reason.

Or am I missing something obvious?

Yeah, I know that, but I want a lock between the threads in one thread block: the threads compete for a resource, and the one that gets the lock can access it.

Thanks.

:rolleyes:

atomicExch(&LOCK, 1) only allows one thread at a time to access the address &LOCK.

Assume thread 0 gets the lock first: it sees while(0) and enters the exclusive code segment. Meanwhile, thread 1 keeps getting 1 until thread 0 executes atomicExch(&LOCK, 0), and then thread 1 gets the privilege to access the bus.

Am I right?

I don’t understand what you are saying here.

I don’t know warp branching very well.

Thanks for your reply!!

Oh, I misread the reply above and didn’t notice what you’re actually doing. You are trying to create an intra-warp barrier; this doesn’t work. Branching within a warp is in many ways a compiler concept. It’s not like you suddenly have two groups of threads running concurrently (because that would mean you created two warps where only one existed previously); one of the branches will run until it’s finished before the other starts (I assume). So, if the branch that doesn’t win runs first, deadlock.

If you are right, I think we may solve the problem this way:

int i = 0, flag = 0;
while (1)
{
	while (atomicExch(&LOCK, 1))
	{
		if (i++ >= MAX_THREAD_NUMS)	// give up after a bounded number of tries
		{
			i = 0;
			flag = 1;
			break;
		}
	}
	if (flag == 0)
	{
		bus->counter1++;
		bus->counter2--;
		atomicExch(&LOCK, 0);
		break;
	}
	else
		flag = 0;
}

The problem was that “the second will get the value of 1 and will continue to get 1 forever, while the first thread waits for the warp branches to merge again (and this will never happen)”.

The key to solving the deadlock, in my opinion, is to ensure that no thread can keep doing the atomic operation forever. I make the program break out of the loop if the value of the variable ‘lock’ has not changed to 0 after a long time, so other threads may get the opportunity to do the “atomic lock” operation, and so on.

In the end, the value of the lock can be changed to 0 without deadlock.

Tim, I don’t fully understand the concept of a warp. I know instructions are issued per warp, and that the threads in one thread block are organized into warps. The programming guide says:

“The warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.”

How does this apply to my code?

Can you explain it to me? Thanks very much!

I think the problem is that there are no guarantees about the order in which different branches of divergent code are executed within a warp. It has been demonstrated experimentally that on current hardware with current compilers, pseudo code like this:

if (condition)
	// code path A
else
	// code path B

will actually execute the threads on code path B before the threads on code path A. Normally that isn’t a problem, unless the results of code path B can influence the results of code path A within a running warp, because then the normal “top-to-bottom” way you might read the code no longer applies. So in your code, it might be logical to assume that the atomic-exchange loop splits the warp into a “winner” thread and a group of “loser” threads. But you can’t assume that the “winner” thread will be executed before the “loser” threads, and that is where I think the deadlock arises.
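One pattern that is often suggested for exactly this situation is to keep the critical section inside the branch that wins the lock, so the winner never has to wait for a spinning branch to finish before it can release. This is only a sketch of that idea, reusing the Bus struct from the original post and a hypothetical kernel name; it is untested and assumes a device with shared-memory atomics (sm_12 or later):

```cuda
// Sketch: divergence-safe critical section. Each iteration, every still-waiting
// thread tries the lock ONCE; the thread that wins does its work and releases
// the lock inside the SAME branch, so no thread spins while the winner is
// disabled by divergence.
__global__ void safeKernel(Bus *bus)
{
	__shared__ int lock;
	if (threadIdx.x == 0)
		lock = 0;
	__syncthreads();

	bool done = false;
	while (!done)
	{
		if (atomicExch(&lock, 1) == 0)	// try once; do NOT spin inside a branch
		{
			bus->counter1++;	// critical section
			bus->counter2--;
			atomicExch(&lock, 0);	// release inside the winning branch
			done = true;
		}
	}
}
```

The important difference from the original kernel is that the release happens in the same branch as the acquisition, so the scheduler is never asked to run the losing branch to completion before the winner can make progress.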

Yeah, this explanation sounds reasonable.

But what should I do to implement an intra-warp mutex? Is there no way to do it? I have seen a topic about shared memory locks here: http://forums.nvidia.com/index.php?showtop…&pid=569277 , but there was no good solution there either.

I don’t believe it is possible, but you can write my formal computer science education down on the back of a postage stamp, so what do I know.

Today I will finish this topic.

I used the following code to implement a shared-memory lock:

#include <stdio.h>
#include <stdlib.h>

typedef struct Bus
{
	int  counter1;
	int  counter2;
} Bus;

#define LOCK shareLock

__global__ void checkKernel(Bus *bus)
{
	__shared__ int shareLock;
	bool participated = false;	// whether this thread has done its update yet

	if (threadIdx.x == 0)
	{
		LOCK = 0;
	}
	__syncthreads();

	while (1)
	{
		if (atomicExch(&LOCK, 1) == 0)	// try the lock once per iteration
		{
			bus->counter1++;	// critical section
			bus->counter2++;
			__threadfence();
			participated = true;
			atomicExch(&LOCK, 0);	// release inside the winning branch
		}
		if (participated)
		{
			break;
		}
	}
}

int main(int argc, char **argv)
{
	if (argc < 3)
	{
		printf("Usage format: programName grid block\n");
		exit(1);
	}
	int grid  = atoi(argv[1]);
	int block = atoi(argv[2]);

	Bus bus;
	bus.counter1 = 0;
	bus.counter2 = 0;

	Bus *d_bus;
	cudaMalloc((void **)&d_bus, sizeof(Bus));
	cudaMemcpy(d_bus, &bus, sizeof(Bus), cudaMemcpyHostToDevice);

	cudaError_t err;
	checkKernel<<<grid,block>>>(d_bus);
	err = cudaThreadSynchronize();
	if (err != 0)
	{
		printf("Error : %s\n", cudaGetErrorString(cudaGetLastError()));
		return -1;
	}

	cudaMemcpy(&bus, d_bus, sizeof(Bus), cudaMemcpyDeviceToHost);
	printf("counter1 = %d\ncounter2 = %d\n", bus.counter1, bus.counter2);
	return 0;
}