Deadlock in this piece of code. I need your help.

:mellow:

Hello, guys

As the title says, I get a deadlock in the following piece of code.

It is an attempt to implement a mutex using atomicExch() on a lock in global memory, to guarantee exclusive access to a piece of code.

What is wrong with it?

#include <stdio.h>

typedef struct Bus
{
	int  counter1;
	int  counter2;
} Bus;

__device__ int g_lock;

//#define SHARE_LOCK
#ifdef SHARE_LOCK
	#define LOCK shareLock
#else
	#define LOCK g_lock
#endif

__global__ void checkKernel(Bus *bus)
{
#ifdef SHARE_LOCK
	__shared__ int shareLock;
#endif
	if (threadIdx.x == 0)
	{
		LOCK = 0;
	}
	__syncthreads();

	while (atomicExch(&LOCK, 1));	// spin until the lock is acquired
	bus->counter1++;		// critical section
	bus->counter2--;
	atomicExch(&LOCK, 0);		// release the lock
}

int main()
{
	Bus bus;
	bus.counter1 = 0;
	bus.counter2 = 0;

	Bus *d_bus;
	cudaMalloc((void **)&d_bus, sizeof(Bus));
	cudaMemcpy(d_bus, &bus, sizeof(Bus), cudaMemcpyHostToDevice);

	cudaError_t err;
	checkKernel<<<1,2>>>(d_bus);
	err = cudaThreadSynchronize();
	if (err != 0)
	{
		printf("Error : %s\n", cudaGetErrorString(cudaGetLastError()));
		return -1;
	}

	cudaMemcpy(&bus, d_bus, sizeof(Bus), cudaMemcpyDeviceToHost);
	printf("counter1 = %d\ncounter2 = %d\n", bus.counter1, bus.counter2);
}

My environment is a GTX 280 + Linux (Ubuntu) + CUDA 2.3, compiled with nvcc -g -keep -arch=sm_13 shareMemLock.cu -o shareMemLock.

I have also included the PTX code; I know a little about PTX, so maybe it will be helpful for you guys.

.global .s32 g_lock;

	.entry _Z11checkKernelP3Bus (
		.param .u32 __cudaparm__Z11checkKernelP3Bus_bus)
	{
	.reg .u32 %rv1;
	.reg .u32 %r<21>;
	.reg .pred %p<5>;
	.loc	2	20	0
$LBB1__Z11checkKernelP3Bus:
	mov.s32 	%r1, 0;
	ld.global.s32 	%r2, [g_lock];
	cvt.u32.u16 	%r3, %tid.x;
	mov.u32 	%r4, 0;
	setp.eq.u32 	%p1, %r3, %r4;
	selp.s32 	%r5, %r1, %r2, %p1;
	st.global.s32 	[g_lock], %r5;
	.loc	2	31	0
	bar.sync 	0;
	.loc	2	33	0
	mov.u32 	%r6, g_lock;
	mov.s32 	%r7, 1;
	atom.global.exch.b32 	%rv1, [%r6], %r7;
	mov.s32 	%r8, %rv1;
	mov.u32 	%r9, 0;
	setp.eq.s32 	%p2, %r8, %r9;
	@%p2 bra 	$Lt_0_2050;
$Lt_0_2562:
 //<loop> Loop body line 33
	mov.u32 	%r10, g_lock;
	mov.s32 	%r11, 1;
	atom.global.exch.b32 	%rv1, [%r10], %r11;
	mov.s32 	%r8, %rv1;
	mov.u32 	%r12, 0;
	setp.ne.s32 	%p3, %r8, %r12;
	@%p3 bra 	$Lt_0_2562;
$Lt_0_2050:
	.loc	2	34	0
	ld.param.u32 	%r13, [__cudaparm__Z11checkKernelP3Bus_bus];
	ld.global.s32 	%r14, [%r13+0];
	add.s32 	%r15, %r14, 1;
	st.global.s32 	[%r13+0], %r15;
	.loc	2	35	0
	ld.global.s32 	%r16, [%r13+4];
	sub.s32 	%r17, %r16, 1;
	st.global.s32 	[%r13+4], %r17;
	.loc	15	123	0
	mov.u32 	%r18, g_lock;
	mov.s32 	%r19, 0;
	atom.global.exch.b32 	%rv1, [%r18], %r19;
	.loc	2	37	0
	exit;
$LDWend__Z11checkKernelP3Bus:
	} // _Z11checkKernelP3Bus

Thanks for your time and reply!

It works if you do

checkKernel<<<2,1>>>(d_bus);

instead of

checkKernel<<<1,2>>>(d_bus);

I don’t know whether that helps you.

You have a race condition. The two threads are trying to acquire the same lock at the same time:

while(atomicExch(&LOCK, 1));

One of them has to wait (suppose it is the second). The second will get the value 1 and will continue to get 1 forever, while the first thread waits for the warp branches to converge again (which will never happen).

I think CUDA was not meant to do locks (although you are not the first one to try), but maybe I’m wrong.

Intra-warp barriers are not supported, but that’s not the problem here. The obvious problem here is that you’re trying to initialize the lock from all blocks simultaneously, and badness will result.

Thanks for your reply, tim.

In the code I use just one block with two threads in it, and the lock is declared in device memory.

So

is not the reason.

Or am I missing something obvious?

Yeah, I know that, but I want a lock between the threads in one thread block: the threads compete for a resource, and the one that gets the lock can access it.

Thanks.

:rolleyes:

atomicExch(&LOCK, 1) only allows one thread at a time to access the address &LOCK.

Assume thread 0 gets the lock first: it sees while(0) and enters the exclusive code segment. Meanwhile, thread 1 keeps getting 1 until thread 0 executes atomicExch(&LOCK, 0), and then thread 1 gets the privilege to access the bus.

Am I right?

I don’t understand what you are saying here.

I don’t know warp branching very well.

Thanks for your reply!!

Oh, I misread the reply above and didn’t notice what you’re actually doing. You are trying to create an intra-warp barrier; this doesn’t work. Branching within a warp is in many ways a compiler concept. It’s not like you suddenly have two groups of threads running concurrently (because that would mean you created two warps where only one existed previously); one of the branches will run until it’s finished before the other starts (I assume). So, if the branch that doesn’t win runs first, deadlock.

If you are right, I think we may solve the problem this way:

int i = 0, flag = 0;
while (1)
{
	while (atomicExch(&LOCK, 1))
	{
		if (i++ >= MAX_THREAD_NUMS)	// give up after a bounded number of tries
		{
			i = 0;
			flag = 1;
			break;
		}
	}
	if (flag == 0)
	{
		bus->counter1++;
		bus->counter2--;
		atomicExch(&LOCK, 0);
		break;
	}
	else
		flag = 0;
}

The problem was that “the second will get the value of 1 and will continue to get 1 forever, while the first thread waits for the warp branches to merge again (and this will never happen)”.

The key to solving the deadlock, in my opinion, is to ensure that no thread can keep doing the atomic operation forever. I make the program break out of the loop if the value of the variable ‘lock’ has not changed to 0 after a long time, so other threads may get the opportunity to do the “atomic lock” operation, and so on.

In the end, the value of the lock can be changed to 0 without deadlock.

Tim, I don’t fully understand the concept of a warp. I know instructions are issued per warp, and that the threads in one thread block are organized into warps. The programming guide says:

“The warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.”

How does this apply to my code?

Can you explain it to me? Thanks very much!

I think the problem is that there are no guarantees about the order in which different branches of divergent code are executed within a warp. It has been demonstrated experimentally that on current hardware with current compilers, pseudo code like this:

if (condition)
	// code path A
else
	// code path B

will actually execute the threads on code path B before the threads on code path A. Normally that isn’t a problem, unless the results of code path B can influence the results of code path A within a running warp, because then the normal “top-to-bottom” way you might read the code no longer applies. So in your code, it might be logical to assume that the atomic-exchange loop splits the warp into a “winner” thread and a group of “loser” threads. But you can’t assume that the “winner” thread will be executed before the “loser” threads, and that is where I think the deadlock arises.
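One pattern that is often suggested for exactly this situation is to keep the critical section inside the branch that wins the lock, so the winner never has to wait for a spinning branch to finish before it can release. This is only a sketch of that idea, reusing the Bus struct from the original post and a hypothetical kernel name; it is untested and assumes a device with shared-memory atomics (sm_12 or later):

```cuda
// Sketch: divergence-safe critical section. Each iteration, every still-waiting
// thread tries the lock ONCE; the thread that wins does its work and releases
// the lock inside the SAME branch, so no thread spins while the winner is
// disabled by divergence.
__global__ void safeKernel(Bus *bus)
{
	__shared__ int lock;
	if (threadIdx.x == 0)
		lock = 0;
	__syncthreads();

	bool done = false;
	while (!done)
	{
		if (atomicExch(&lock, 1) == 0)	// try once; do NOT spin inside a branch
		{
			bus->counter1++;	// critical section
			bus->counter2--;
			atomicExch(&lock, 0);	// release inside the winning branch
			done = true;
		}
	}
}
```

The important difference from the original kernel is that the release happens in the same branch as the acquisition, so the scheduler is never asked to run the losing branch to completion before the winner can make progress.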

Yeah, this explanation sounds reasonable.

But what should I do to implement an intra-warp mutex? Is there no way to do it? I have seen a topic about shared memory locks here: http://forums.nvidia.com/index.php?showtop…&pid=569277 , but there was no good solution there either.

I don’t believe it is possible, but you can write my formal computer science education down on the back of a postage stamp, so what do I know.

Today I will finish this topic.

I used the following code to implement a shared-memory lock:

#include <stdio.h>
#include <stdlib.h>

typedef struct Bus
{
	int  counter1;
	int  counter2;
} Bus;

#define LOCK shareLock

__global__ void checkKernel(Bus *bus)
{
	__shared__ int shareLock;
	bool participated = false;	// whether this thread has done its update yet

	if (threadIdx.x == 0)
	{
		LOCK = 0;
	}
	__syncthreads();

	while (1)
	{
		if (atomicExch(&LOCK, 1) == 0)	// try the lock once per iteration
		{
			bus->counter1++;	// critical section
			bus->counter2++;
			__threadfence();
			participated = true;
			atomicExch(&LOCK, 0);	// release inside the winning branch
		}
		if (participated)
		{
			break;
		}
	}
}

int main(int argc, char **argv)
{
	if (argc < 3)
	{
		printf("Usage format: programName grid block\n");
		exit(1);
	}
	int grid  = atoi(argv[1]);
	int block = atoi(argv[2]);

	Bus bus;
	bus.counter1 = 0;
	bus.counter2 = 0;

	Bus *d_bus;
	cudaMalloc((void **)&d_bus, sizeof(Bus));
	cudaMemcpy(d_bus, &bus, sizeof(Bus), cudaMemcpyHostToDevice);

	cudaError_t err;
	checkKernel<<<grid,block>>>(d_bus);
	err = cudaThreadSynchronize();
	if (err != 0)
	{
		printf("Error : %s\n", cudaGetErrorString(cudaGetLastError()));
		return -1;
	}

	cudaMemcpy(&bus, d_bus, sizeof(Bus), cudaMemcpyDeviceToHost);
	printf("counter1 = %d\ncounter2 = %d\n", bus.counter1, bus.counter2);
	return 0;
}