Problem with a global mutex

I implemented a simple mutex mechanism:

__device__ void lock(int *mutex)
{
	while (atomicCAS(mutex, 0, 1) != 0);
}

__device__ void unlock(int *mutex)
{
	atomicExch(mutex, 0);
}

And I'm running a simple kernel to test it:

__global__ void kernel(int *mutex, int *acc)
{
	lock(mutex);
	++(*acc);
	unlock(mutex);
}

By invoking:

int *mutex, *acc, res;

	res = -1;
	CUDA_CALL(cudaMalloc((void **)&mutex, sizeof(int)));
	CUDA_CALL(cudaMemset(mutex, 0, sizeof(int)));
	CUDA_CALL(cudaMalloc((void **)&acc, sizeof(int)));
	CUDA_CALL(cudaMemset(acc, 0, sizeof(int)));
	kernel<<<blocks, threads>>>(mutex, acc);
	CUDA_CALL(cudaMemcpy(&res, acc, sizeof(int), cudaMemcpyDeviceToHost));
	printf("Result: %d\n", res);

So when I run with:

1. blocks(16,16); threads(1,1);
   - without the mutex: I got ~50 as a result -> as expected -> OK
   - with the mutex: I got 256 as a result -> OK
2. blocks(1,1); threads(16,16);
   - without the mutex: I got 1 as a result -> plausible -> OK
   - with the mutex: the program gets stuck and I don't know why…

I know there may be some race conditions, but I don't see where.

Also, it works fine when using one thread per block, so theoretically
it should work with one block and many threads.

Also, I can't debug it, as no breakpoint is hit in any of the kernels.

Got any ideas/solutions ? :)


There is an implicit synchronization point after the "while" loop (similar to __syncthreads()).

Is there a way to avoid that?

No. There is no way to control synchronization points.

Here is your code:

00000000: 2800400080001de4     mov b32 $r0 c0[0x20]
00000008: 1800000004005de2     mov b32 $r1 0x1
00000010: 6000000100000007     joinat 0x50
00000018: 50020800000fdd25     cas b32 $r1 g[$r0] 0 $r1
00000020: 190e0000fc11dc23     set $p0 eq s32 $r1 0
00000028: 40000000800001e7   B $p0 bra 0x48
00000030: 1800000004005de2     mov b32 $r1 0x1
00000038: 50020800000fdd25     cas b32 $r1 g[$r0] 0 $r1
00000040: 1a8e0000fc11dc23     set $p0 ne s32 $r1 0
00000048: 4003ffff800001e7   B $p0 bra 0x28
00000050: 2800400090009df4   B join mov b32 $r2 c0[0x24]
00000058: 8000000000205c85     ld b32 $r1 ca g[$r2]
00000060: 4800c00004105c03     add b32 $r1 $r1 0x1
00000068: 9000000000205c85     st b32 wb g[$r2] $r1
00000070: 507ff800000fdd05     exch b32 0 g[$r0] 0
00000078: 8000000000001de7     exit

The joinat instruction and the join modifier would have to be removed, but I don't know how to do that.


I'm reading CUDA by Example.
On the book's website you can download the source code for the examples in the book.
The lock mechanism and examples are in the appendix_a folder.
They successfully implemented a mutex mechanism with the while loop.
My question is why their solution works and mine does not (I even tried the same lock construction as theirs).


They use the lock to synchronize between warps, not between threads. Only one thread per warp is active when lock()/unlock() is called.

for (int i = 0; i < 32; i++) {
	if ((tid % 32) == i) {
		lock();
		doSomething();
		unlock();
	}
}

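Put together, a complete kernel along these lines might look like the following. This is only a sketch of the pattern being described, reusing the lock()/unlock() functions from the first post and assuming a 1-D thread indexing scheme (the original test used 2-D launch dimensions):

```cuda
// Sketch: serialize the threads of each warp so that exactly one
// thread at a time enters the lock/unlock pair. Because the spinning
// thread is the only active thread of its warp, the spin loop cannot
// diverge against a warp-mate that holds the lock.
__global__ void kernel(int *mutex, int *acc)
{
	int tid = threadIdx.x + blockIdx.x * blockDim.x;  // assumes 1-D launch

	for (int i = 0; i < 32; i++) {
		if ((tid % 32) == i) {   // one thread of the warp active per iteration
			lock(mutex);
			++(*acc);            // critical section
			unlock(mutex);
		}
	}
}
```

The other 31 threads of the warp simply predicate off during each iteration, so the warp as a whole makes progress through the loop.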

Mmm… I see now. So my problem is that I block whole warps instead of threads.

So I have to figure out another way to communicate :)

But at least I know I can use semaphores for communication between blocks of threads,
and in some cases between warps.

Thanks for helping me understand the problem.


Use the mutex to synchronize warps, then loop over the threads in the warp once the warp has acquired the lock.

Just reorder your instructions so they don’t spin on the atomic itself. Don’t “keep trying until I have the lock”, but instead “try for the lock. If I got it, do my private stuff, then release it. Else go back and try again.” Notice the subtle difference in WHEN you test for lock attempt failure… but that difference prevents deadlock.

Something like this (untested) code.

bool needlock = true;

while (needlock) {
	if (0 == atomicCAS(mutex, 0, 1)) {
		/* I have the lock */
		doSomething();
		/* release the lock */
		atomicExch(mutex, 0);
		needlock = false;
	}
}

To make a library, don’t make a lock() function, instead make an attemptLock() function which returns a boolean true if it succeeds, allowing you to use it easily in constructs like the above. Test for SUCCESS first, not failure.
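As a sketch of that library interface (the name attemptLock() comes from the suggestion above; the kernel body is only an illustration):

```cuda
// Sketch: attemptLock() reports success instead of spinning, so the
// caller tests for SUCCESS and keeps the critical section inside the
// same branch. Acquire and release happen in one divergent region,
// which is what avoids the intra-warp deadlock.
__device__ bool attemptLock(int *mutex)
{
	return atomicCAS(mutex, 0, 1) == 0;   // true: we now hold the lock
}

__device__ void unlock(int *mutex)
{
	atomicExch(mutex, 0);
}

__global__ void kernel(int *mutex, int *acc)
{
	bool needlock = true;
	while (needlock) {
		if (attemptLock(mutex)) {
			++(*acc);          // critical section
			unlock(mutex);     // release before leaving the branch
			needlock = false;
		}
	}
}
```

Threads that fail the attempt fall through, reconverge at the bottom of the while loop, and retry, so no thread spins forever waiting on a warp-mate that can never run.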


This does work :)

Although it doesn't separate the code the way I wanted to…

Thank you for your help.
