Problem with a global mutex

I implemented a simple mutex mechanism:

__device__ void lock(int *mutex)
{
    while (atomicCAS(mutex, 0, 1) != 0);
}

__device__ void unlock(int *mutex)
{
    atomicExch(mutex, 0);
}

And I'm running a simple kernel to test it:

__global__ void kernel(int *mutex, int *acc)
{
    lock(mutex);
    ++(*acc);
    unlock(mutex);
}

By invoking:

int *mutex, *acc, res;
res = -1;
CUDA_CALL(cudaMalloc((void **)&mutex, sizeof(int)));
CUDA_CALL(cudaMemset(mutex, 0, sizeof(int)));
CUDA_CALL(cudaMalloc((void **)&acc, sizeof(int)));
CUDA_CALL(cudaMemset(acc, 0, sizeof(int)));
kernel<<<blocks, threads>>>(mutex, acc);
CUDA_CALL(cudaMemcpy(&res, acc, sizeof(int), cudaMemcpyDeviceToHost));
printf("Result: %d\n", res);

So when I run with:

1. blocks(16,16); threads(1,1);
   - without the mutex: I got ~50 as a result → as expected → OK
   - with the mutex: I got 256 as a result → OK

2. blocks(1,1); threads(16,16);
   - without the mutex: I got 1 as a result → plausible → OK
   - with the mutex: the program gets stuck and I don't know why…

I know there may be some race conditions, but I don't see where.

Also, it works fine when using one thread per block, so theoretically it should work with one block and many threads.

Also, I can't debug it, as no breakpoint is hit in any of the kernels.

Got any ideas/solutions? :)

There is an implicit synchronization point after the “while” loop (similar to __syncthreads()).
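
Concretely, here is roughly what happens inside a single warp with the original lock() (a sketch of the SIMT behaviour, not exact hardware semantics):

// all 32 lanes of a warp enter lock(mutex) together
while (atomicCAS(mutex, 0, 1) != 0);   // exactly one lane wins the CAS and leaves the loop;
                                       // the other 31 lanes keep spinning
// implicit reconvergence point: the winning lane waits here for its 31 siblings,
// but they can never acquire the mutex because the winner still holds it → deadlock
++(*acc);                              // never reached
unlock(mutex);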

Is there a way to avoid that?

No. There is no way to control synchronization points.

Here is your code:

00000000: 2800400080001de4     mov b32 $r0 c0[0x20]
00000008: 1800000004005de2     mov b32 $r1 0x1
00000010: 6000000100000007     joinat 0x50
00000018: 50020800000fdd25     cas b32 $r1 g[$r0] 0 $r1
00000020: 190e0000fc11dc23     set $p0 eq s32 $r1 0
00000028: 40000000800001e7   B $p0 bra 0x48
00000030: 1800000004005de2     mov b32 $r1 0x1
00000038: 50020800000fdd25     cas b32 $r1 g[$r0] 0 $r1
00000040: 1a8e0000fc11dc23     set $p0 ne s32 $r1 0
00000048: 4003ffff800001e7   B $p0 bra 0x28
00000050: 2800400090009df4   B join mov b32 $r2 c0[0x24]
00000058: 8000000000205c85     ld b32 $r1 ca g[$r2]
00000060: 4800c00004105c03     add b32 $r1 $r1 0x1
00000068: 9000000000205c85     st b32 wb g[$r2] $r1
00000070: 507ff800000fdd05     exch b32 0 g[$r0] 0
00000078: 8000000000001de7     exit

The joinat instruction and the join modifier would have to be removed, but I don't know how to do that.

I'm reading CUDA by Example.
On the book's website you can download the source code for the examples in the book.
The lock mechanism and examples are in the appendix_a folder.
They successfully implemented a mutex mechanism with the while loop.
My question is why their solution works and mine doesn't (I even tried the same lock construction as theirs).

They use the lock to synchronize between warps, not between threads. Only one thread per warp is active when lock/unlock is called:

for (int i = 0; i < 32; i++) {
    if ((tid % 32) == i) {
        lock();
        doSomething();
        unlock();
    }
}
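
For reference, a hedged sketch of how that pattern might look as a complete kernel (untested; lock()/unlock() as defined earlier in the thread, and tid computed from the thread index, which the snippet above doesn't show):

__global__ void kernel(int *mutex, int *acc)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Serialize the lanes of each warp: in any given iteration at most one lane
    // per warp is inside the if, so the spin in lock() only ever contends with
    // other warps, never with lanes of its own warp.
    for (int i = 0; i < 32; i++) {
        if ((tid % 32) == i) {
            lock(mutex);
            ++(*acc);
            unlock(mutex);
        }
    }
}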

Mmm… I see now. So my problem is that I block whole warps instead of threads.

So I have to figure out another way to communicate :)

But at least I know I can use semaphores for communication between blocks of threads, and in some cases between warps.

Thanks for helping me understand the problem.

Use the mutex to synchronize warps, then loop over the threads in the warp once the warp has acquired the lock.
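
A rough sketch of that arrangement (untested, assuming pre-Volta lockstep warp execution as in the rest of this thread, and ignoring memory-fence details like everything else here):

__global__ void kernel(int *mutex, int *acc)
{
    // flatten the thread index so this also works for 2D blocks like threads(16,16)
    int t    = threadIdx.x + threadIdx.y * blockDim.x;
    int lane = t % 32;

    // only one lane per warp spins, so there is no intra-warp divergence deadlock
    if (lane == 0)
        while (atomicCAS(mutex, 0, 1) != 0);

    // the warp now holds the lock; let its lanes do their work one at a time
    for (int i = 0; i < 32; i++)
        if (lane == i)
            ++(*acc);

    // the same lane releases the lock on behalf of the warp
    if (lane == 0)
        atomicExch(mutex, 0);
}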

Just reorder your instructions so they don’t spin on the atomic itself. Don’t “keep trying until I have the lock”, but instead “try for the lock. If I got it, do my private stuff, then release it. Else go back and try again.” Notice the subtle difference in WHEN you test for lock attempt failure… but that difference prevents deadlock.

Something like this (untested) code:

bool needlock = true;

while (needlock) {
    if (0 == atomicCAS(mutex, 0, 1)) {
        /* I have the lock */
        doSomething();
        /* release the lock */
        atomicExch(mutex, 0);
        needlock = false;
    }
}

To make a library, don't write a lock() function; instead write an attemptLock() function that returns true if it succeeds, so you can use it easily in constructs like the one above. Test for SUCCESS first, not failure.
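
A minimal sketch of that library shape (untested; attemptLock is just the name suggested above):

__device__ bool attemptLock(int *mutex)
{
    // true if this thread acquired the lock, false otherwise -- never spins
    return atomicCAS(mutex, 0, 1) == 0;
}

__device__ void unlock(int *mutex)
{
    atomicExch(mutex, 0);
}

__global__ void kernel(int *mutex, int *acc)
{
    bool needlock = true;
    while (needlock) {
        if (attemptLock(mutex)) {   // test for success first
            ++(*acc);               // critical section
            unlock(mutex);          // release before leaving the loop
            needlock = false;
        }
    }
}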

This does work :)

Although it doesn't separate the code the way I wanted it to…

Thank you for your help.
