It doesn’t work that way, because your code is executed by a SIMD (“SIMT”) machine.
The compiler will insert a branch reconvergence point just after the end of the loop, so that the processor has to wait for all threads of the warp to finish executing the loop before proceeding further.
This heuristic minimizes the amount of branch divergence in common code, but causes a deadlock here (S=0; is never executed).
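A minimal sketch of the pattern being described (the name `S` and the release `S = 0;` are taken from the snippets under discussion; the exact lock convention is my assumption):

```cuda
__device__ int S = 0;               // 0 = free, 1 = held (assumed convention)

__global__ void intra_warp_deadlock()
{
    // One lane's atomicCAS succeeds and would enter the critical section,
    // but the compiler places the reconvergence point after the loop, so
    // that lane cannot reach "S = 0;" until its warp mates leave the loop.
    // They never will, because the lock is never released: deadlock.
    while (atomicCAS(&S, 0, 1) != 0)
        ;                           // losing lanes spin here forever
    /* critical section */
    S = 0;                          // never executed
}
```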
I used cuda-gdb to debug the code.
It seems one thread is stuck at `if (atomicCAS(&S, 1, 0))`,
while the other threads are stuck at the end of the while loop, perhaps waiting to reconverge before starting the next iteration.
These things were discussed long back in the NVIDIA forums… We even got working code…
Here is code that demonstrates a spinlock, written by Tim Murray a long time back… But I am not sure whether this copy was modified by me or is a pure copy… It could be an untested version as well…
But it demonstrates how to go about locking in CUDA:
#include <stdio.h>
#include <stdlib.h>

__device__ volatile int lock = -1;     /* -1 = free, otherwise holder's blockIdx.x */
__device__ volatile int counter = 0;

__global__ void spinlol()
{
    __shared__ int intraCTAlock;

    if (!threadIdx.x && !threadIdx.y)
        intraCTAlock = -1;
    __syncthreads();

    /* One thread per CTA contends for the global lock. */
    if (!threadIdx.x && !threadIdx.y)
        while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);
    __syncthreads();

    /* ATTENTION: Changed 12 to threadIdx.x */
    /* Note: all threads of a warp contend here, which is exactly the
       intra-warp contention warned about elsewhere in this thread. */
    while (atomicCAS(&intraCTAlock, -1, threadIdx.x) != -1);
    counter++;
    __threadfence();
    atomicExch(&intraCTAlock, -1);     /* release the per-CTA lock */
    __syncthreads();

    if (!threadIdx.x && !threadIdx.y)
        atomicExch((int*)&lock, -1);   /* release the global lock */
}

int main(int argc, char** argv)
{
    int hostcounter = -1;
    dim3 grid, block;

    if (argc != 3)
    {
        printf("Usage: <executable-name> N_Blocks N_Threads\n");
        return -1;
    }
    grid.x = atoi(argv[1]);
    block.x = atoi(argv[2]);
    printf("Spawning kernel with %d blocks, %d threads_per_block\n", grid.x, block.x);
    spinlol<<<grid, block>>>();
    cudaDeviceSynchronize();
    printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));
    /* Copy by symbol, not by string name (the string form was removed in CUDA 5). */
    cudaMemcpyFromSymbol(&hostcounter, counter, sizeof(int), 0, cudaMemcpyDeviceToHost);
    printf("counter = %d\n", hostcounter);
    return 0;
}
honestly, if you’re trying to do this you’re probably going down the wrong path, but general rules of thumb are
don’t have multiple threads within a warp contending for a lock, that leads to all sorts of confusing issues for most people because inter-warp branches are not the same as intra-warp branches
avoid global memory contention as much as possible (e.g., if you need to have a critical section among all warps in all CTAs, do per-CTA shared memory locks then a global lock)
traditional threading primitives implemented with atomics are a pretty terrible idea; if you can avoid atomics as much as possible (or entirely) you can get a big perf win (there are very interesting ways to do this, and when I say big perf win, I mean on the order of 5-10x)
(“well,” you think, “it sounds like tim is speaking from experience!” oh yes, I am)
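For the first rule of thumb, one common workaround is to elect a single leader lane per warp, so that threads in the same warp never contend with each other. A hedged sketch (the lock convention and all names here are mine, not from the thread):

```cuda
__device__ int gate = 0;                 /* 0 = free, 1 = held (my convention) */
__device__ volatile int total = 0;

__global__ void leader_per_warp()
{
    int lane = threadIdx.x & 31;         /* lane index within the warp */
    if (lane == 0) {                     /* only one thread per warp contends */
        while (atomicCAS(&gate, 0, 1) != 0)
            ;                            /* the other 31 lanes are simply idle
                                            at reconvergence, not spinning, so
                                            there is no intra-warp deadlock */
        total = total + 1;               /* critical section */
        __threadfence();
        atomicExch(&gate, 0);            /* release */
    }
}
```

Only leaders of different warps spin against each other, which sidesteps the intra-warp reconvergence problem described at the top of this thread.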
I knew this could be the wrong path in the first place; I just wanted to check it for myself, because if it worked efficiently, I would definitely gain perf on what I'm trying to accomplish.
Check this out… Biiig thread… A lot of the initial findings were challenged and somewhere in the thread concluded… I don't have time to explore this… If you have, you may check it out… If I remember things, I will post it for you.
The easiest way to do a per warp critical section is to just make all threads in the warp operate on the exact same data - hence, if you wanted only the first element of an array processed by a given warp, have all threads in the warp just process element 0. Since the warp is executed in lockstep, every thread will be computing identical results to if there were just one thread running.
So, to do a warp wide critical section, simply make all threads run through identical code paths with identical data input.
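A sketch of that idea (the names here are mine): every lane loads the same element and runs the same computation, so the warp stays converged and the result is the same as if a single thread had done the work.

```cuda
__global__ void warp_uniform(const int *in, int *out)
{
    /* Every lane in the warp processes element 0 with identical input,
       so all 32 lanes compute the same value in lockstep: no divergence
       and no lock needed for this per-warp "critical section". */
    int v = in[0] * 2 + 1;               /* identical work in every lane */
    if ((threadIdx.x & 31) == 0)         /* one lane per warp publishes it */
        out[threadIdx.x >> 5] = v;
}
```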