Why does this deadlock? Trying to implement a critical section

I want a critical section, so I use an atomic function:

[codebox]
while (!atomicCAS(&S, 1, 0)) {
    ;   // spin until the CAS succeeds and the lock is acquired
}

// ... critical section ...

S = 0;
[/codebox]

Initially S = 1.

However, it ends in a deadlock.

I thought one thread should get past that while loop, but obviously it doesn't work that way. Why?

Does anyone know what is going on here?

Thanks.

It doesn’t work that way, because your code is executed by a SIMD (“SIMT”) machine.

The compiler will insert a branch reconvergence point just after the end of the loop, so that the processor has to wait for all threads of the warp to finish executing the loop before proceeding further.

This heuristic minimizes the amount of branch divergence in common code, but causes a deadlock here (S=0; is never executed).
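Concretely, annotated against your snippet (a sketch; I'm assuming S is in global memory and that a full warp reaches the loop together):

[codebox]
// One lane wins the CAS (S: 1 -> 0) and exits the loop; the other
// lanes of the warp read S == 0 and keep spinning.
while (!atomicCAS(&S, 1, 0)) {
    ;
}
// <-- implicit reconvergence point: the winning lane waits here for
//     the rest of its warp, but those lanes can only leave the loop
//     after the winner releases the lock further down. Since the
//     release sits below this point, nobody ever reaches it.

// ... critical section and release, never executed ...
[/codebox]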

You should use something like this instead:

[codebox]
while (1)
{
    if (atomicCAS(&S, 1, 0))
    {
        // inside critical section
        // ...
        S = 0;
        break;
    }
}
[/codebox]

Not tested, but now the lock release is inside the loop, so it should not deadlock. At least on the current hardware generation…

(Note to compiler writers: the two code snippets above are not equivalent.)

Thanks for explaining.

I tried your code; it also ends in a deadlock :(

I used cuda-gdb to debug it. It seems one thread is stuck at “if (atomicCAS(&S, 1, 0))”, while the other threads sit at the end of the while loop, presumably waiting to reconverge before starting the next iteration.

No luck…
Is S in global or shared memory?

Can you post a minimal example?
Thanks.

I assume you mean S=1 after the while loop :)

I tested that and it works.

edit: never mind. It does not work.

Yes, S=1 after the while loop.
S is a global variable.

Should I give up on this attempt? Because if it could be made to work efficiently, NVIDIA would probably have provided P/V (semaphore) operations in the first place.

These things were discussed long ago in the NVIDIA forums… We even got working code…

Here is code that demonstrates a spinlock, written by Tim Murray a long time back. I am not sure whether this copy was modified by me or is a pure copy, and it could be an untested version as well…

But it demonstrates how to go about locking in CUDA:

[codebox]
#include <stdio.h>
#include <stdlib.h>

// -1 means "unlocked"
__device__ volatile int lock = -1;
__device__ volatile int counter = 0;

__global__ void spinlol()
{
    __shared__ int intraCTAlock;

    // One thread per block initializes the per-CTA lock.
    if (!threadIdx.x && !threadIdx.y)
        intraCTAlock = -1;
    __syncthreads();

    // One thread per block acquires the global lock on behalf of its CTA.
    if (!threadIdx.x && !threadIdx.y)
        while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);
    __syncthreads();

    /* ATTENTION: Changed 12 to threadIdx.x */
    // Each thread then contends for the per-CTA lock in shared memory.
    while (atomicCAS(&intraCTAlock, -1, threadIdx.x) != -1);
    counter++;
    __threadfence();
    atomicExch(&intraCTAlock, -1);       // release the per-CTA lock

    __syncthreads();
    if (!threadIdx.x && !threadIdx.y)
        atomicExch((int*)&lock, -1);     // release the global lock
}

int main(int argc, char** argv)
{
    int hostcounter = -1;
    dim3 grid, block;

    if (argc != 3)
    {
        printf("Usage: <executable-name> N_Blocks N_Threads\n");
        return -1;
    }

    grid.x  = atoi(argv[1]);
    block.x = atoi(argv[2]);

    printf("Spawning kernel with %d blocks, %d threads_per_blocks\n", grid.x, block.x);
    spinlol<<<grid, block>>>();
    cudaThreadSynchronize();
    printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaMemcpyFromSymbol(&hostcounter, "counter", sizeof(int), 0, cudaMemcpyDeviceToHost);
    printf("counter = %d\n", hostcounter);
}
[/codebox]

Still… it ends in a deadlock :wacko:

Where is the original discussion? I cannot find it.

honestly, if you’re trying to do this you’re probably going down the wrong path, but general rules of thumb are:

  • don’t have multiple threads within a warp contending for a lock; that leads to all sorts of confusing issues for most people, because inter-warp branches are not the same as intra-warp branches
  • avoid global memory contention as much as possible (e.g., if you need a critical section among all warps in all CTAs, do per-CTA shared memory locks and then a global lock)
  • traditional threading primitives implemented with atomics are a pretty terrible idea; if you can avoid atomics as much as possible (or entirely) you can get a big perf win (and there are very interesting ways to do this; when I say big perf win, I mean on the order of 5-10x); see the sketch below

(“well,” you think, “it sounds like tim is speaking from experience!” oh yes, I am)
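To illustrate that last point, here is a minimal, untested sketch (the kernel and variable names are made up): instead of every thread taking a global lock just to bump a shared counter, let each block reduce a partial result in shared memory and issue a single atomicAdd per block:

[codebox]
// Hypothetical example: count how many inputs are > 0, without a lock.
// Instead of lock/counter++/unlock per thread, each block reduces its
// own partial count in shared memory and does one atomicAdd at the end.
__global__ void countPositives(const int *in, int n, int *total)
{
    __shared__ int partial[256];              // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n && in[i] > 0) ? 1 : 0;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // One global atomic per block instead of one lock round-trip per thread.
    if (tid == 0)
        atomicAdd(total, partial[0]);
}
[/codebox]

One global atomic per block instead of one lock acquisition and release per thread is where most of the win comes from; if you don't need the total until the end, you can drop the atomic entirely and finish the sum in a second reduction pass.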

Thanks, Tim.

I knew this could be the wrong path in the first place; I just wanted to check it myself, because if it worked efficiently I would definitely gain perf on what I'm trying to accomplish.

[url=“http://forums.nvidia.com/index.php?showtopic=98444”]The Official NVIDIA Forums | NVIDIA[/url]

Check this out… Biiig thread… A lot of the initial findings were challenged and eventually settled somewhere in the thread. I don't have time to explore it now; if you do, you may want to dig through it. If I remember more, I will post it for you.

The easiest way to do a per-warp critical section is to make all threads in the warp operate on exactly the same data. For example, if you want only the first element of an array processed by a given warp, have every thread in the warp process element 0. Since the warp executes in lockstep, every thread computes exactly the result a single thread would.

So, to do a warp-wide critical section, simply make all threads run through identical code paths with identical data input.
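For example (an untested sketch; the kernel, array names, and the doubling are made up just to show the pattern):

[codebox]
// Hypothetical sketch: each warp performs one update, without any lock,
// by having all 32 lanes compute and write the identical value to the
// identical location.
__global__ void perWarpUpdate(int *warpResults, const int *data)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;

    // Every lane reads the same element and computes the same value.
    int v = data[warpId * 32];          // element 0 of this warp's chunk
    warpResults[warpId] = v * 2;        // same address, same value, from all lanes
}
[/codebox]

Since every lane writes an identical value to an identical address, it doesn't matter which write the hardware keeps; the result is the same as if a single thread had done it.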