mutual exclusion problem using Atomic functions

Hello everyone,

I have an other similar program in CUDA without mutual exclusion which works well, but I would like to see performances by using mutual exclusion with atomic functions since I have a 8600M GT with capability 1.1.

n is the size of my problem and my kernel is called n*(n-1)/2 times. Implementing a mutual exclusion with atomic functions for small values of n works well. But when I try to increase the size of n, the screen flashes for few seconds, stopping execution and sometimes it causes a reboot.

device__ int mutex=1;

__global__ void Calculation(const int* A, const int* B, const int* S, int n, const int old_value, int* best_value )


  int id = blockIdx.x * blockDim.x + threadIdx.x;

if (id >= n*(n-1)/2)




	int first;

	int second;

	int temp;


	first = (n-1) - floorf( ( (sqrtf( 8 * ((n*(n-1)/2) - temp - 1) + 1 +0.1f)) -1 ) / 2  ) - 1;


	second = temp - move_first * (n-1)+ move_first * (move_first + 1)/2 + 1;


	temp = old_value + compute_value(n,A,B,S,first,second); // do some calculation

	// implementing mutual exclusion

	if (temp <= best_move[0]){

	  int waiting=1; 

	  while (waiting) {

		if (atomicCAS(&mutex, 1, 0)) { 


		if (temp < best_move[0]){

		  best_value[0] = temp;

		  best_value[1] = first;

		  best_value[2] = second;



		else if ( temp == best_value[0] ){

		  if ( (first < best_value[1]) || ( (first == best_value[1]) && (second < best_value[2]) ) ) {

			best_value[1] = first;

			best_value[2] = second;






		atomicExch(&mutex, 1);







I don’t understand the problem. I should have missed something about atomic functions. Thank you for helping me.

ahhhhhh no don’t try to make a mutex, it’s always the wrong answer

You’re either A. deadlocking the card and/or B. triggering the watchdog timer. Both of these are bad! B should be handled correctly (but sometimes still breaks right now depending on the driver version, phase of moon, etc), but A is Very Bad right now. Atomics are really slow. Block ordering is undetermined, so if you ever have a situation where all of the blocks on the GPU waiting for some block that may not have executed yet, everything is going to hell.

Basically what I’m trying to impart is that this is a bad idea and even if you think you get it working somebody will try it with a different GPU and it will blow up.

I know block barriers or inter-block communication is bad, because it violates the fundamental block-independence property (among other reasons).

But the more I think about it, mutexes seem to be a valid thing to try to do. There are many applications I can think of where keeping a single global data structure makes more sense than one per block followed by reduction. For example passing a “heap” and doing a poor-man’s malloc (not that that’s ever better than the alternative, but as an example). Or perhaps printing strings to a debug buffer which is copied back to the host after the kernel finishes. It seems to me that a correctly-implemented mutex should be a legitimate thing to want to do. And it doesn’t seem inherently dangerous, or at least no more dangerous (deadlock) than in a traditional multithreading situation.


Thank you for your comments !

I am quiete new to CUDA programming. In fact, I have already implemented a version which do a CPU post treatment and also an other version using reduction for finding minimum values.

But I really want to implement mutex exclusion version for comparing performances with the two versions mentionned above. I don’t know what is the problem of my code. Is the code seems correct ? or is there a better way to implemente the mutex part ?

Thank you.

No, you should listen to tmurray, trying to implement mutexes is a loss of time . Moreover simple Atomic operations often allow to do what you want to implement with a mutex. For example handling of single global data structure like the one you talk about can be implemented with atomic ops without creating a mutex.