atomicCAS for multiple blocks & multiple threads - CUDA 3.2 - Fedora 10

I am trying to use atomicCAS with the CUDA 3.2 toolkit. The code below is a simple implementation where I just increment a variable in the kernel, then copy it back and print it. The code works fine when I have multiple blocks and a single thread within each block. But if I increase the number of threads per block, the program hangs or terminates (like a timeout) with no output.

Can anyone please let me know where I am going wrong? I tested the same code with the CUDA 2.3 toolkit and it worked fine. Is there a different way to use atomicCAS in the CUDA 3.2 toolkit?

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <cuda.h>
#include <time.h>
#include <sys/types.h>
#include <sys/time.h>

#define BLOCKS          256
#define THREADSPERBLOCK 1

__global__ void cuda_kernel(int *lock, int *var)
{
    while (atomicCAS(lock, 0, 1));
    // atomicCAS(lock, 0, 1);

    *var = *var + 1;

    *lock = 0;
}

int main()
{
    int lock = 0;
    int *lock_h = NULL;
    int *lock_d = NULL;

    int var = 0;
    int *var_h = NULL;
    int *var_d = NULL;

    lock_h = &lock;
    var_h = &var;

    // allocate memory for lock variable
    if (cudaErrorMemoryAllocation == cudaMalloc((void **) &lock_d, sizeof(int)))
    {
        printf("Error allocating memory for file buffer in device\n");
        return -1;
    }
    // copy lock to GPU device memory
    cudaMemset(lock_d, '\0', sizeof(int));
    cudaMemcpy(lock_d, lock_h, sizeof(int), cudaMemcpyHostToDevice);

    // allocate memory for variable
    if (cudaErrorMemoryAllocation == cudaMalloc((void **) &var_d, sizeof(int)))
    {
        printf("Error allocating memory for file buffer in device\n");
        return -1;
    }
    // copy variable to GPU device memory
    cudaMemset(var_d, '\0', sizeof(int));
    cudaMemcpy(var_d, var_h, sizeof(int), cudaMemcpyHostToDevice);

    // Run CUDA kernel
    cuda_kernel<<<BLOCKS, THREADSPERBLOCK>>>(lock_d, var_d);

    cudaMemcpy(lock_h, lock_d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(var_h, var_d, sizeof(int), cudaMemcpyDeviceToHost);

    //printf("Lock = %d Lock_h = %d\n", lock, *lock_h);
    printf("var = %d var_h = %d\n", var, *var_h);

    return 0;
}

I think this is a known issue with warp synchronization. There are several forum threads discussing this topic.

You can try the following code:

__global__ void cuda_kernel(int *lock, volatile int *var)
{
    int tid = threadIdx.x;   // tid is thread id, I assume 1-D thread-block here
    if (0 == (tid & 31)) {   // use first thread of a warp to do atomicCAS
        while (atomicCAS(lock, 0, 1));
        *var = *var + 1;     // var is volatile, this is very important because L1-cache is incoherent
        __threadfence();     // make sure write is done in L2-cache
        atomicExch(lock, 0); // *lock=0 is dangerous
    }
}

Thanks for the update. I have been reading the forums but haven't found a solution yet. I tried your code, but it works only for a single thread per block. When I increased it to 100/100/512 it gave me wrong output. Any further help/direction would be appreciated. Thanks.

Here be dragons.

This doesn't work because the order in which divergent branches within a warp execute is undefined, and depending on that execution order is a violation of the programming model. When a warp diverges, you don't suddenly have two warps that can be context-switched arbitrarily. Instead, you still have one warp, and one path of the divergence has to be taken first. Let's consider your code again:

while (atomicCAS(lock, 0, 1)); // this actually means:

divergence:
if (atomicCAS(lock, 0, 1) == 0) { // if we take the lock
    goto other_stuff;             // do something else
}
else {
    goto divergence;              // try to take the lock again
}

other_stuff:
...

Your warp splits into two parts that must execute in a particular order; namely, you always want the part of the warp that reaches other_stuff to run first. The compiler might decide to put that path first, or you might happen to get it first, but the hardware and the compiler may just as well run the part of the warp that takes the goto back to divergence first. Hence, deadlock.

@tmurray: Is there a way I can address this issue? I saw that in one of the threads you had even provided code to implement the lock. I tested that code but couldn't make it work.

This lock functionality is part of a bigger problem I am trying to solve. I am using it to update a hash table when adding or updating a node. The data arrives in a large global-memory buffer, and I need to process it across threads and blocks and then update the final hash table.

My idea is to create wrapper functions for lock/unlock so that I can lock on a variable's address while updating the hash table.
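(For illustration only, not from the original posts: such wrappers could be a thin layer over the same atomicCAS / atomicExch / __threadfence pattern already shown above. The names device_lock, device_unlock and update_kernel are made up for this sketch, and callers still have to keep more than one thread per warp from contending for the lock at a time, e.g. with the lane sweep suggested further down in this thread.)

__device__ void device_lock(int *lock)
{
    // spin until we flip the lock from 0 to 1
    while (atomicCAS(lock, 0, 1) != 0);
}

__device__ void device_unlock(int *lock)
{
    __threadfence();     // make the protected writes visible before releasing
    atomicExch(lock, 0); // release with an atomic, not a plain store
}

// example caller: at most one lane of a warp holds the lock at any time
__global__ void update_kernel(int *lock, volatile int *var)
{
    int laneid = threadIdx.x & 31;
    for (int i = 0; i < 32; i++) {
        if (i == laneid) {       // serialize the lanes of the warp
            device_lock(lock);
            *var = *var + 1;     // stands in for the hash-table insert/update
            device_unlock(lock);
        }
    }
}

For the hash table, the increment would be replaced by the node insert/update, with the lock address chosen per bucket as described above.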

In one of the threads on the forum, another implementation was mentioned:

bool needlock = true;
while (needlock)
{
    if (0 == atomicCAS(lock, 0, 1))
    {
        // I have the lock
        *var = *var + 1;
        // release the lock
        atomicExch(lock, 0);
        needlock = false;
    }
}

This worked fine for 512 blocks and just 10 threads. If I increase threads per block to more than 10, it just hangs.

#define BLOCKS 256
#define THREADSPERBLOCK 100

With these settings, 100 threads per block form 4 warps, and only the first thread of each warp (0 == (tid & 31)) performs the increment, so var_h = 1024 = (256 blocks) x (4 warps per block).

What value of var_h were you expecting?

@LSChien:
For
#define BLOCKS 256
#define THREADSPERBLOCK 100

I do get the value of 1024. But I was expecting 25600, i.e. (256 blocks) x (100 threads/block), with each thread incrementing var once. With this warp issue it means the other 31 threads in each warp didn't do their work every time one thread got the lock. How can I make all threads, i.e. BLOCKS x THREADSPERBLOCK, do the work using atomics even if there is a performance hit?

Use a for-loop to sweep over all the lanes in a warp:

__global__ void cuda_kernel(int *lock, volatile int *var)
{
    int tid = threadIdx.x;   // tid is thread id, I assume 1-D thread-block here
    int laneid = (tid & 31);
    for (int i = 0; i < 32; i++) {
        if (i == laneid) {
            while (atomicCAS(lock, 0, 1));
            *var = *var + 1;
            __threadfence();
            atomicExch(lock, 0);
        }
    }
}

but performance is very bad.
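(Again just a sketch, not from the original posts: with this lane-sweep kernel the host side of the original program should report BLOCKS x THREADSPERBLOCK. Assuming the main() from the question and the CUDA 3.x runtime call cudaThreadSynchronize(), the check around the launch could look like this.)

// after the kernel launch in main()
cuda_kernel<<<BLOCKS, THREADSPERBLOCK>>>(lock_d, var_d);
cudaThreadSynchronize(); // wait for the kernel; the cudaMemcpy below would also synchronize
cudaMemcpy(var_h, var_d, sizeof(int), cudaMemcpyDeviceToHost);
if (*var_h != BLOCKS * THREADSPERBLOCK)
    printf("unexpected count: got %d, expected %d\n", *var_h, BLOCKS * THREADSPERBLOCK);
else
    printf("all %d increments observed\n", *var_h);

The slowdown comes from every one of the BLOCKS x THREADSPERBLOCK increments being serialized through the single lock word.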