atomicCAS for multiple blocks & multiple threads - CUDA 3.2 - Fedora 10

I am trying to use atomicCAS with the CUDA 3.2 toolkit. The code below is a simple implementation where I just increment a variable in the kernel, then copy it back and print it. The code works fine when I have multiple blocks and a single thread within each block. But if I increase the number of threads per block, the program hangs or terminates (like a timeout) with no output.

Can anyone please let me know where I am going wrong? I tested the same code with the CUDA 2.3 toolkit and it worked fine. Is there a different way to use atomicCAS in the CUDA 3.2 toolkit?

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <cuda.h>
#include <time.h>
#include <sys/types.h>
#include <sys/time.h>

#define BLOCKS          256
#define THREADSPERBLOCK 1

__global__ void cuda_kernel(int *lock, int *var)
{
    while (atomicCAS(lock, 0, 1));
    // atomicCAS(lock, 0, 1);

    *var = *var + 1;

    *lock = 0;
}

int main()
{
    int lock = 0;
    int *lock_h = NULL;
    int *lock_d = NULL;

    int var = 0;
    int *var_h = NULL;
    int *var_d = NULL;

    lock_h = &lock;
    var_h = &var;

    // allocate memory for lock variable
    if (cudaErrorMemoryAllocation == cudaMalloc((void **) &lock_d, sizeof(int)))
    {
        printf("Error allocating memory for file buffer in device\n");
        return -1;
    }
    // copy lock to GPU device memory
    cudaMemset(lock_d, '\0', sizeof(int));
    cudaMemcpy(lock_d, lock_h, sizeof(int), cudaMemcpyHostToDevice);

    // allocate memory for variable
    if (cudaErrorMemoryAllocation == cudaMalloc((void **) &var_d, sizeof(int)))
    {
        printf("Error allocating memory for file buffer in device\n");
        return -1;
    }
    // copy variable to GPU device memory
    cudaMemset(var_d, '\0', sizeof(int));
    cudaMemcpy(var_d, var_h, sizeof(int), cudaMemcpyHostToDevice);

    // Run CUDA kernel
    cuda_kernel<<<BLOCKS, THREADSPERBLOCK>>>(lock_d, var_d);

    cudaMemcpy(lock_h, lock_d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(var_h, var_d, sizeof(int), cudaMemcpyDeviceToHost);

    //printf("Lock = %d Lock_h = %d\n", lock, *lock_h);
    printf("var = %d var_h = %d\n", var, *var_h);

    return 0;
}

I think this is a known issue with warp synchronization. There are several forum threads discussing this topic.

You can try the following code:

__global__ void cuda_kernel(int *lock, volatile int *var)
{
    int tid = threadIdx.x;   // tid is thread id, I assume 1-D thread-block here
    if (0 == (tid & 31)) {   // use first thread of a warp to do atomicCAS
        while (atomicCAS(lock, 0, 1));
        *var = *var + 1;     // var is volatile, this is very important because L1-cache is incoherent
        __threadfence();     // make sure write is done in L2-cache
        atomicExch(lock, 0); // *lock=0 is dangerous
    }
}

Thanks for the update. I have been reading the forums but haven't found a solution yet. I tried your code, but it works only for a single thread per block. When I increased it to 100/100/512 it gave me wrong output. Any further help/direction would be appreciated. Thanks.

Here be dragons.

This doesn't work because the order in which divergent branches within a warp execute is undefined, and depending on that execution order is a violation of the programming model. When a warp diverges, you don't suddenly have two warps that can be context-switched arbitrarily. Instead, you still have one warp, and one path of the divergence has to be taken first. Let's consider your code again:

while (atomicCAS(lock, 0, 1)); // this actually means:

divergence:
if (atomicCAS(lock, 0, 1) == 0) { // if we take the lock
    goto other_stuff;             // do something else
}
else {
    goto divergence;              // try to take the lock again
}

other_stuff:
...

Your warp splits into two parts that must execute in a particular order; namely, you always want the part of the warp that reaches other_stuff to run first. The compiler might decide to put that path first, or you might happen to get it first, but the hardware and the compiler may just as well run the part of the warp that takes the goto back to divergence first. Hence, deadlock.

@tmurray: Is there a way I can address this issue? I saw that in one of the threads you had even provided code to implement the lock. I tested that code but couldn't make it work.

This lock functionality is part of a bigger problem I am trying to solve. I am using it to update a hash table when adding or updating a node. The data arrives in a large global-memory buffer, and I need to process it across threads and blocks and then update the final hash table.

My idea is to create wrapper functions for lock/unlock so that I can lock on a variable's address while updating the hash table.
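(For illustration only, not from the original posts: such wrappers could be a thin layer over the same atomicCAS / atomicExch / __threadfence pattern already shown above. The names device_lock, device_unlock and update_kernel are made up for this sketch, and callers still have to keep more than one thread per warp from contending for the lock at a time, e.g. with the lane sweep suggested further down in this thread.)

__device__ void device_lock(int *lock)
{
    // spin until we flip the lock from 0 to 1
    while (atomicCAS(lock, 0, 1) != 0);
}

__device__ void device_unlock(int *lock)
{
    __threadfence();     // make the protected writes visible before releasing
    atomicExch(lock, 0); // release with an atomic, not a plain store
}

// example caller: at most one lane of a warp holds the lock at any time
__global__ void update_kernel(int *lock, volatile int *var)
{
    int laneid = threadIdx.x & 31;
    for (int i = 0; i < 32; i++) {
        if (i == laneid) {       // serialize the lanes of the warp
            device_lock(lock);
            *var = *var + 1;     // stands in for the hash-table insert/update
            device_unlock(lock);
        }
    }
}

For the hash table, the increment would be replaced by the node insert/update, with the lock address chosen per bucket as described above.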

In one of the threads on the forum, another implementation was mentioned:

bool needlock = true;
while (needlock)
{
    if (0 == atomicCAS(lock, 0, 1))
    {
        // I have the lock
        *var = *var + 1;
        // release the lock
        atomicExch(lock, 0);
        needlock = false;
    }
}

This worked fine for 512 blocks and just 10 threads. If I increase threads per block to more than 10, it just hangs.

#define BLOCKS 256
#define THREADSPERBLOCK 100

With these settings, 100 threads per block form 4 warps, and only the first thread of each warp (0 == (tid & 31)) performs the increment, so var_h = 1024 = (256 blocks) x (4 warps per block).

What value of var_h were you expecting?

@LSChien:
For
#define BLOCKS 256
#define THREADSPERBLOCK 100

I do get the value of 1024. But I was expecting 25600, i.e. (256 blocks) x (100 threads/block), with each thread incrementing var once. With this warp issue it means the other 31 threads in each warp didn't do their work every time one thread got the lock. How can I make all threads, i.e. BLOCKS x THREADSPERBLOCK, do the work using atomics even if there is a performance hit?

Use a for-loop to sweep over all the lanes in a warp:

__global__ void cuda_kernel(int *lock, volatile int *var)
{
    int tid = threadIdx.x;   // tid is thread id, I assume 1-D thread-block here
    int laneid = (tid & 31);
    for (int i = 0; i < 32; i++) {
        if (i == laneid) {
            while (atomicCAS(lock, 0, 1));
            *var = *var + 1;
            __threadfence();
            atomicExch(lock, 0);
        }
    }
}

but performance is very bad.
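(Again just a sketch, not from the original posts: with this lane-sweep kernel the host side of the original program should report BLOCKS x THREADSPERBLOCK. Assuming the main() from the question and the CUDA 3.x runtime call cudaThreadSynchronize(), the check around the launch could look like this.)

// after the kernel launch in main()
cuda_kernel<<<BLOCKS, THREADSPERBLOCK>>>(lock_d, var_d);
cudaThreadSynchronize(); // wait for the kernel; the cudaMemcpy below would also synchronize
cudaMemcpy(var_h, var_d, sizeof(int), cudaMemcpyDeviceToHost);
if (*var_h != BLOCKS * THREADSPERBLOCK)
    printf("unexpected count: got %d, expected %d\n", *var_h, BLOCKS * THREADSPERBLOCK);
else
    printf("all %d increments observed\n", *var_h);

The slowdown comes from every one of the BLOCKS x THREADSPERBLOCK increments being serialized through the single lock word.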