I am trying to use atomicCAS on the CUDA 3.2 toolkit. The code below is a simple implementation where I just increment a variable in the kernel, then copy it back and print it. The code works fine when I have multiple blocks and a single thread within each block. If I increase the number of threads per block, the program hangs or terminates (like a timeout) with no output.
Can anyone tell me where I am going wrong? I tested the same code on the CUDA 2.3 toolkit and it worked fine. Is there a different way to use it on the CUDA 3.2 toolkit?
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <cuda.h>
#include <time.h>
#include <sys/types.h>
#include <sys/time.h>

#define BLOCKS 256
#define THREADSPERBLOCK 1

__global__ void cuda_kernel(int *lock, int *var)
{
    while (atomicCAS(lock, 0, 1));   // spin until this thread takes the lock
    // atomicCAS(lock, 0, 1);
    *var = *var + 1;                 // critical section
    *lock = 0;                       // release the lock
}
int main()
{
    int lock = 0;
    int *lock_h = NULL;
    int *lock_d = NULL;
    int var = 0;
    int *var_h = NULL;
    int *var_d = NULL;

    lock_h = &lock;
    var_h = &var;

    // allocate device memory for the lock variable
    if (cudaSuccess != cudaMalloc((void **) &lock_d, sizeof(int)))
    {
        printf("Error allocating device memory for lock\n");
        return -1;
    }
    // copy the lock to the device
    cudaMemset(lock_d, 0, sizeof(int));
    cudaMemcpy(lock_d, lock_h, sizeof(int), cudaMemcpyHostToDevice);

    // allocate device memory for the counter variable
    if (cudaSuccess != cudaMalloc((void **) &var_d, sizeof(int)))
    {
        printf("Error allocating device memory for var\n");
        return -1;
    }
    // copy the counter to the device
    cudaMemset(var_d, 0, sizeof(int));
    cudaMemcpy(var_d, var_h, sizeof(int), cudaMemcpyHostToDevice);

    // run the kernel
    cuda_kernel<<<BLOCKS, THREADSPERBLOCK>>>(lock_d, var_d);

    cudaMemcpy(lock_h, lock_d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(var_h, var_d, sizeof(int), cudaMemcpyDeviceToHost);

    // printf("Lock = %d Lock_h = %d\n", lock, *lock_h);
    printf("var = %d var_h = %d\n", var, *var_h);
    return 0;
}
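One way to confirm whether the silent termination is actually a kernel failure (for example a display-watchdog timeout on a spinning kernel) is to check the runtime error status around the launch. A minimal sketch, using the CUDA 3.2-era `cudaThreadSynchronize` call:

```cuda
// After the kernel launch, check for both launch-time and run-time errors.
// A watchdog timeout on a display GPU would show up here instead of
// silently producing no output.
cuda_kernel<<<BLOCKS, THREADSPERBLOCK>>>(lock_d, var_d);

cudaError_t err = cudaGetLastError();   // configuration/launch errors
if (err != cudaSuccess) {
    printf("launch error: %s\n", cudaGetErrorString(err));
    return -1;
}
err = cudaThreadSynchronize();          // wait for the kernel to finish
if (err != cudaSuccess) {
    printf("kernel error: %s\n", cudaGetErrorString(err));
    return -1;
}
```

The `cudaMemcpy` calls would also return the sticky error, but checking immediately after the launch makes the failure point unambiguous.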
I think this is a known issue with warp synchronization; there are several forum threads discussing this topic. You can try the following code:
__global__ void cuda_kernel(int *lock, volatile int *var)
{
    int tid = threadIdx.x;    // thread id; I assume a 1-D thread block here
    if (0 == (tid & 31)) {    // use the first thread of each warp to do the atomicCAS
        while (atomicCAS(lock, 0, 1));
        *var = *var + 1;      // var is volatile; this is very important because the L1 cache is incoherent
        __threadfence();      // make sure the write is visible in the L2 cache
        atomicExch(lock, 0);  // *lock = 0 is dangerous
    }
}
Thanks for the update. I have been reading the forums but haven't found a solution yet. I tried your code, but it only works for a single thread per block. When I increased it to 100/100/512, it gave me the wrong output. Any further help/direction would be appreciated. Thanks.
This doesn’t work because the order in which divergent branches within a warp execute is undefined, and depending on that order is a violation of the programming model. When a warp diverges, you don’t suddenly have two warps that you can context-switch arbitrarily. Instead, you have one warp, and one path of the divergence has to be taken first. Let’s consider your code again:
while (atomicCAS(lock, 0, 1));      // this actually means:

divergence:
if (atomicCAS(lock, 0, 1) == 0) {   // if we take the lock
    goto other_stuff;               // do something else
}
else {
    goto divergence;                // try to take the lock again
}
other_stuff:
...
Your warp splits into two parts that must execute in a particular order; namely, you always want to run the part of the warp that hits other_stuff. The compiler might decide to put that first, or maybe you happen to get that first, but maybe the hardware and the compiler happen to run the part of the warp that takes the goto to divergence first. Hence, deadlock.
@tmurray: Is there a way I can address this issue? In one of the threads you even provided code to implement a lock. I tested that code but I couldn't make it work.
This lock functionality is part of a bigger problem I am trying to solve. I am using it to update a hash table when adding or updating a node. The data arrives in a large global-memory buffer, and I need to process it across threads and blocks and then update the final hash table.
My idea is to create wrapper functions for lock/unlock so that I can lock a variable's address while updating the hash table.
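As a sketch of that wrapper idea (the function names here are hypothetical, not from any library), the key point from this thread is that the critical section must live in the *same* branch as the successful CAS, so the winning lane makes progress before the losing lanes retry:

```cuda
// Hypothetical device-side lock wrappers; *lock must be initialized to 0.
// The caller must keep the critical section in the same branch as a
// successful try_lock(), per the divergence discussion in this thread.
__device__ bool try_lock(int *lock)
{
    return atomicCAS(lock, 0, 1) == 0;   // true if we took the lock
}

__device__ void unlock(int *lock)
{
    __threadfence();       // make the protected writes visible first
    atomicExch(lock, 0);   // atomic release, not a plain store
}

// Example use: each thread increments *var exactly once under the lock.
__global__ void locked_increment(int *lock, volatile int *var)
{
    bool done = false;
    while (!done) {
        if (try_lock(lock)) {
            *var = *var + 1;
            unlock(lock);
            done = true;
        }
    }
}
```

Note that even this acquire-and-work-in-one-branch shape is fragile under intra-warp contention on this hardware, as the results reported below show, so serializing the lanes of a warp (as in the last kernel in this thread) may still be necessary.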
In one of the threads on the forum, another implementation was mentioned:
bool needlock = true;
while (needlock)
{
    if (0 == atomicCAS(lock, 0, 1))
    {
        // I have the lock
        *var = *var + 1;
        // release the lock
        atomicExch(lock, 0);
        needlock = false;
    }
}
This worked fine for 512 blocks and just 10 threads per block. If I increase the threads per block beyond 10, it just hangs.
I do get the value of 1024, but I was expecting 25600, i.e. (256 blocks) × (100 threads/block), with each thread incrementing the value of var once. With this warp issue, it means 31 threads didn't do their work every time some thread got a lock. How can I make all BLOCKS × THREADSPERBLOCK threads do their work using atomics, even if there is a performance hit?
__global__ void cuda_kernel(int *lock, volatile int *var)
{
    int tid = threadIdx.x;            // thread id; I assume a 1-D thread block here
    int laneid = (tid & 31);          // lane index within the warp
    for (int i = 0; i < 32; i++) {    // serialize the lanes of each warp
        if (i == laneid) {            // only one lane competes for the lock at a time
            while (atomicCAS(lock, 0, 1));
            *var = *var + 1;
            __threadfence();
            atomicExch(lock, 0);
        }
    }
}