Atomic operations on GTX 280?

I am using a GTX 280 card, which has compute capability 1.3.
I have also installed the cuda-3.0beta1 version of the CUDA computing environment.

From the CUDA reference manual, I learned that the GTX 280 supports atomic operations.
But when I used atomic operations for an inter-block sync(), the atomics did not actually work.

I am compiling with the -arch sm_11 flag, and compilation is successful.
But the kernel does not launch; it fails with an "unspecified failure" error.

Also, the conditional-compilation block below prints the "not using atomics" warning.
#ifndef CUDA_NO_SM_11_ATOMIC_INTRINSICS
printf("WARNING! Not using atomics!\n");
#endif

This is my inter-block sync() code.

__device__ int g_mutex; // not declaring as volatile as atomicAdd doesn't accept volatile variable
__device__ void __gpu_sync()
{

int tx = threadIdx.x;
int ty = threadIdx.y;
int linBlockInd = blockIdx.y ;
int tid_in_block = ty*blockDim.x+tx;

int cas_variable=1, val=0, to_execute=1, ret=0;
int dummy=1;
__syncthreads();

if (tid_in_block == 0) {
    atomicAdd(&g_mutex, 1);
    __threadfence();
    while(g_mutex != NO_OF_GRID) {
        ret = atomicCAS(&cas_variable, to_execute, val); // only here so the while loop is not compiled away
        if (ret == 0) {
            val+= 1;
        } else {
            val+= 2;
        }
    }
}
__syncthreads();

}

Can someone please help me resolve this issue?

Thanks in advance.

Inter-block synchronisation is evil. I was successful in this domain, but performance was worse than simply calling another kernel.

Anyway, I assume you reset g_mutex to 0 before calling the kernel, you use the barrier only once, and you launch no more blocks than the GPU can handle at once?
Also be aware of a bug/misbehaviour in the driver scheduler: http://forums.nvidia.com/index.php?showtopic=150567 (which, for some reason, the NVIDIA guys seem to ignore :( )
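
For comparison, the "simply calling another kernel" alternative mentioned above can be sketched roughly as follows. This is only an illustration, not the original poster's code: the kernel names phase1 and phase2, the buffer names, and the sizes are all made up. It relies on the fact that two kernels launched into the same stream run in order, so the kernel boundary acts as a grid-wide barrier.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical phase 1: every thread writes one element.
__global__ void phase1(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Hypothetical phase 2: every thread reads an element that was
// (in general) written by a different block during phase 1.
__global__ void phase2(const int *data, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[n - 1 - i];
}

int main()
{
    const int n = 1 << 16;
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    int *data = 0, *out = 0;
    cudaMalloc((void **)&data, n * sizeof(int));
    cudaMalloc((void **)&out, n * sizeof(int));

    // Both launches go to the default stream, so phase2 does not start
    // until every block of phase1 has finished.
    phase1<<<blocks, threads>>>(data, n);
    phase2<<<blocks, threads>>>(data, out, n);

    // out[n-1] should hold data[0]; the blocking copy also waits for the kernels.
    int last = -1;
    cudaMemcpy(&last, out + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%s\n", last == 0 ? "ok" : "mismatch");

    cudaFree(data);
    cudaFree(out);
    return 0;
}
```

Unlike a spin-wait barrier, this version places no limit on the number of blocks, because no block ever waits for another one inside a kernel.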

Thanks for your reply.

Well, it worked when I removed the atomicCAS and used a volatile variable instead.

But, as you mentioned, it hurts performance.
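
The working code was not posted, so the following is only a guess at what a volatile-based version could look like. The names gpu_sync_once and demo are hypothetical, and the sketch assumes the preconditions from the reply above: g_mutex is reset to 0 before the launch, the barrier is used only once per launch, and all blocks are resident at the same time (here approximated by launching one block per multiprocessor, which the programming model does not formally guarantee). It uses cudaDeviceSynchronize; on a toolkit as old as CUDA 3.0 the equivalent call was cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int g_mutex;  // arrival counter; must be 0 before each launch

// Single-use grid barrier (sketch). Deadlocks if some blocks are not resident.
__device__ void gpu_sync_once(int num_blocks)
{
    __threadfence();   // make this thread's earlier global writes visible
    __syncthreads();   // whole block has arrived

    if (threadIdx.x == 0 && threadIdx.y == 0) {
        atomicAdd(&g_mutex, 1);            // atomic on global memory
        volatile int *counter = &g_mutex;  // volatile read forces a reload each iteration
        while (*counter < num_blocks) { }  // spin until every block has arrived
    }
    __syncthreads();   // release the rest of the block
}

// Hypothetical test kernel: each block publishes a value, then reads its neighbour's.
__global__ void demo(int *out, int num_blocks)
{
    if (threadIdx.x == 0) out[blockIdx.x] = blockIdx.x + 1;

    gpu_sync_once(num_blocks);

    if (threadIdx.x == 0) {
        int next = (blockIdx.x + 1) % num_blocks;
        out[num_blocks + blockIdx.x] = ((volatile int *)out)[next];
    }
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int num_blocks = prop.multiProcessorCount;  // assumption: one block per SM stays resident

    int *out = 0;
    cudaMalloc((void **)&out, 2 * num_blocks * sizeof(int));

    int zero = 0;
    cudaMemcpyToSymbol(g_mutex, &zero, sizeof(int));  // reset the counter

    demo<<<num_blocks, 64>>>(out, num_blocks);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Block 0 should have read the value published by block (1 % num_blocks).
    int got = 0;
    cudaMemcpy(&got, out + num_blocks, sizeof(int), cudaMemcpyDeviceToHost);
    int expected = (1 % num_blocks) + 1;
    printf("%s\n", got == expected ? "ok" : "mismatch");

    cudaFree(out);
    return 0;
}
```

Note that the only atomic in this sketch targets g_mutex, which lives in global memory. The atomicCAS in the original code targeted cas_variable, a local variable, which may well have been the cause of the launch failure, though that is not confirmed anywhere in this thread.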