Unable to run a simple PTX kernel in MATLAB while trying to implement a mutex

I'm trying to use a mutex in my code with no luck. MATLAB reports that the kernel times out.

kernel:

__global__ void getGrid(int* mask, int* mutex) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    while (idx < 100) {
        while (atomicCAS(mutex, 0, 1) != 0);
        mask[idx] = idx;
        atomicExch(mutex, 0);
        //*mutex = 0;
        //while (atomicCAS(mutex, idx, idx + 1) != idx);
        idx += blockDim.x * gridDim.x;
    }
}

MATLAB script:

mutex = parallel.gpu.GPUArray.zeros('int32');
mask  = parallel.gpu.GPUArray.zeros(int32(256*7*20), int32(1), 'int32');
kern0 = parallel.gpu.CUDAKernel('testkern.ptx', 'testkern.cu');
kern0.ThreadBlockSize = [256 1 1];
kern0.GridSize = [7*20 1];
[mask, mutex] = feval(kern0, mask, mutex);

The commented-out atomicCAS works properly, but the first one doesn't. Any mistakes or… ?

Windows XP 32-bit, driver 280.26, GTX 560, CUDA Toolkit 4.0, i7 940

Copying my reply from another thread:

while( atomicCAS(&mutex, 0, 1) != 0);

is a straight deadlock in CUDA. At most one thread can grab the lock; all the others have to spin in the loop. However, since all threads of a warp execute in lockstep, the thread that owns the lock cannot proceed to release it until the other threads of its warp leave the loop as well, which never happens.
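For what it's worth, a commonly used workaround on hardware of this generation is to put the critical section and the release inside the branch taken by the thread that wins the atomicCAS, so that the lock holder never has to leave a spin loop before it can unlock. The version below is only a sketch of that idea applied to the kernel from the question: the `done` flag is a name I made up, and the pattern relies on the compiler keeping the critical section inside the winning branch, which the programming model does not promise on lockstep hardware. Note also that in this particular kernel every thread writes a different mask[idx], so the lock is not actually needed for correctness.

```cuda
// Sketch only: a warp-friendlier rewrite of the getGrid kernel from the question.
// Assumes *mutex is initialised to 0 (unlocked) before the launch.
__global__ void getGrid(int* mask, int* mutex) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    while (idx < 100) {
        bool done = false;  // hypothetical flag: this thread has finished its critical section
        while (!done) {
            // Try to take the lock once per pass instead of spinning on it.
            if (atomicCAS(mutex, 0, 1) == 0) {
                mask[idx] = idx;       // critical section
                __threadfence();       // make the write visible before releasing
                atomicExch(mutex, 0);  // release inside the same branch
                done = true;
            }
        }
        idx += blockDim.x * gridDim.x;
    }
}
```

The signature is unchanged, so the MATLAB script from the question should be able to load the recompiled PTX as before. Serialising every thread through a single global lock is slow, so this is worth using only when the critical section genuinely touches shared data.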

I made the same mistake; thanks for the explanation.