Trying to use lock and unlock in CUDA

I tried to use atomicCAS and atomicExch to simulate the lock and unlock functions of traditional concurrent programming, but I ran into a strange problem.
Here is my code.
The lock only works between thread blocks, not between threads within a block: it appears to deadlock when threads in the same block contend for it.

#include <iostream>
#include <cuda_runtime.h>

__global__ void lockAdd(int* val, int* mutex) {
	while (0 != atomicCAS(mutex, 0, 1)) {}  // spin until the lock is acquired
	(*val)++;                               // critical section: every thread adds one to the value
	atomicExch(mutex, 0);                   // release the lock
}

int main() {
	int* mutex;  // all threads share one mutex
	cudaMallocManaged((void**)&mutex, sizeof(int));
	*mutex = 0;
	int* val;    // val is in unified memory
	cudaMallocManaged((void**)&val, sizeof(int));
	*val = 0;
	lockAdd<<<1024, 1024>>>(val, mutex);  // 1024 blocks, 1024 threads per block
	//lockAdd<<<1024, 1>>>(val, mutex);   // with only 1 thread per block, it works perfectly
	cudaDeviceSynchronize();
	std::cout << *val << std::endl;  // the expected output is 1024 * 1024 = 1048576, but instead the kernel seems to deadlock and the driver crashes
	cudaFree(val);
	cudaFree(mutex);
	return 0;
}

Yes, others have run into this.

The best suggestion is to not use a mutex at all. Many algorithms that appear to depend on a mutex can be recast to use a parallel reduction methodology, and for a simple shared counter like this one an atomic increment suffices, as sketched below.
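
For this particular kernel, the mutex can be eliminated entirely; the hardware atomic serializes the increments for you. A minimal sketch (the kernel name is illustrative):

__global__ void atomicAddOne(int* val) {
	atomicAdd(val, 1);  // hardware-serialized increment; no lock, no deadlock
}

Launched with the same <<<1024, 1024>>> configuration, this produces 1048576 without any possibility of deadlock.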

If you must lock, a better approach than thread-level locking is to arrange the locking at the threadblock level, and then negotiate for access within a threadblock using ordinary synchronization means such as __syncthreads():

https://stackoverflow.com/questions/18963293/cuda-atomics-change-flag/18968893#18968893
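
Here is a minimal sketch of that pattern, assuming the same val and mutex as in the question (the kernel name and the shared-memory tally are illustrative): one thread per block contends for the global lock, while the rest of the block cooperates through shared memory and never touches the lock:

__global__ void blockLockAdd(int* val, int* mutex) {
	__shared__ int blockSum;
	if (threadIdx.x == 0) blockSum = 0;
	__syncthreads();
	atomicAdd(&blockSum, 1);                    // tally within the block in shared memory
	__syncthreads();
	if (threadIdx.x == 0) {                     // only one thread per block spins on the global lock
		while (atomicCAS(mutex, 0, 1) != 0) {}
		__threadfence();                        // ensure we read the latest *val after acquiring
		*val += blockSum;                       // critical section: commit the block's tally
		__threadfence();                        // publish the update before releasing
		atomicExch(mutex, 0);                   // release
	}
}

With 1024 blocks of 1024 threads, only 1024 threads ever contend for the lock, and no two threads in the same warp do, which sidesteps the intra-warp deadlock entirely.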

Having threads within a warp negotiate for a lock is particularly challenging because of the GPU's warp-based (SIMT) execution:

https://stackoverflow.com/questions/2021019/implementing-a-critical-section-in-cuda
https://stackoverflow.com/questions/26221782/how-to-implement-critical-section-in-cuda
https://stackoverflow.com/questions/31194291/cuda-mutex-why-deadlock
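
For reference, the deadlock mechanism in the question's kernel: on pre-Volta GPUs a warp executes in lockstep, so once one lane wins the atomicCAS, the scheduler may keep executing the losing lanes' spin loop while the winning lane waits at the reconvergence point and never reaches the unlock. A commonly suggested workaround, sketched below, keeps the acquire, the critical section, and the release inside the same divergent branch. Treat this as fragile rather than guaranteed (it depends on how the compiler structures the divergence), and note that Volta's independent thread scheduling changes the picture:

__global__ void lockAddPerThread(int* val, int* mutex) {
	bool done = false;
	while (!done) {
		if (atomicCAS(mutex, 0, 1) == 0) {  // this lane acquired the lock
			(*val)++;                       // critical section
			__threadfence();                // publish the update before releasing
			atomicExch(mutex, 0);           // release inside the same branch
			done = true;
		}
	}
}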