atomicCAS mutex not working on 2080ti?

All I want is to implement a mutex on the GPU. I found that atomicCAS/atomicExch can be used to implement such a mutex.
Everything worked fine on older GPUs, but errors occur when I run my code on a 2080 Ti.

Here is the code I use. It uses the mutex and atomicAdd separately to do the same computation.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#include <cuda.h>
#include <cuda_runtime.h>
#define data_t  int32_t

__global__ void method(
    data_t* data,
    unsigned int* lock
){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int* mutex = &lock[0];

    atomicAdd(&data[1], i);

    bool blocked = true;
    while(blocked){
        if(atomicCAS(mutex, 0u, 1u)==0u){
            data[0] += i;
            data[2] += 1;
            blocked = false;
            atomicExch(mutex, 0u);
        }
    }
}

int main(int argc, char* argv[]){
    for(int i=0;i<argc;++i)
        printf("argv[%d] %s\n", i, argv[i]);
    int len = 69;
    int dev_id = 0;
    if(argc>1){
        len = atoi(argv[1]);
    }
    if(argc>2){
        dev_id = atoi(argv[2]);
    }
    cudaSetDevice(dev_id);
    data_t* hst_a = (data_t*)malloc(4*sizeof(data_t));
    data_t* dev_a;
    unsigned int* hst_lock = (unsigned int*)malloc(4*sizeof(unsigned int));
    unsigned int* dev_lock;
    cudaMalloc(&dev_a, 4*sizeof(data_t));
    cudaMalloc(&dev_lock, 4*sizeof(unsigned int));
    
    memset(hst_a, 0, 4*sizeof(data_t));
    memset(hst_lock, 0u, 4*sizeof(unsigned int));
    cudaMemcpy(dev_a, hst_a, 4*sizeof(data_t), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_lock, hst_lock, 4*sizeof(unsigned int), cudaMemcpyHostToDevice);

    method<<<len, 1>>>(dev_a, dev_lock);
    cudaMemcpy(hst_a, dev_a, 4*sizeof(data_t), cudaMemcpyDeviceToHost);
    for(int i=0;i<4;++i)
        printf("%d, ", hst_a[i]);
    printf("\n");
}

This code works fine on a K40c and a GTX 970. But on the 2080 Ti, atomicAdd works as expected, while the atomicCAS mutex only works with a very small grid size (~70).

Tested on the 2080 Ti with CUDA 10.0 (driver 410), CUDA 10.1 (driver 418), and CUDA 10.2 (driver 440). None of them works.

Could anyone tell me where the bug is (hardware/driver/CUDA/code), and how to fix it?

Wouldn't the memory transactions data[0] += i; and data[2] += 1; require some kind of thread fence?
Otherwise it is not guaranteed that the data written by one thread is visible to all other threads (including threads in other thread blocks).

the related CUDA documentation is here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions

and please have a look at the related thread here https://devtalk.nvidia.com/default/topic/1020870/-errata-for-cuda-by-example-is-atomiccas-safe-to-simulate-lock-even-with-__threadfence-/
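
For illustration, here is a minimal sketch of where such a fence could be placed in the critical section of the kernel above (same variable names; just one possible placement, matching the fix discussed further down in the thread):

bool blocked = true;
while(blocked){
    if(atomicCAS(mutex, 0u, 1u) == 0u){   // acquire the lock
        data[0] += i;                     // plain (non-atomic) stores to global memory
        data[2] += 1;
        __threadfence();                  // make the stores visible device-wide
                                          // before the lock is released
        blocked = false;
        atomicExch(mutex, 0u);            // release the lock
    }
}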

Indeed. Kepler and Maxwell have L1 disabled by default for global loads/stores.

Volta/Turing do not.

A suitably placed __threadfence() will fix the problem, according to my testing.

Note that I still would not recommend this particular approach for negotiating for a mutex/lock within a warp. That method is fraught with peril. Instead, negotiate for mutex/lock at the threadblock level (using one thread per threadblock to negotiate), then use threadblock level mechanisms (e.g. shared memory, __syncthreads(), atomics, etc.) to manage concurrency within a threadblock. The particular code sample here however is only launching one thread per block.
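
To make that concrete, here is a rough, untested sketch of the block-level pattern described above (my own illustration, not code from this thread; block_level_example is a made-up name): one thread per block negotiates for the global mutex, and __syncthreads() keeps the rest of the block in step around the critical section.

__global__ void block_level_example(unsigned int* mutex){
    if(threadIdx.x == 0){
        // only one thread per block spins on the device-wide lock
        while(atomicCAS(mutex, 0u, 1u) != 0u);
    }
    __syncthreads();            // the rest of the block waits until the lock is held

    // ... per-block work on the protected global data goes here ...

    __syncthreads();            // ensure every thread in the block has finished
    if(threadIdx.x == 0){
        __threadfence();        // make the block's global writes visible device-wide
        atomicExch(mutex, 0u);  // release the lock
    }
}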

Thanks to cbuchner1 and Robert_Crovella. The mutex in the sample code is correct; the problem is how I load/store data to global memory. Putting a __threadfence() after the stores fixes the bug. I also find that declaring data as volatile instead of adding the threadfence works too.

AFAIK, volatile disables caching for reads/writes, while a threadfence guarantees that previous writes are visible to subsequent reads. In the buggy sample code, some store operations were cached, and disabling the cache happens to have the same effect as the threadfence. Is my explanation correct?

Besides, I find that Professional CUDA C Programming says ‘on the GPU only memory load operations can be cached; memory store operations cannot be cached’. Are Volta/Turing the first to cache store operations?

The new kernel is listed below. Is this a “better” practice?

template<int thread_num>
__global__ void method(
    volatile data_t* data,
    unsigned int* lock,
    int len
){
    int bid = blockIdx.x * blockDim.x; // base index of this block
    int tid = threadIdx.x;             // thread index within the block
    int i = bid + tid;                 // global index
    unsigned int* mutex = &lock[0];

    __shared__ data_t share_data_0[thread_num], share_data_1[thread_num];

    // each in-range thread does its computation and stores the result in
    // shared memory (no early return, so every thread reaches __syncthreads())
    if(i < len){
        atomicAdd((data_t*)(&data[1]), i);
        share_data_0[tid] = i;
        share_data_1[tid] = 1;
    }

    // make the shared-memory writes visible to the other threads of the block
    __syncthreads();

    if(tid == 0 && bid < len){ // reduction in the first thread of each non-empty block
        for(int j = 1; j < thread_num && j + bid < len; ++j){
            share_data_0[0] += share_data_0[j];
            share_data_1[0] += share_data_1[j];
        }

        // write to global memory from one thread per block
        while(atomicCAS(mutex, 0u, 1u) != 0u);
        data[0] += share_data_0[0];
        data[2] += share_data_1[0];
        // __threadfence();  // if data were not declared volatile, a fence would be
        //                   // needed here so the writes are visible before unlock
        atomicExch(mutex, 0u);
    }
}

kernel launched with:

const int thread_num = 512;
method<thread_num><<<int(len/thread_num)+1, thread_num>>>(dev_a, dev_lock, len);
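
Note: int(len/thread_num)+1 launches one extra, completely empty block whenever len is an exact multiple of thread_num. The usual ceiling-division idiom avoids that; it is only a minor tidy-up here, since the kernel already guards against out-of-range indices:

method<thread_num><<<(len + thread_num - 1)/thread_num, thread_num>>>(dev_a, dev_lock, len);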