CUDA - Make a specific memory access skip the cache

I have a kernel which first reads values from certain memory locations and then writes to those locations. I also have a lock which ensures that, at any point in time, only one thread is writing to any given memory location.

The kernel looks somewhat like this:

__global__ void fun(){    
    if(!lock())return; // if lock fails, return

    if(!checkCondition()){ // memory read
        release_lock();
        return;
    }

    update(); // memory write

    release_lock();
}

The kernel runs as expected when I launch it on a single block. When I launch it on multiple blocks, I get errors. I was able to determine that this happens because the changes made by an update() call are sometimes not reflected in a checkCondition() call that executes after the update() call.

If I disable the L1 cache using the compiler flags -Xptxas -dlcm=cg, the issue disappears, so I inferred that the error arises because threads from the second block read stale values from the L1 cache.
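
For reference, this is how I pass the flag when compiling (the file name is just an example):

nvcc -Xptxas -dlcm=cg -o fun fun.cu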

Disabling the L1 cache makes my program run much slower though, so I am looking for other ways to fix this error.

Is there any way to make sure that either the L1 cache is updated immediately after every relevant write, or that every relevant read bypasses the L1 cache, without completely disabling it?

You can apply the “cg” cache behaviour to an individual read by using the __ldcg() intrinsic, declared as T __ldcg(const T* address);. It performs that particular load through the L2 cache only, bypassing L1, while all your other loads keep their default caching behaviour.
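
For example, here is a minimal sketch (the flags/data pointers and the increment are placeholders, not taken from your code) showing a single load done through __ldcg while everything else is cached normally:

__global__ void fun(int *flags, int *data, int n){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= n) return;

    // This load is issued with the .cg modifier: it is served from L2/global
    // memory, so it cannot return a stale line sitting in this SM's L1 cache.
    int flag = __ldcg(&flags[idx]);

    if(flag != 0){
        data[idx] += 1; // ordinary accesses still use the L1 cache as usual
    }
}

In your kernel, that would mean doing the reads inside checkCondition() through __ldcg, while update() and the rest of the code keep the default caching behaviour.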