Can volatile just bypass L1 but still use L2?

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
                    volatile float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {

        // Thread 0 of each block stores the partial sum
        // to global memory. The compiler will use
        // a store operation that bypasses the L1 cache
        // since the "result" variable is declared as
        // volatile. This ensures that the threads of
        // the last block will read the correct partial
        // sums computed by all other blocks.
        result[blockIdx.x] = partialSum;

        // ... (the remainder of the Programming Guide
        // example, which uses __threadfence() and
        // atomicInc() to detect the last block, is
        // elided here)
    }
}

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions

Yes, that is what volatile indicates: bypass L1. Specifically:

The compiler is free to optimize reads and writes to global or shared memory (for example, by caching global reads into registers or L1 cache) … These optimizations can be disabled using the volatile keyword…
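As a sketch of the optimization being described (this example is illustrative, not from the thread; the kernel and parameter names are hypothetical): without volatile, the compiler may legally cache a repeatedly-read global value in a register or L1, so a spin-wait might never observe another thread's update.

```cuda
// Illustrative only: a flag polled by thread 0 while some other
// agent (another kernel, or the host via mapped memory) sets it.
// With a plain "int* flag" the compiler may hoist the load out of
// the loop and spin on a stale register copy. Declaring the
// parameter volatile forces a fresh memory load on every iteration;
// that load bypasses L1 but is still served by L2.
__global__ void spin_until_set(volatile int* flag, int* out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        while (*flag == 0) { /* spin: flag is re-read each pass */ }
        *out = 1;  // observed the update; record progress
    }
}
```

Note that volatile alone provides no ordering guarantees between separate locations; the Programming Guide example pairs it with __threadfence() for exactly that reason.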

Nothing on the GPU bypasses L2 if you are requesting something backed by device memory. Therefore volatile does not change any behavior with respect to L2. (Yes, I am ignoring the L2 persistence mechanism for this statement; I would claim that mechanism is not "bypassing L2" but rather changing its caching characteristics, but I don't wish to argue that point.)
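If you want L1-bypassing accesses on individual loads or stores without qualifying a whole pointer as volatile, CUDA also exposes cache-hint intrinsics (__ldcg()/__stcg(), documented in the Programming Guide under load/store functions using cache hints). A hedged sketch, reusing the guide's calculatePartialSum() helper:

```cuda
// Sketch: the same partial-sum store, expressed with an explicit
// cache-operator intrinsic instead of a volatile pointer.
__global__ void sum_cg(const float* array, unsigned int N, float* result)
{
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // __stcg() issues a store with the .cg ("cache global")
        // operator: the line is cached at L2 but not at L1, which
        // matches the effect volatile has on this store.
        __stcg(&result[blockIdx.x], partialSum);
    }
}
```

The corresponding read side would use __ldcg(&result[i]) to load at L2-only granularity. Either way the traffic still goes through L2, consistent with the point above.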

