Clarification of __restrict__ in CUDA

From a few tests with nvcc (and also clang), I discovered behavior that was, at least to me, strange (the discussion can be found here): __restrict__ seems to be ignored for base pointers inside structs, whereas base pointers annotated with __restrict__ directly as kernel arguments work as expected.
I simplified the example so that the same code (locally __restrict__-annotated pointers, either directly in the function body or inside a nested local scope) produces completely different output. Without the scope, nvcc's optimizer is able to spot the redundant loads (and stores), resulting in 1 load and 1 store, which is what I expected. But with the exact same code inside a local scope, the optimization breaks, resulting in 2 loads and 2 stores. clang fails to optimize this in both cases.
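For reference, a hypothetical reconstruction of the struct case mentioned above (the struct name and layout are my own invention, not from the original discussion): __restrict__ on struct members appears to be ignored, so the compiler cannot assume the two pointers don't alias.

```cuda
// Hypothetical sketch of the struct variant: __restrict__ on members of a
// struct passed by value does not seem to propagate, so the compiler must
// assume b.in and b.out may alias and cannot merge the repeated accesses.
struct Buffers {
    const float * __restrict__ in;
    float * __restrict__ out;
};

__global__ void square_struct(Buffers b, int n) {
    int tid = blockIdx.x;

    float la = b.in[tid];   // first load
    b.out[tid] = la;        // first store
    la = b.in[tid];         // second load is NOT eliminated
    b.out[tid] = la;        // second store is NOT eliminated
}
```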

So here is the cooked down example:

#define RESTRICT __restrict__
//#define RESTRICT

#define SCOPE

__global__ void square(const float * g_input, float * g_output, int n) {
    int tid = blockIdx.x;

    //if (tid < n)
#ifdef SCOPE
    {
        const float * RESTRICT a_in = g_input;
        float * RESTRICT a_out = g_output;
#else
    const float * RESTRICT a_in = g_input;
    float * RESTRICT a_out = g_output;
#endif
        

        float la = a_in[tid];
        a_out[tid] = la;
        la = a_in[tid];
        a_out[tid] = la;

#ifdef SCOPE
    }
#endif
        
}

If SCOPE is defined, this results in 2 loads and 2 stores. Otherwise, it is optimized down to 1 load and 1 store.
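For comparison, here is the variant with __restrict__ placed directly on the kernel parameters, which (as noted at the top) optimizes as expected. This is a sketch of that baseline case rather than code from the original discussion:

```cuda
// Baseline: __restrict__ directly on the kernel arguments. Here both nvcc
// and (in my understanding) clang can assume g_input and g_output do not
// alias, so the repeated load/store pair folds into a single load and store.
__global__ void square_params(const float * __restrict__ g_input,
                              float * __restrict__ g_output, int n) {
    int tid = blockIdx.x;

    float la = g_input[tid];
    g_output[tid] = la;
    la = g_input[tid];      // redundant: can be eliminated
    g_output[tid] = la;     // redundant: can be eliminated
}
```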