After a few tests with nvcc (and also clang), I found a behavior that looks strange, at least to me (discussion can be found here: "__restrict__ seems to be ignored for base pointers in structs"; having base pointers with restrict directly as kernel arguments works as expected).
I simplified the example so that the same code, which uses locally restrict-annotated pointers, compiles to completely different machine code depending on whether it sits directly in the function body or inside a local scope. Without the extra scope, nvcc's optimizer recognizes the redundant load and store and emits 1 load and 1 store, which is what I would expect. With the exact same code inside a local scope, the optimization breaks and I get 2 loads and 2 stores. clang fails to optimize this in both cases.
Here is the boiled-down example:
#define RESTRICT __restrict__
//#define RESTRICT
#define SCOPE  // comment this out to place the pointers directly in the function body

__global__ void square(const float * g_input, float * g_output, int n) {
    int tid = blockIdx.x;
    //if (tid < n)
#ifdef SCOPE
    {
        // restrict-qualified local copies, declared inside an extra block scope
        const float * RESTRICT a_in = g_input;
        float * RESTRICT a_out = g_output;
#else
        // same declarations, but at function scope
        const float * RESTRICT a_in = g_input;
        float * RESTRICT a_out = g_output;
#endif
        float la = a_in[tid];
        a_out[tid] = la;
        la = a_in[tid];      // redundant load if a_in and a_out do not alias
        a_out[tid] = la;     // redundant store
#ifdef SCOPE
    }
#endif
}
If SCOPE is defined, nvcc generates 2 loads and 2 stores. Otherwise, the code is optimized down to 1 load and 1 store.
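For comparison, here is a minimal sketch of the variant mentioned at the top, where the restrict qualifier sits directly on the kernel arguments instead of on local copies. The kernel name `square_restrict_params` and the host helper `run_copy_check` are made up for illustration, and the bounds check is enabled here so the helper can launch safely. The sketch only checks that the result is correct; whether this exact version ends up with 1 load and 1 store still has to be confirmed by looking at the generated code.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant: __restrict__ is applied to the kernel parameters
// themselves, so no restrict-qualified local copies are needed.
__global__ void square_restrict_params(const float * __restrict__ g_input,
                                       float * __restrict__ g_output, int n) {
    int tid = blockIdx.x;
    if (tid < n) {
        float la = g_input[tid];
        g_output[tid] = la;
        la = g_input[tid];      // redundant load if the pointers do not alias
        g_output[tid] = la;     // redundant store
    }
}

// Hypothetical host helper: launches one block per element (the kernel
// indexes by blockIdx.x) and checks that the output equals the input.
bool run_copy_check(int n) {
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    const std::size_t bytes = static_cast<std::size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    square_restrict_params<<<n, 1>>>(d_in, d_out, n);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i]) return false;
    }
    return true;
}
```

To compare against the two cases above, compile only the kernel (for example in Compiler Explorer) and count the global loads and stores in the output; `run_copy_check(8)` can be called from a `main` to confirm the kernel still copies the data correctly.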