Using __restrict__ in CUDA is not giving any significant performance benefit

I am reading about the __restrict__ keyword for compiler optimization and I am trying the code in

However, the time difference between using __restrict__ and not using it is negligible (nanoseconds). Does this keyword provide any benefit at all? Is there a code example available where I can see a substantial benefit from using __restrict__?

Use of __restrict__ isn't a magic acceleration switch. It merely provides an assertion to the compiler that may allow it to generate faster code.

(1) The compiler may be able to figure out on its own that no aliasing exists, or may be unaffected by it in a particular context, possibly after applying other optimizations such as function inlining. As compiler technology improves, assisting the compiler by using __restrict__ may lose importance.

(2) The kind of optimizations enabled or enhanced by the use of __restrict__ may be irrelevant to the performance of the code (see the roofline model). Use of __restrict__ most frequently allows the compiler to schedule load instructions more freely, which in general helps improve latency tolerance. The primary latency tolerance mechanism of GPUs is massive parallelism combined with zero-overhead thread switching, so any software contribution could be negligible, and this may depend on the specific hardware used.

In my testing __restrict__ can help in kernel (and also host) code but ONLY when applied to bare pointers in function declarations. If you use any kind of fancy C++ wrapper on pointers, __restrict__ is ignored. Thus only use something like:

__global__ void funct(float * __restrict__ a, float * __restrict__ b)

Results may also depend on the compiler version. The classic matrix multiply kernel is faster with __restrict__. There is more information in my book "Programming in Parallel with CUDA - A Practical Guide" (CUP), chapter 2.
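The matrix-multiply access pattern mentioned above can be sketched in plain C (a CUDA kernel would replace the outer two loops with thread indexing; the function name and layout here are illustrative). With `restrict` on all three bare pointers, the compiler knows the stores to `C` cannot modify `A` or `B`, so it can keep the accumulator in a register and hoist or vectorize the loads across iterations:

```c
#include <stddef.h>

/* Row-major n x n matrix multiply: C = A * B.
   restrict asserts C, A, and B do not overlap, so stores to C
   cannot invalidate previously loaded elements of A or B. */
void matmul(float *restrict C, const float *restrict A,
            const float *restrict B, size_t n) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;               /* kept in a register */
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```

Marking the read-only operands `const ... __restrict__` in a CUDA kernel additionally lets the compiler route those loads through the read-only data cache on hardware that supports it.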