Note that the compiler automatically reorders load instructions as far ahead of the computation as possible anyway. So there will be no difference unless there is a dependency that the compiler cannot resolve automatically.
The __syncthreads() instruction in their example acts as a memory barrier, and so qualifies as a dependency the compiler may not resolve automatically.
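To make the barrier case concrete, here is a minimal sketch of hand prefetching across __syncthreads(). It is not the code from the example being discussed: the kernel name tileSumPrefetch, the TILE size, and the single-block launch are all assumptions chosen for illustration. The idea is that the global load for the next tile is issued into a register right after the barrier, before the compute loop, because the compiler cannot hoist a load from the next iteration above the barrier on its own.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 128;  // hypothetical tile size, one element per thread

// Every thread sums all tiles, reading each tile through shared memory.
// Launch with exactly TILE threads in a single block.
__global__ void tileSumPrefetch(const float* in, float* out, int numTiles)
{
    __shared__ float tile[TILE];
    const int t = threadIdx.x;

    float next = in[t];   // fetch this thread's element of the first tile
    float acc = 0.0f;

    for (int k = 0; k < numTiles; ++k) {
        tile[t] = next;   // publish the value that was fetched earlier
        __syncthreads();  // the whole tile is now visible to the block

        // Manual prefetch: start the global load for the following tile
        // now, so its latency can overlap with the compute loop below.
        if (k + 1 < numTiles)
            next = in[(k + 1) * TILE + t];

        for (int j = 0; j < TILE; ++j)
            acc += tile[j];

        __syncthreads();  // all reads finish before the tile is overwritten
    }
    out[t] = acc;
}

int main()
{
    const int numTiles = 8;
    const int n = numTiles * TILE;

    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, TILE * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    tileSumPrefetch<<<1, TILE>>>(in, out, numTiles);
    cudaDeviceSynchronize();

    // The input is all ones, so each thread's sum should equal n.
    bool ok = true;
    for (int t = 0; t < TILE; ++t) ok = ok && (out[t] == float(n));
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(in);
    cudaFree(out);
    return ok ? 0 : 1;
}
```

Whether this beats the straightforward version (loading directly into tile[t] at the top of each iteration) depends on the kernel and the hardware, so it is worth timing both.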
While the compiler will indeed reorder things as best it can, you can still change your code to give the compiler more opportunities.
This usually means some manual unrolling of loops. “Prefetching” is the main idea within the general theme of increasing ILP (instruction-level parallelism).
The results can be quite measurable for some tight compute patterns.
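As a rough illustration of what manual unrolling for ILP can look like, here is a sketch of a per-thread partial sum unrolled by four. The kernel name sumILP4, the unroll factor, and the launch configuration are assumptions for the example, not taken from any particular code. Each iteration issues four independent loads and feeds four separate accumulators, so no load or add has to wait on the previous one.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride partial sum, manually unrolled by 4 with independent
// accumulators so that several loads can be in flight per thread.
__global__ void sumILP4(const float* in, float* partial, int n)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int stride = gridDim.x * blockDim.x;

    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    int i = tid;

    // Main unrolled loop: all four loads are issued before any add.
    for (; i + 3 * stride < n; i += 4 * stride) {
        const float v0 = in[i];
        const float v1 = in[i + stride];
        const float v2 = in[i + 2 * stride];
        const float v3 = in[i + 3 * stride];
        a0 += v0;
        a1 += v1;
        a2 += v2;
        a3 += v3;
    }
    // Remainder loop for elements the unrolled loop did not cover.
    for (; i < n; i += stride)
        a0 += in[i];

    partial[tid] = (a0 + a1) + (a2 + a3);
}

int main()
{
    const int n = 1 << 20;
    const int blocks = 64, threads = 256;  // hypothetical launch configuration
    const int total = blocks * threads;

    std::vector<float> hostIn(n, 1.0f), hostPartial(total);
    float *in = nullptr, *partial = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&partial, total * sizeof(float));
    cudaMemcpy(in, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sumILP4<<<blocks, threads>>>(in, partial, n);
    cudaMemcpy(hostPartial.data(), partial, total * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Finish the reduction on the host; the input is all ones.
    double sum = 0.0;
    for (float p : hostPartial) sum += p;
    const bool ok = (sum == double(n));
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(in);
    cudaFree(partial);
    return ok ? 0 : 1;
}
```

The compiler may already unroll a simple loop like this by itself, so any gain from doing it by hand has to be measured. The extra accumulators also cost registers, which can lower occupancy.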
Indeed. One shouldn’t take quotes out of context. In the original quote, I must have been referring to some specific code or situation that was being described. Prefetching by hand can improve performance, but will not always do so. I use it in a few select kernels in my code; in most kernels it makes no difference at all.