Is manual prefetching useless?

In the forum I found a quote from MisterAnderson42 stating that:

http://forums.nvidia.com/index.php?showtopic=92979&pid=522764&start=&st=#entry522764

Does this mean there is no point in doing prefetching by hand[1]?

[1] Prefetching example from the same post:

__global__ void kernel(unsigned *g_mem) {
    unsigned thid = blockIdx.x * blockDim.x + threadIdx.x;

    unsigned x = g_mem[thid]; // "prefetch" x

    // ... do a lot of arithmetic that does not depend on 'x' ...

    // ** x is used here **
}

Note that the compiler automatically reorders load instructions as far ahead of the computation as possible anyway. So there will be no difference unless there is a dependency that the compiler cannot resolve automatically.

Hi Tera,

thanks for your answer. One more thing. How does this fit together with what Kirk and Hwu write in their book about prefetching?

See here https://sites.google.com/site/cudaiap2009/materials-1/cuda-textbook chapter 5 page 14f on prefetching

The __syncthreads() instruction in their example acts as a memory barrier, thus qualifying as a dependency the compiler may not resolve automatically.
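A minimal sketch of the pattern discussed there, assuming a simple tiled kernel: the next tile is prefetched into a register before the barrier, so the global load is in flight while the current tile is processed. All names (TILE, g_in, g_out, the dummy reduction) are illustrative, not taken from the book.

```cuda
#define TILE 256  // assumed block size == tile size

__global__ void prefetch_tiles(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[TILE];
    int tid = threadIdx.x;

    float next = g_in[tid];                  // prefetch the first tile
    for (int base = 0; base < n; base += TILE) {
        tile[tid] = next;                    // deposit prefetched value
        __syncthreads();                     // barrier: tile is complete

        if (base + TILE + tid < n)
            next = g_in[base + TILE + tid];  // issue the next load early

        float acc = 0.f;                     // compute on tile[] while the
        for (int i = 0; i < TILE; ++i)       // load of 'next' is in flight
            acc += tile[i];
        if (tid == 0)
            g_out[base / TILE] = acc;        // dummy per-tile result

        __syncthreads();                     // don't overwrite tile too soon
    }
}
```

Without the manual prefetch, the load for the next tile could only be issued after the barrier, serializing memory latency with the computation.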

While the compiler will indeed reorder things as best it can, you can still change your code to give the compiler more opportunities.
This usually means some manual unrolling of loops. "Prefetching" is one instance of the general theme of increasing ILP (instruction-level parallelism).

The results can be quite measurable for some tight compute patterns.
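To illustrate the unrolling point, here is a hedged sketch (the kernel name and the factor of four are arbitrary): each thread issues several independent loads back to back, so their latencies overlap instead of being paid one after another.

```cuda
__global__ void sum4(const float *g_in, float *g_out, int n)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 < n) {
        float a = g_in[i];      // four independent loads: none of them
        float b = g_in[i + 1];  // depends on another, so the compiler
        float c = g_in[i + 2];  // and hardware can keep all four in
        float d = g_in[i + 3];  // flight at once (ILP)
        g_out[i / 4] = a + b + c + d;
    }
}
```

A loop of four dependent iterations would expose less of this parallelism; the manual unroll makes the independence explicit.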

Here’s Vasily’s excellent (really excellent!) analysis and presentation from GTC 2010.

Indeed. One shouldn't take quotes out of context. In the original quote, I must have been referring to some specific code or situation that was being described. Prefetching by hand can, but will not necessarily always, improve performance. I use it in a few select kernels in my code - in most kernels it doesn't make a difference at all.

Thanks, that was the answer I was looking for. Cheers.

Thanks for the pdf. It looks really quite extensive.