Asking for a code example of prefetching

Hi everyone,

As far as I know, there is a performance tuning technique called prefetching. Professor Hwu of UIUC mentioned it in his lecture and provided pseudocode for matrix multiplication, but I have no idea how to implement it in my program. Can anybody tell me more about it? A code example would be best, as I want to implement it. :)

Thanks in advance

The slide and pseudocode are as follows:

[codebox]Prefetching: one could double buffer the computation, getting a better instruction mix within each thread.
This is classic software pipelining in ILP compilers.

Load next tile from global memory
Loop {
    Deposit current tile to shared memory
    syncthreads()
    Load next tile from global memory
    Compute current subblock
    syncthreads()
}[/codebox]

Has nobody used it?

For now, it seems that this is the only way left to speed up my program, so it is important to me. Any ideas about how to issue the load of the next data right after loading the current data are welcome.

Thanks.

What more do you need than the example you already gave?

If you want a real code example where prefetching is used, HOOMD’s LJ force sum kernel prefetches indices out of an array (cur_neigh / next_neigh are used for the prefetching). See the code here: http://trac2.assembla.com/hoomd/browser/tr…a/LJForceGPU.cu . It is real application code and is probably a little dense to understand, though. The example you posted is a nice, simple piece of pseudocode that explains the concept.

See the matrix-matrix multiply code at http://forums.nvidia.com/index.php?showtop…mp;#entry314014 for an example of using prefetching (sgemmNT kernel)

Real code, I mean. Pseudocode is not enough for me, as I have never used it before. Maybe I didn’t express what I wanted very well…

Anyway, thanks to Mr. Anderson and vvolkov, you are very kind. :thumbup:
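
For reference, the pseudocode from the slide could be turned into a CUDA kernel roughly as follows. This is only a minimal sketch, not the code from the lecture, HOOMD, or the sgemmNT kernel. It assumes square row-major matrices whose size n is a multiple of the tile width, and the names matmulPrefetch, maxErrorVsCpu, TILE, aNext and bNext are made up for illustration. The "next tile" is held in registers while the current tile is in shared memory, so the global loads for tile t+1 are issued before the arithmetic on tile t starts.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

#define TILE 16  // tile width (assumption: n is a multiple of TILE)

// C = A * B for square n x n row-major matrices, with register prefetching.
__global__ void matmulPrefetch(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    int numTiles = n / TILE;

    // "Load next tile from global memory": fetch tile 0 into registers.
    float aNext = A[row * n + tx];
    float bNext = B[ty * n + col];

    float acc = 0.0f;
    for (int t = 0; t < numTiles; ++t) {
        // "Deposit current tile to shared memory"
        As[ty][tx] = aNext;
        Bs[ty][tx] = bNext;
        __syncthreads();

        // "Load next tile from global memory": issue the loads for tile t+1
        // now, so their latency can overlap with the arithmetic below.
        if (t + 1 < numTiles) {
            aNext = A[row * n + (t + 1) * TILE + tx];
            bNext = B[((t + 1) * TILE + ty) * n + col];
        }

        // "Compute current subblock"
        #pragma unroll
        for (int k = 0; k < TILE; ++k)
            acc += As[ty][k] * Bs[k][tx];

        // Everyone must finish reading As/Bs before the next deposit.
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Hypothetical check helper: runs the kernel and returns the largest
// absolute difference from a plain CPU triple loop. Error checking of
// the CUDA calls is omitted to keep the sketch short.
float maxErrorVsCpu(int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float* hA = (float*)malloc(bytes);
    float* hB = (float*)malloc(bytes);
    float* hC = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = (float)(i % 7);
        hB[i] = (float)(i % 5);
    }

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    matmulPrefetch<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k)
                ref += hA[i * n + k] * hB[k * n + j];
            maxErr = fmaxf(maxErr, fabsf(ref - hC[i * n + j]));
        }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return maxErr;
}

int main()
{
    printf("max abs error vs CPU: %g\n", maxErrorVsCpu(64));
    return 0;
}
```

The only difference from the usual tiled kernel is that the global loads are moved one iteration ahead, at the cost of two extra registers per thread. Whether this actually helps depends on the kernel and the hardware, so it is worth timing both versions.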