As far as I know, there is a performance-tuning technique called prefetching. Professor Hwu of UIUC mentioned it in his lecture and gave pseudocode for matrix multiplication, but I have no idea how to implement it in my program. Can anybody tell me more about it? A code example would be best, since I want to implement it. :)
Thanks in advance
The slide and pseudocode are as follows:
[codebox]Prefetching: one could double buffer the computation, getting a better instruction mix within each thread
This is classic software pipelining in ILP compilers
For now, this seems to be the only way left to speed up my program, so it matters to me. Any ideas about issuing the load of the next data right after loading the current data are welcome.
What more do you need than the example you already gave?
If you want a real code example where prefetching is used, HOOMD’s LJ force sum kernel prefetches indices out of an array (cur_neigh / next_neigh are used for the prefetching). See the code here: http://trac2.assembla.com/hoomd/browser/tr…a/LJForceGPU.cu . It is real application code and is probably a little dense to follow, though. The example you posted is nice, simple pseudocode that explains the concept.
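For something between the pseudocode and the HOOMD kernel, here is a minimal sketch of how the idea might look in a tiled matrix multiply. It is not the code from the lecture or from HOOMD. The names (matmul_prefetch, a_next, b_next, max_error_vs_cpu, TILE) are made up for illustration, and it assumes square row-major matrices whose size is a multiple of TILE. Each thread loads its element of the next tile into registers right after the current tile lands in shared memory, so the global loads are in flight while the inner product runs. Whether this actually helps depends on the hardware and on register pressure, so measure it.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // tile width (illustrative choice)

// C = A * B for square n x n row-major matrices.
// Assumption: n is a multiple of TILE.
__global__ void matmul_prefetch(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    // Load this thread's elements of the first tile into registers.
    float a_next = A[row * n + tx];
    float b_next = B[ty * n + col];

    float acc = 0.0f;
    int num_tiles = n / TILE;
    for (int t = 0; t < num_tiles; ++t) {
        // Deposit the prefetched registers into shared memory.
        As[ty][tx] = a_next;
        Bs[ty][tx] = b_next;
        __syncthreads();

        // Prefetch: issue the global loads for the next tile now,
        // so they can overlap with the computation below.
        if (t + 1 < num_tiles) {
            a_next = A[row * n + (t + 1) * TILE + tx];
            b_next = B[((t + 1) * TILE + ty) * n + col];
        }

        // Compute on the current tile.
        for (int k = 0; k < TILE; ++k)
            acc += As[ty][k] * Bs[k][tx];

        // Everyone must finish reading before the tile is overwritten.
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Runs the kernel on small integer-valued inputs and returns the
// largest absolute difference from a plain CPU triple loop.
float max_error_vs_cpu(int n)
{
    size_t count = (size_t)n * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> hA(count), hB(count), hC(count);
    for (size_t i = 0; i < count; ++i) {
        hA[i] = (float)(i % 7);
        hB[i] = (float)(i % 5);
    }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    matmul_prefetch<<<grid, block>>>(dA, dB, dC, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k)
                ref += hA[i * n + k] * hB[k * n + j];
            max_err = fmaxf(max_err, fabsf(ref - hC[i * n + j]));
        }
    return max_err;
}

int main()
{
    printf("max abs error vs CPU: %g\n", max_error_vs_cpu(64));
    return 0;
}
```

The only difference from the usual tiled kernel is where the global loads sit: they are issued one iteration early into a_next / b_next instead of being written straight into shared memory at the top of the loop. That costs two extra registers per thread, which is worth checking against your occupancy.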