Hi. I have an algorithm, coded in PGI CUDA Fortran, that is delivering a nice speed-up of order 50. However, I can clearly identify the rate-determining step as the non-contiguous reading of a field map by my millions of particles. Essentially, the particles rapidly get scrambled up in space due to motion in a magnetic field, scattering, etc.
I am faced with the following question: do I attempt to periodically re-order my particles using a GPU sort kernel in order to try to maintain coalesced access to my field map (which is static, i.e. does not evolve with time)? I have read various papers where developers have done this and achieved a worthwhile speed-up.
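As a rough illustration of the re-ordering idea, here is a minimal host-side C++ sketch, not my actual CUDA Fortran code. It assumes a uniform grid with its origin at zero and a structure-of-arrays particle layout. The `Grid` and `Particles` types and the function names are made up for illustration. Each particle gets a key equal to the linear index of the field-map cell it sits in, and the particle arrays are then permuted into key order so that neighbours in memory read neighbouring field-map entries. On the GPU the same pattern would presumably be a sort-by-key followed by a gather.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical particle storage: structure of arrays, positions only.
struct Particles {
    std::vector<double> x, y, z;
};

// Hypothetical uniform field grid: nx*ny*nz cells of edge length h, origin at 0.
struct Grid {
    int nx, ny, nz;
    double h;
};

// Linear index of the cell containing (x, y, z), clamped to the grid.
// The i index varies fastest, matching Fortran column-major storage.
inline int cell_index(const Grid& g, double x, double y, double z) {
    auto clampi = [](int v, int n) { return std::max(0, std::min(v, n - 1)); };
    const int i = clampi(static_cast<int>(x / g.h), g.nx);
    const int j = clampi(static_cast<int>(y / g.h), g.ny);
    const int k = clampi(static_cast<int>(z / g.h), g.nz);
    return i + g.nx * (j + g.ny * k);
}

// Reorder the particles so that those in the same cell are adjacent in memory.
inline void sort_by_cell(const Grid& g, Particles& p) {
    const std::size_t n = p.x.size();
    assert(p.y.size() == n && p.z.size() == n);

    // One key per particle: the cell it currently occupies.
    std::vector<int> key(n);
    for (std::size_t m = 0; m < n; ++m)
        key[m] = cell_index(g, p.x[m], p.y[m], p.z[m]);

    // Sort a permutation by key rather than moving the particle data directly.
    std::vector<std::size_t> perm(n);
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::stable_sort(perm.begin(), perm.end(),
                     [&](std::size_t a, std::size_t b) { return key[a] < key[b]; });

    // Gather every particle array through the same permutation.
    auto gather = [&](std::vector<double>& v) {
        std::vector<double> tmp(n);
        for (std::size_t m = 0; m < n; ++m) tmp[m] = v[perm[m]];
        v.swap(tmp);
    };
    gather(p.x);
    gather(p.y);
    gather(p.z);
}
```

Calling `sort_by_cell(grid, particles)` every so many time steps would be the "periodic re-order". How often it pays off depends on how quickly the particles scramble, which I have not measured.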
More promising, however, would be if I could set my field map to reside in texture memory (read-only access is fine for my purposes). Many people with similar problems have seen a big speed-up by doing this.
My question then is this: when can we expect texture memory to be made available in PGI CUDA Fortran? An earlier post suggests this year. Are your engineers on track with this, or have other developments bumped this task down their list? Would I be better off recoding my kernel in C?
Any info greatly appreciated, as always. Also, any advice on GPU sorting algorithms for doing what I suggest above would be very useful. I recently came across the Thrust library, for example, which looks like it contains some useful operators (including some routines which might be useful for sum reductions).
Rob.