Help will be greatly appreciated.
It’s a little hard to tell what you’re trying to do from just a straight-up code fragment, but here are two suggestions.
- Process many rows in parallel.
- Coalesce all reads (or use textures). The reads in this line are not coalesced: `test = current[index] - (choices[2*i]/(float)period);`
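Just to illustrate what coalescing means here (this is a generic sketch, not your kernel): reads coalesce when consecutive threads in a warp touch consecutive addresses, so the hardware can service the whole warp with one wide transaction.

```cuda
// Coalesced: thread i reads element i, so a warp's 32 loads fall in one
// contiguous segment of memory and become a single wide transaction.
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // consecutive threads, consecutive addresses
}

// Strided (bad): thread i reads element 2*i, so each transaction fetches
// twice the data the warp actually uses -- half the bandwidth is wasted.
__global__ void stridedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[2 * i];        // stride-2 pattern, like choices[2*i]
}
```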
The simplest thing is to copy all the data in one big copy instead of a lot of small ones, and to malloc and free only once.
Further, you can read `choices` into shared memory before you start working on it (you use exactly the same data in every thread).
Pass period as a float to begin with …
That’s just off the top of my head …
Have fun! :)
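A minimal sketch of the shared-memory idea (kernel name, argument names, and sizes are guesses from your fragment, so adjust to your actual code — this assumes `2*numChoices` floats fit in shared memory):

```cuda
// Hypothetical kernel: each block cooperatively loads the small, shared
// 'choices' array into shared memory once, then every thread reads it
// from there instead of re-reading it from global memory.
__global__ void processRows(const float *current, const float *choices,
                            float *out, int n, int numChoices, float period)
{
    extern __shared__ float sChoices[];        // 2*numChoices floats

    // Cooperative load: threads stride over the array together.
    for (int j = threadIdx.x; j < 2 * numChoices; j += blockDim.x)
        sChoices[j] = choices[j];
    __syncthreads();                           // loading must finish first

    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        float test = 0.0f;
        for (int i = 0; i < numChoices; ++i)
            test = current[index] - sChoices[2 * i] / period;
        out[index] = test;
    }
}
```

Launched with the dynamic shared-memory size as the third launch parameter, e.g. `processRows<<<blocks, threads, 2 * numChoices * sizeof(float)>>>(…)`.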
Even better, pass 1/period to begin with (a frequency instead of a period) … division is costlier than a multiply.
Unfortunately, your algorithm is not arithmetically intense, meaning that you do very little
calculation but a lot of memory access. For CUDA this implies that you suffer from the large
memory read latencies, because the thread scheduler cannot hide them.
The only way to optimize such an algorithm is to reduce the memory access. As Eri said,
try to share some data between threads. One possibility is using textures for your arrays
‘current’ or ‘choices’. Another way is to first load the data this block requires into shared
memory, synchronize the threads to ensure loading has finished, and then work on the shared
memory instead of accessing global memory directly.
The simpler solution is probably textures, but if your data fits in shared
memory, I think that is the better solution.
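For the texture route, here is a rough sketch using the texture-object API (names are placeholders, error checking omitted) — reads of `choices` then go through the texture cache, which absorbs the repeated accesses:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: fetch 'choices' through a 1D texture instead of
// plain global loads.
__global__ void kernelTex(cudaTextureObject_t choicesTex,
                          const float *current, float *out,
                          int n, int numChoices, float period)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) return;

    float test = 0.0f;
    for (int i = 0; i < numChoices; ++i)
        test = current[index] - tex1Dfetch<float>(choicesTex, 2 * i) / period;
    out[index] = test;
}

// Host-side setup: bind a texture object to a linear device buffer.
void makeChoicesTexture(float *dChoices, int count, cudaTextureObject_t *tex)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = dChoices;
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = count * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType; // return raw float values

    cudaCreateTextureObject(tex, &resDesc, &texDesc, NULL);
}
```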