I've been playing with CUDA for a bit and it's impressive how easy it is to use, hats off to the NVIDIA folks. However, I'm quite disappointed with the rather strict coalescing rules, since breaking them carries such a large perf penalty. The most efficient approach seems to be SoA rather than AoS style programming, e.g. passing a lot of separate arrays into a kernel instead of one array of structures gives a significant perf advantage.
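To make the SoA vs AoS point concrete, here is a minimal sketch. The `Particle` struct, the kernel names and `run_aos_vs_soa` are all made up for illustration, and the code only checks that both layouts give the same answer; it does not measure anything.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

struct Particle { float x, y, z; };  // hypothetical 12-byte record

// AoS: thread i touches p[i].x, so neighbouring threads are 12 bytes apart.
__global__ void scale_x_aos(Particle* p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;
}

// SoA: thread i touches x[i], so neighbouring threads hit adjacent floats.
__global__ void scale_x_soa(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Runs both layouts on the same data and checks that they agree.
bool run_aos_vs_soa(int n) {
    std::vector<Particle> aos(n);
    std::vector<float> soa_x(n);
    for (int i = 0; i < n; ++i) {
        aos[i] = {float(i), 0.0f, 0.0f};
        soa_x[i] = float(i);
    }

    Particle* d_aos = nullptr;
    float* d_x = nullptr;
    if (cudaMalloc(&d_aos, n * sizeof(Particle)) != cudaSuccess) return false;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_aos);
        return false;
    }
    cudaMemcpy(d_aos, aos.data(), n * sizeof(Particle), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, soa_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_x_aos<<<blocks, threads>>>(d_aos, 2.0f, n);
    scale_x_soa<<<blocks, threads>>>(d_x, 2.0f, n);

    // Blocking copies on the default stream wait for the kernels to finish.
    cudaError_t e1 = cudaMemcpy(aos.data(), d_aos, n * sizeof(Particle),
                                cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(soa_x.data(), d_x, n * sizeof(float),
                                cudaMemcpyDeviceToHost);
    cudaFree(d_aos);
    cudaFree(d_x);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (aos[i].x != soa_x[i] || soa_x[i] != 2.0f * float(i)) return false;
    return true;
}

int main() {
    printf("AoS and SoA results match: %s\n",
           run_aos_vs_soa(1 << 16) ? "yes" : "no");
    return 0;
}
```

The only difference between the two kernels is the address each thread touches. Timing them (e.g. with a profiler) is what would show the gap described above.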
Gather is no problem using the tex unit, so random-ish read patterns still run at reasonable speed, but there's no nice solution for scatter at speed. The best I can find is to resort to a two-pass kind of solution: pass 1 writes out the index that pass 2 should read from, then pass 2 reads randomly (via tex) and writes in a nice orderly fashion. In other words, it maps a scatter pattern onto a gather pattern at 2x the cost in bandwidth. Not nice, but it beats using uncoalesced writes.
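A rough sketch of that two-pass idea, under a few assumptions that are mine: the mapping is a plain rows x cols transpose (picked only because each output slot can work out its own source index), the kernel and function names are hypothetical, and the random read is an ordinary global load through a `const __restrict__` pointer instead of a tex fetch, to keep the example short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Pass 1: one thread per OUTPUT slot j records which input element it wants.
// Assumed mapping: out[c * rows + r] = in[r * cols + c] (a transpose).
// The write to src[j] is orderly: thread j writes slot j.
__global__ void build_gather_index(int* src, int rows, int cols) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < rows * cols) src[j] = (j % rows) * cols + (j / rows);
}

// Pass 2: random read of in[], orderly write of out[].
// (The post does this read via the tex unit; a plain load is used here.)
__global__ void gather(float* out, const float* __restrict__ in,
                       const int* __restrict__ src, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) out[j] = in[src[j]];
}

// Runs both passes and compares against a CPU scatter of the same mapping.
bool run_two_pass(int rows, int cols) {
    int n = rows * cols;
    std::vector<float> in(n), out(n, -1.0f), ref(n);
    for (int i = 0; i < n; ++i) in[i] = float(i);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            ref[c * rows + r] = in[r * cols + c];  // the original scatter

    float *d_in = nullptr, *d_out = nullptr;
    int* d_src = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    if (cudaMalloc(&d_src, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        cudaFree(d_out);
        return false;
    }
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    build_gather_index<<<blocks, threads>>>(d_src, rows, cols);
    gather<<<blocks, threads>>>(d_out, d_in, d_src, n);

    // Blocking copy on the default stream waits for both kernels.
    cudaError_t err = cudaMemcpy(out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_src);
    if (err != cudaSuccess) return false;

    for (int j = 0; j < n; ++j)
        if (out[j] != ref[j]) return false;
    return true;
}

int main() {
    printf("two-pass gather matches scatter: %s\n",
           run_two_pass(64, 48) ? "yes" : "no");
    return 0;
}
```

The index buffer `src` is the extra traffic: it gets written once in pass 1 and read once in pass 2, which is where the added bandwidth cost comes from.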
So I guess my question is: in the future, will uncoalesced reads and writes always suck this badly, or will the rules get relaxed? Currently, I think this is the biggest flaw in CUDA.