I agree, especially because the new cards have much better memory coalescing capability. For example, I used to access a 3D array (of float4) the usual way on an 8800GT:
variable accessed by thread (i,j,k) = array[Nx*(Ny*(k-1) + (j-1)) + (i-1)],
so that everything was coalesced according to the old rules. Now, I can access the array on a GTX280 using indirect addressing:
variable accessed by thread (i,j,k) = new_array[ pointer_array[Nx*(Ny*(k-1) + (j-1)) + (i-1)] - 1],
which breaks the old coalescing rule. Surprisingly, not only do I not get any speed penalty for doing so, but I actually get a small (0.5%) speedup! At the same time, my memory usage went down by 45% because I now only have to allocate the much smaller, condensed array ‘new_array’.
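
To make the two access patterns concrete, here is a rough host-side C++ sketch of the index arithmetic. It is not my actual kernel. Float4 is a stand-in for CUDA's float4, and linear_index, condense, load_direct and load_indirect are hypothetical helper names. The condensing rule used here (cells with identical values share one entry of new_array) is only an assumption for illustration; any rule that produces a 1-based pointer_array would do. In a kernel, each thread (i,j,k) would evaluate the same two index expressions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Stand-in for CUDA's float4 so the sketch compiles as plain C++.
struct Float4 { float x, y, z, w; };

// Linear offset of cell (i,j,k), with 1-based i, j, k as in the formulas above.
inline std::size_t linear_index(int i, int j, int k, int Nx, int Ny) {
    return static_cast<std::size_t>(Nx) *
               (static_cast<std::size_t>(Ny) * (k - 1) + (j - 1)) +
           (i - 1);
}

struct Condensed {
    std::vector<int> pointer_array;  // one 1-based index into new_array per cell
    std::vector<Float4> new_array;   // condensed storage
};

// Assumed condensing rule (illustration only): identical values share an entry.
Condensed condense(const std::vector<Float4>& array) {
    Condensed c;
    c.pointer_array.reserve(array.size());
    std::map<std::tuple<float, float, float, float>, int> seen;
    for (const Float4& v : array) {
        auto key = std::make_tuple(v.x, v.y, v.z, v.w);
        auto it = seen.find(key);
        if (it == seen.end()) {
            c.new_array.push_back(v);
            // Store the 1-based position of the entry just appended.
            it = seen.emplace(key, static_cast<int>(c.new_array.size())).first;
        }
        c.pointer_array.push_back(it->second);
    }
    return c;
}

// Old pattern: array[Nx*(Ny*(k-1) + (j-1)) + (i-1)]
inline Float4 load_direct(const std::vector<Float4>& array,
                          int i, int j, int k, int Nx, int Ny) {
    return array[linear_index(i, j, k, Nx, Ny)];
}

// New pattern: new_array[pointer_array[Nx*(Ny*(k-1) + (j-1)) + (i-1)] - 1]
inline Float4 load_indirect(const Condensed& c,
                            int i, int j, int k, int Nx, int Ny) {
    return c.new_array[c.pointer_array[linear_index(i, j, k, Nx, Ny)] - 1];
}
```

Note that the read of pointer_array itself still follows the old contiguous pattern; only the second lookup into new_array is scattered.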