Will there ever be a time when memory reads are NOT done in contiguous 128-byte chunks?

Like, how long until the loads issued by a warp can just gather data from arbitrary locations in memory without paying a penalty for uncoalesced access?

I feel like this is one thing that would make CUDA performance and ease-of-use skyrocket (especially ease-of-use).
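
To make concrete what I mean by contiguous versus arbitrary, here is a rough host-side sketch (plain C++, no GPU needed) that counts how many 128-byte segments the 32 threads of a warp would touch for a given access pattern. The helper names (`segments_touched`, `contiguous_indices`, `strided_indices`) are made up for illustration, and it assumes 4-byte elements, a 128-byte-aligned base address, and the 128-byte granularity from my question; real hardware may split accesses differently.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Assumptions for illustration only: 32 threads per warp, 128-byte segments,
// and an array whose base address is 128-byte aligned.
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kSegmentBytes = 128;

// Hypothetical helper: count the distinct 128-byte segments touched when each
// thread of a warp loads one element at the given element index.
std::size_t segments_touched(const std::vector<std::size_t>& element_indices,
                             std::size_t element_bytes = 4) {
    assert(element_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t idx : element_indices) {
        // Byte offset of the element, mapped to the segment that contains it.
        segments.insert((idx * element_bytes) / kSegmentBytes);
    }
    return segments.size();
}

// Thread t reads element t: the whole warp covers one contiguous 128-byte run.
std::vector<std::size_t> contiguous_indices() {
    std::vector<std::size_t> indices(kWarpSize);
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        indices[t] = t;
    }
    return indices;
}

// Thread t reads element t * stride: a simple stand-in for a scattered gather.
std::vector<std::size_t> strided_indices(std::size_t stride) {
    std::vector<std::size_t> indices(kWarpSize);
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        indices[t] = t * stride;
    }
    return indices;
}
```

Under those assumptions, `segments_touched(contiguous_indices())` comes out to 1, while `segments_touched(strided_indices(32))` comes out to 32, i.e. one segment per thread. That gap is what I wish the hardware could hide.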