A quick postscript to this topic about accessing global memory. I was finding severe slowdowns for particular matrix sizes (all still multiples of 16, to ensure coalesced global memory reads at all times) in a Cholesky factorization routine, and suspected it had to do with how global memory is partitioned between memory channels on the card. The problem was particularly nasty since it affected different cards in different ways.
My original code accessed “block-columns” (i.e. strips 16 wide), which on reflection should have been particularly susceptible to this problem, since for certain matrix widths all the data in a block-column can map to a single channel. I’ve now rewritten the code to access “block-rows” (i.e. strips 16 high), so that the data in a block-row must be spread across channels. (New code here.)
It turns out that the severe (i.e. up to a factor of 2 or 3) slowdowns are gone, and an 8800 GTX now almost always beats an 8800 GTS 640MB, as one would hope. The runtime still does not increase quite uniformly with matrix size (e.g. a 12304x12304 matrix takes 7.8s whereas a 12288x12288 matrix takes 8.9s on a GTX, and a 12480x12480 matrix takes a particularly long 10.9s), but other effects like load balancing between blocks might also be having some effect at this level.
So it does seem that if one sees substantial slowdowns for certain array sizes in a program that does a lot of memory accessing, it might be worth trying either to access the data more “rowwise” than “columnwise”, or to pad the array somewhat (though the latter is tricky because the right padding depends on the card).