This is a reply to this topic and should probably go into a new one. Here is the detailed data from my measurements of device memory subsystem performance on G80. Take a look to see where your proposed thread, block and occupancy configuration will land in terms of device memory throughput. The measurements come from a trivial program that does one-warp-wide device memory accesses of several different flavors, either looping through or scattering around a reasonably sized buffer. It does seem to help predict performance for more complex memory-bound apps.
The full description and my interpretation are in the attached text file.
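For anyone who wants to try this themselves, here is a minimal sketch of the kind of test program described above. It is not the actual benchmark code: the kernel names, buffer size, launch configuration and the `gbPerSec` helper are all assumptions made up for illustration, and only two access flavors (sequential 32-bit reads and pseudo-random scattered reads) are shown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Looping flavor: consecutive threads read consecutive 32-bit words.
__global__ void readCoalesced(const unsigned *buf, unsigned *out, unsigned n, int iters)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned stride = gridDim.x * blockDim.x;
    unsigned i = tid % n, acc = 0;
    for (int k = 0; k < iters; ++k) {
        acc += buf[i];
        i = (i + stride) % n;
    }
    out[tid] = acc;  // store the sum so the loads are not optimized away
}

// Scattering flavor: each thread follows its own pseudo-random index sequence.
__global__ void readScattered(const unsigned *buf, unsigned *out, unsigned n, int iters)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned idx = tid, acc = 0;
    for (int k = 0; k < iters; ++k) {
        idx = idx * 1664525u + 1013904223u;  // linear congruential step
        acc += buf[(idx >> 8) % n];
    }
    out[tid] = acc;
}

typedef void (*ReadKernel)(const unsigned *, unsigned *, unsigned, int);

// Hypothetical helper: bytes moved in `ms` milliseconds, expressed in GB/s.
double gbPerSec(double bytes, float ms) { return bytes / (ms * 1.0e6); }

// Times one launch of `kern` with CUDA events, after a warm-up launch.
float timeKernel(ReadKernel kern, const unsigned *buf, unsigned *out,
                 unsigned n, int iters, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    kern<<<blocks, threads>>>(buf, out, n, iters);  // warm-up, not timed
    cudaEventRecord(start);
    kern<<<blocks, threads>>>(buf, out, n, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const unsigned n = 1u << 24;                         // 64 MB of 32-bit words (assumed size)
    const int iters = 256, blocks = 64, threads = 128;   // assumed configuration; vary these
    unsigned *buf = 0, *out = 0;
    if (cudaMalloc(&buf, n * sizeof(unsigned)) != cudaSuccess ||
        cudaMalloc(&out, blocks * threads * sizeof(unsigned)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(buf, 0, n * sizeof(unsigned));

    // Bytes read per launch: every thread reads one 32-bit word per iteration.
    double bytes = (double)blocks * threads * iters * sizeof(unsigned);
    float msC = timeKernel(readCoalesced, buf, out, n, iters, blocks, threads);
    float msS = timeKernel(readScattered, buf, out, n, iters, blocks, threads);
    printf("looping:    %.2f GB/s\n", gbPerSec(bytes, msC));
    printf("scattering: %.2f GB/s\n", gbPerSec(bytes, msS));

    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```

Sweeping `blocks` and `threads` in a sketch like this is one way to see where a given thread, block and occupancy choice lands; the index arithmetic in the loops adds some overhead of its own, so treat the numbers as rough.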
Simon: I am pretty sure that warp-wide coalescing is required for 32-bit reads and nothing else. That stands to reason, given that the measured throughput is more than one can read in a half-warp, i.e. one needs to fill a whole warp in 3 memory cycles to have any hope of meeting the measured throughput.
ed: another busted link