To my knowledge, the first Fermi cards will have a memory interface width of 384 bits, which means only 12 floats can be read per transfer. On the other hand, it is recommended to use for example 16 x 16 threads per block for optimal performance. Won't this be very inconvenient for memory reads and writes? Will a 16-float read be split into one 12-float and one 4-float transaction? Am I missing something here, or is there a quick solution to the problem?
Regarding shared memory, Fermi is said to have either 48 KB shared memory and 16 KB L1 cache, or vice versa. If I want to use the full 64 KB as shared memory, is that possible, i.e. can the L1 cache be used as shared memory as well?
It’s all abstracted away from you. I’m sure the architecture guys could explain it in detail, but the bus width doesn’t affect performance noticeably beyond the expected bandwidth of bus width * frequency. The exact packing of data and addresses per line per clock is probably interesting for the hardware guys to dig into, but for CUDA programming it’s ignorable.
For example, the GTX260 and GTX275 have a bus width of 448 bits and the GTX285 has a bus width of 512 bits, yet it’s not like you’re hurt by 16-float memory accesses on the GTX275. The GT 130 uses a 192-bit bus width, as does the 9600GSO. The 8800 GTX and 8800 Ultra also used a 384-bit bus width.
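To illustrate the point: a plain copy kernel launched with 16 x 16 blocks, where consecutive threads touch consecutive floats, gets coalesced transactions on any of these cards regardless of bus width. How the hardware spreads those transactions across the memory partitions is invisible to the code. (Kernel and variable names here are just for illustration.)

```cuda
__global__ void copy16x16(const float *in, float *out, int width)
{
    // Consecutive threadIdx.x values read consecutive floats, so each
    // warp issues coalesced memory transactions. The hardware splits
    // them across the physical bus/partitions on its own; nothing in
    // the code cares whether the interface is 384, 448, or 512 bits.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * width + x] = in[y * width + x];
}

// Launched e.g. as:
//   dim3 block(16, 16);
//   dim3 grid(width / 16, height / 16);
//   copy16x16<<<grid, block>>>(d_in, d_out, width);
```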
The full Fermi docs are not out, but the initial presentations and white paper listed 16/48 or 48/16 as the two L1/Shared options. 64/0 is really unlikely since some shared memory is used by all kernels for housekeeping.
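Based on those presentations, the split is selectable per kernel through the runtime API with `cudaFuncSetCacheConfig` (or for the whole device with `cudaDeviceSetCacheConfig`), but only between the two listed configurations. A minimal sketch (the kernel itself is a placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    __shared__ float tile[256];          // uses the shared-memory side
    tile[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    data[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}

int main()
{
    // Request the 48 KB shared / 16 KB L1 split for this kernel.
    // cudaFuncCachePreferL1 would request the opposite (16/48) split;
    // there is no option for a 64/0 configuration.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // ... allocate, launch, etc.
    return 0;
}
```

Note that the preference is a hint: the driver may fall back to the other configuration if the kernel can't run with the requested one.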