I have some questions about compute capability 2.0, as I haven't had a chance to experiment with these cases myself.
I think the most important bottleneck in CUDA is register usage. Under what conditions is local memory used instead of registers? Can you give examples, perhaps involving the `-maxrregcount` compiler option? What is the performance effect of register spilling on compute capability 2.0?
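To make this concrete, here is a hypothetical kernel sketch (names made up by me) where I would guess the array ends up in local memory rather than registers, because it is indexed with a runtime value:

```cuda
// Sketch: the compiler cannot keep `buf` in registers because it is
// indexed dynamically, so it should be placed in local memory.
__global__ void spill_example(const float *in, float *out)
{
    float buf[32];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < 32; ++i)
        buf[i] = in[tid * 32 + i];
    int j = tid % 32;        // index known only at runtime
    out[tid] = buf[j];       // dynamic indexing -> local memory
}
```

Is compiling with `nvcc -Xptxas -v` (to report per-thread register and local-memory usage) and lowering the cap with `-maxrregcount=N` the right way to observe spilling?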
What is the performance penalty for strided and non-coalesced accesses to global memory on compute capability 2.0, given the global cache? For example, on 1.x devices, texture fetches can be used to avoid non-coalesced accesses to global memory, since texture memory is cached.
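Here is the kind of strided pattern I mean (a hypothetical sketch):

```cuda
// Each thread reads element tid * stride. For stride > 1, the reads of a
// warp no longer fall within one contiguous segment, so they cannot be
// coalesced into a single memory transaction.
__global__ void strided_read(const float *in, float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * stride];
}
```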
What are the per-multiprocessor sizes of the on-chip global and local caches on compute capability 2.0? What is the total amount of on-chip memory, and how is it distributed? For example, is the register file 48 KB?
Branch predication avoids divergence and serialization of instructions. Can you give examples of cases where branch predication is applied? Also, `#pragma unroll` is not explained in detail in the documentation.
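What I have in mind is something like the sketch below: a short conditional that I assume the compiler turns into predicated instructions rather than an actual branch, plus a loop with the unroll pragma (both examples are my own guesses, not from the documentation):

```cuda
__global__ void predication_example(float *x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Short if/else body: instead of branching, both paths can be issued
    // with per-thread predicates, so the warp never actually diverges.
    if (x[tid] > 0.0f)
        x[tid] *= 2.0f;
    else
        x[tid] = 0.0f;
}

__global__ void unroll_example(float *x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    // Ask the compiler to unroll the loop body four times:
    #pragma unroll 4
    for (int i = 0; i < 16; ++i)
        sum += x[tid * 16 + i];
    x[tid] = sum;
}
```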
My other questions are about hardware:
What is the difference between the GeForce, Tesla, and Quadro series? Although the GeForce series is aimed at gaming, the GeForce GTX 480 looks higher-performance than the Tesla C2050.
There are graphics cards that have two GPUs, such as the GeForce GTX 295. What is the performance gain (2x?) of these cards over single-GPU cards like the GeForce GTX 280? Are these cards programmed using multi-GPU principles? Is GPU-to-GPU communication possible without going through the host, given that the GPUs sit on the same board with the same off-chip memory?
This is mentioned in the Fermi whitepaper, I believe. Each multiprocessor has 64 kB of memory that can be split 16 kB shared memory / 48 kB L1 cache, or vice versa. The L2 cache is for the entire GPU and has a size of 768 kB. (The exception is the GTX 460, where the L2 cache is cut down in proportion to the reduced number of memory channels. Presumably, this will be documented in the next CUDA release.)
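The shared memory / L1 split can be requested per kernel with `cudaFuncSetCacheConfig`; a minimal sketch (the kernel name is made up):

```cuda
// Prefer 48 kB L1 cache / 16 kB shared memory for this kernel:
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
// Or prefer 48 kB shared memory / 16 kB L1 cache instead:
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
```

It is a preference, not a guarantee; the driver may ignore it if, say, the kernel needs more shared memory than the requested configuration allows.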
Short version: GeForce has no ECC, less device memory, and its double-precision throughput is 1/8 of the single-precision rate, whereas for Tesla the factor is only 1/2.
The performance gain can be linear, but depends on your problem. The two GPUs appear as two distinct CUDA devices each with separate device memory. NVIDIA implements these dual GPU devices with a PCI-Express switch on the board, so transfers between system memory and device memory have to share the same channel, which can slow things down if you need to perform large simultaneous transfers to both GPUs. There is currently no GPU-GPU direct communication mechanism through the PCI-Express bus, and since the device memories are distinct, there can be no communication through that route either.
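In practice you drive the two GPUs of a GTX 295 like any two separate devices, e.g. one host thread per device. A sketch (error checking omitted, names made up):

```cuda
// One host thread per GPU: each thread binds to its own device id and
// then works entirely against that device's memory.
void worker(int dev, const float *host_in, float *host_out, size_t bytes)
{
    cudaSetDevice(dev);
    float *d_buf;
    cudaMalloc((void **)&d_buf, bytes);
    cudaMemcpy(d_buf, host_in, bytes, cudaMemcpyHostToDevice);
    /* ... launch kernels on this device ... */
    cudaMemcpy(host_out, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}

// Any GPU-to-GPU transfer has to be staged through host memory:
//   cudaMemcpy(host_tmp, d_a, bytes, cudaMemcpyDeviceToHost);  // device 0
//   cudaMemcpy(d_b, host_tmp, bytes, cudaMemcpyHostToDevice);  // device 1
```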