Thanks for your reply. I checked the programming guide. It seems to me that only the max threads per block and the max number of blocks per grid are wrong. What else is wrong?
Blocks are scheduled at the SM level, not the core level. Cores are not scheduled independently at all - they run a single warp in SIMD fashion at a time. The number of blocks per MP is limited by (a) the block size and the warps-per-MP limit, and (b) the kernel's register and shared memory requirements. So if you had 32 threads per block (ie 1 warp per block), and each thread didn’t use a lot of registers or shared memory, you could theoretically have 48 blocks active per MP (ie hitting the 1536 threads per MP limit). This is also discussed in the programming guide (Chapter 5).
Yes, but the 470 has 14 SMs and the 285 has 30, so the net gain in total instruction throughput is less than 4x. (Additionally, the clock on the GTX 470 is lower than the GTX 285.)
No. Nothing in that sentence is correct. The GTX 470 doesn’t have four times as many cores as a GTX 285 - it has about 1.87 times as many (448 versus 240). And the clock speed isn’t the same; it is actually a little slower. But then there are considerable architectural differences that make it more complex - 14 versus 30 MPs, an instruction retire rate of 2 warps per 2 clock cycles per MP on Fermi versus 1 warp per 4 clock cycles on the GTX 285, memory bandwidth, cache structure differences, instruction pipeline depth and latency differences… The list is long.
There are memory bandwidth differences as well. It depends on how you define “faster”. If you mean peak flops (ie an unattainable theoretical maximum throughput), then you are close. If you mean real-world performance, you are not that close.
Well, the number of cores = number of SMs * 32, so indirectly they do. But not for scheduling, which is what you were originally asking about.
It depends on your interpretation of concurrent thread execution. Each SM also has 16 load/store units and 4 SFUs which can be occupied by threads other than the ones running on the cores.
But they have 30 SMs. Each can execute one warp of 32 threads concurrently. Does that mean they can execute 960 threads concurrently even though they have only 240 cores?
That’s because pre-Fermi architectures have 8 cores per SM and a warp is quad-pumped to execute all 32 threads of the warp. So there are 30 active warps (30×32 = 960 threads), but only a quarter of them (30×8 = 240 threads) are executing simultaneously.
Max # of active threads (more active threads means more opportunity to hide global memory latency by switching between them; this is what occupancy measures)
Throughput: rate of instruction completion
A GTX 285 has 30 SMs, each of which can finish 1 instruction (basic arithmetic, logical operations, etc) for 1 warp in 4 clock cycles. A GTX 470 has 14 SMs, each of which can finish an instruction for two warps (potentially a different instruction for each warp) in 2 clock cycles.
No, that’s right. The 470 clock rate is lower, so the difference in floating point performance is a factor of about 1.5. They're the same instructions as mentioned in the programming guide appendix.
So now we’ve come full circle to how the peak GFLOPS is computed:
2 (for the multiply-add instruction) * number of CUDA cores * clock rate
In both the GTX 285 and the GTX 470, each “CUDA core” (or SP, as they used to be called) can finish 1 MAD per clock. They are just grouped differently in the two architectures.