Blocks are scheduled at the SM level, not the core level. Cores are not scheduled independently at all - they run a single warp in SIMD fashion at a time. The number of blocks per MP is limited by (a) the block size and the warps per MP limit, and (b) the kernel's register and shared memory requirements. So if you had 32 threads per block (i.e. 1 warp per block), and each thread didn’t use a lot of registers or shared memory, the 1536 threads per MP limit alone would allow 48 blocks active per MP, although Fermi also caps resident blocks at 8 per MP, so that cap is the one you would actually hit. This is also discussed in the programming guide (Chapter 5).
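To make the limits above concrete, here is a rough sketch of how the per-MP block count falls out of them. The constants are the compute capability 2.0 (Fermi) figures as I remember them from the programming guide's tables; the 8-blocks-per-MP cap, the 32K register file and the 48 KB shared memory size are not stated in the post, so treat them as assumptions. The function names are made up for illustration, and allocation granularity is ignored, so real occupancy can come out lower.

```python
# Rough estimate of resident blocks per multiprocessor (MP) on Fermi.
# All limits below are assumed compute capability 2.0 values;
# register/shared memory allocation granularity is ignored.
WARP_SIZE = 32
MAX_WARPS_PER_MP = 48            # 48 warps * 32 threads = 1536 threads per MP
MAX_BLOCKS_PER_MP = 8            # hardware cap on resident blocks (assumed)
REGISTERS_PER_MP = 32768         # 32-bit registers (assumed)
SHARED_MEM_PER_MP = 48 * 1024    # bytes, in the 48 KB configuration (assumed)


def block_limits(threads_per_block, regs_per_thread, smem_per_block):
    """Return each individual limit on resident blocks per MP."""
    # Blocks are allocated in whole warps, so round up.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    limits = {
        "warps": MAX_WARPS_PER_MP // warps_per_block,
        "blocks": MAX_BLOCKS_PER_MP,
        "registers": REGISTERS_PER_MP
        // (regs_per_thread * warps_per_block * WARP_SIZE),
    }
    if smem_per_block > 0:
        limits["shared"] = SHARED_MEM_PER_MP // smem_per_block
    return limits


def resident_blocks(threads_per_block, regs_per_thread, smem_per_block):
    """Return (blocks per MP, name of the limiting resource)."""
    limits = block_limits(threads_per_block, regs_per_thread, smem_per_block)
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


if __name__ == "__main__":
    # One warp per block, light register use, no shared memory.
    print(block_limits(32, 16, 0))
    print(resident_blocks(32, 16, 0))
    # Large blocks with heavier register use.
    print(resident_blocks(512, 32, 0))
```

For the 32-threads-per-block case the warp limit by itself gives 48 blocks, but the overall minimum is set by the block cap, which is why very small blocks tend to leave an MP underpopulated.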
No. Nothing in that sentence is correct. The GTX470 doesn’t have four times as many cores as a GTX285 - it has about 1.87 times as many (448 versus 240). And the clock speed isn’t the same, it is actually a little slower. But then there are considerable architectural differences that make it more complex - 14 versus 30 MP, an instruction retire rate of 2 warps per 4 clock cycles per MP on Fermi versus 1 on the GTX285, memory bandwidth, cache structure differences, instruction pipeline size and latency differences… The list is long.
There are memory bandwidth differences as well. It depends on how you define “faster”. If you mean peak flops (i.e. the unobtainable maximum theoretical throughput), then you are close. If you mean real world performance, you are not that close.
Well, the number of cores = number of SMs * 32, so indirectly they do. But not for scheduling, which is what you were originally asking about.
That’s because pre-Fermi architectures have 8 cores per SM and a warp is quad-pumped to execute all 32 threads of the warp. So there are 30 active warps (30*32 = 960 threads), but only 1/4th (30*8 = 240) of those threads are executed simultaneously.
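The quad-pumping arithmetic can be spelled out in a few lines. This is only a sketch using the numbers from the post (30 SMs, 8 cores per SM, 32 threads per warp); the variable names are mine.

```python
# Quad-pumping on a pre-Fermi part such as the GTX 285.
WARP_SIZE = 32       # threads per warp
CORES_PER_SM = 8     # pre-Fermi
NUM_SMS = 30         # GTX 285

# One warp instruction needs several passes over the 8 cores.
clocks_per_warp_instruction = WARP_SIZE // CORES_PER_SM

# One warp executing per SM: threads "in flight" across the chip.
threads_in_executing_warps = NUM_SMS * WARP_SIZE

# Threads actually being processed in any single clock cycle.
threads_per_clock = NUM_SMS * CORES_PER_SM

print(clocks_per_warp_instruction)   # 4
print(threads_in_executing_warps)    # 960
print(threads_per_clock)             # 240
```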
Max # of active threads: more active threads means more opportunity to hide global memory latency by switching between them. This is what occupancy measures.
Throughput: rate of instruction completion
A GTX 285 has 30 SMs, each of which can finish 1 instruction (basic arithmetic, logical operations, etc.) for 1 warp in 4 clock cycles. A GTX 470 has 14 SMs, each of which can finish an instruction for two warps (potentially a different instruction for each warp) in 2 clock cycles.
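Taking those figures at face value, the per-clock instruction throughput of the two chips can be compared as below. This is a back-of-the-envelope sketch: it counts warp instructions per clock cycle only, so it ignores the difference in clock rates, memory bandwidth and everything else mentioned earlier in the thread. The helper name is hypothetical.

```python
from fractions import Fraction


def warp_instructions_per_clock(num_sms, warps, clocks):
    """Chip-wide warp instructions completed per clock cycle."""
    return num_sms * Fraction(warps, clocks)


# GTX 285: 30 SMs, 1 warp instruction per 4 clocks each.
gtx285 = warp_instructions_per_clock(30, 1, 4)
# GTX 470: 14 SMs, 2 warp instructions per 2 clocks each.
gtx470 = warp_instructions_per_clock(14, 2, 2)

ratio = gtx470 / gtx285

print(float(gtx285))   # 7.5
print(float(gtx470))   # 14.0
print(ratio)           # 28/15, the same ratio as 448 cores to 240 cores
```

The ratio works out the same as the core count ratio, which is expected: per clock, each core retires one thread's instruction, so the per-clock figure is just the core count divided by the warp size.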