Tesla C2050/GTX 470 limits?

Hi, I am new to GPGPU, but I am about to buy a GTX 470 to try out CUDA on Linux.

I want to know whether I have any misunderstandings about the limits of the GTX 470:

14 SMs
max 32 threads per warp

each SM can execute two warps concurrently (dual issue)
max threads executed concurrently is 896

each SM can schedule 48 warps max
max threads scheduled concurrently is 21,504

each SM can schedule 8 blocks max
theoretically can have at most 114,688 threads, but this is limited by the max warps an SM can schedule

max 1024 threads per block
max 65,535x65,535=4,294,836,225 blocks per grid
no limit to how many grids
one grid per kernel

Anything wrong???

A lot of that isn’t correct. There is a nice summary on p140 of the CUDA 3.0 programming guide.
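Incidentally, most of those limits can also be read straight off the card at runtime. A minimal sketch using the standard cudaGetDeviceProperties runtime call (device 0 assumed; this is essentially what the deviceQuery SDK sample prints):

```
// Minimal host-side sketch: print the hardware limits discussed above for
// device 0 using the CUDA runtime API (compile with nvcc).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("Name                  : %s\n", prop.name);
    printf("Multiprocessors (SMs) : %d\n", prop.multiProcessorCount);
    printf("Warp size             : %d\n", prop.warpSize);
    printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims        : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims         : %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Shader clock (kHz)    : %d\n", prop.clockRate);
    return 0;
}
```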

Thanks for your reply. I checked the programming guide. It seems to me that only the max threads per block and the max number of blocks per grid are wrong. What else is wrong???

This bit is very wrong:

Blocks are scheduled at the SM level, not the core level. Cores are not scheduled independently at all - they run a single warp at a time in SIMD fashion. The number of blocks per MP is limited by (a) the block size and the per-MP warp and block limits, and (b) the kernel’s register and shared memory resource requirements. So if you had 32 threads per block (ie 1 warp per block), and each thread didn’t use a lot of registers or shared memory, the warp limit alone would allow 48 such blocks per MP (ie. the 1536 threads per MP limit), although the separate cap of 8 resident blocks per MP would kick in first for blocks that small. This is also discussed in the programming guide (Chapter 5).
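To make that arithmetic concrete, here is a small host-side sketch of the per-SM bounds; the hard-coded limits are the compute capability 2.0 figures from the programming guide table, and the example kernel’s register/shared memory usage is made up purely for illustration:

```
// Sketch: how many blocks can be resident per SM, given the per-SM limits
// (compute capability 2.0 values) and a hypothetical kernel's resource usage.
#include <cstdio>
#include <algorithm>

int main()
{
    // Per-SM limits for compute capability 2.0.
    const int maxWarpsPerSM  = 48;
    const int maxBlocksPerSM = 8;
    const int regsPerSM      = 32 * 1024;   // 32K 32-bit registers
    const int smemPerSM      = 48 * 1024;   // 48 KB shared memory (default split)

    // Hypothetical kernel: these numbers are made up for the example.
    const int threadsPerBlock = 32;         // 1 warp per block, as in the post
    const int regsPerThread   = 20;
    const int smemPerBlock    = 1024;       // bytes of shared memory per block

    const int warpsPerBlock = (threadsPerBlock + 31) / 32;

    int byWarps  = maxWarpsPerSM / warpsPerBlock;                  // 48
    int byRegs   = regsPerSM / (regsPerThread * threadsPerBlock);  // 51
    int bySmem   = smemPerSM / smemPerBlock;                       // 48
    int byBlocks = maxBlocksPerSM;                                 // 8

    int resident = std::min(std::min(byWarps, byRegs),
                            std::min(bySmem, byBlocks));

    printf("resident blocks per SM: %d (warps %d, regs %d, smem %d, blocks %d)\n",
           resident, byWarps, byRegs, bySmem, byBlocks);
    printf("resident threads per SM: %d of %d possible\n",
           resident * threadsPerBlock, maxWarpsPerSM * 32);
    return 0;
}
```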

Oh, I see. So cores have nothing to do with how many threads you can schedule, right??? What role do cores play in all this??

Execution speed only

So if the clock speed is the same, and the 470 has 4x as many cores per SM as the 285, then the 470 should execute 4x faster per SM???

Yes, but the 470 has 14 SMs and the 285 has 30, so the net gain in total instruction throughput is less than 4x. (Additionally, the clock on the GTX 470 is lower than on the GTX 285.)

No. Nothing in that sentence is correct. The GTX 470 doesn’t have four times as many cores as a GTX 285 - it has about 1.87 times as many (448 versus 240). And the clock speed isn’t the same; it is actually a little slower. But then there are considerable architectural differences that make it more complex - 14 versus 30 MPs, an instruction retire rate of 2 warps per 2 clock cycles per MP on Fermi versus 1 warp per 4 clock cycles on the GTX 285, memory bandwidth, cache structure differences, instruction pipeline size and latency differences… The list is long.

Thanks for your reply.

What about the 470 vs the 480 then? Should the 480 only be faster than the 470 by about 15/14x if both are running at the same clock speed???

If so, the number of cores still doesn’t play any part in this picture. All you need to know is how many SMs there are, right???

There are memory bandwidth differences as well. It depends on how you define “faster”. If you mean peak flops (ie the unobtainable maximum theoretical throughput), then you are close. If you mean real world performance, you are not that close.

Well, the number of cores = number of SMs * 32 (on Fermi), so indirectly they do. But not for scheduling, which is what you were originally asking about.

There are only 448 cores, so at most only 448 threads can execute concurrently. (The maximum number of resident threads per SM is 1536.)

Even though it is dual issue (two warps being issued per SM), that still does not mean that two threads are running concurrently on a single core.

Please correct me if I am wrong!

It depends on your interpretation of concurrent thread execution. Each SM also has 16 load/store units and 4 SFUs which can be occupied by threads other than the ones running on the cores.

N.

What about 275/280/285???

They have 240 cores

But they have 30 SMs. Each can execute one warp of 32 threads concurrently. Does that mean they can execute 960 threads concurrently even though they have only 240 cores?


That’s because pre-Fermi architectures have 8 cores per SM and a warp is quad-pumped to execute all 32 threads of the warp. So there are 30 active warps (30 × 32 = 960 threads), but only 1/4th of them (30 × 8 = 240) are executed simultaneously.

N.

32 threads execute on 8 CUDA cores in a pipelined manner, taking a minimum of 4 clock cycles per instruction.
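A tiny sketch of that pre-Fermi arithmetic, using the figures quoted above (30 SMs, 8 cores per SM, 32-thread warps, 4 clocks per warp instruction):

```
// Pre-Fermi (GT200) issue arithmetic, using the figures quoted above.
#include <cstdio>

int main()
{
    const int sms           = 30;  // GTX 285
    const int coresPerSM    = 8;
    const int warpSize      = 32;
    const int clocksPerWarp = 4;   // a warp is quad-pumped over the 8 cores

    printf("threads issuing per clock      : %d\n", sms * coresPerSM);  // 240
    printf("threads active (1 warp per SM) : %d\n", sms * warpSize);    // 960
    printf("clocks to retire one warp inst.: %d\n", clocksPerWarp);
    return 0;
}
```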

There are only two things that matter here:

  • Max # of active threads (more active threads means more opportunity to hide global memory latency by switching between them; this is what occupancy measures)

  • Throughput: rate of instruction completion

A GTX 285 has 30 SMs which can finish 1 instruction (basic arithmetic, logical operations, etc) for 1 warp in 4 clock cycles. A GTX 470 has 14 SMs which can finish an instruction for two warps (potentially a different instruction for each warp) in 2 clock cycles.

If I understand you correctly, after four clock cycles the GTX 285 can finish 30 × 32 = 960 instructions and the GTX 470 can finish 14 × 2 × 32 × 2 = 1,792 instructions???

In reality, they are probably not comparable because the instructions they can execute are a bit different.

No, that’s right. The 470’s clock rate is lower, so the difference in floating point performance comes out to a factor of about 1.5. And they are the same instructions, as listed in the programming guide appendix.
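As a quick sanity check of those counts, here is the arithmetic spelled out (per-SM issue rates exactly as described above):

```
// Sanity check of the per-4-clock instruction counts quoted above.
#include <cstdio>

int main()
{
    // GTX 285: 30 SMs, 1 warp instruction (32 threads) per SM per 4 clocks.
    int gtx285 = 30 * 32;           // 960 thread-instructions per 4 clocks

    // GTX 470: 14 SMs, 2 warp instructions (2 x 32 threads) per SM per 2 clocks,
    // so twice that per 4 clocks.
    int gtx470 = 14 * 2 * 32 * 2;   // 1792 thread-instructions per 4 clocks

    printf("per 4 clocks: GTX 285 = %d, GTX 470 = %d\n", gtx285, gtx470);
    return 0;
}
```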

So now we’ve come full circle to how the peak GFLOPS figure is computed:

2 (for the multiply-add instruction) * number of CUDA cores * clock rate

In both the GTX 285 and the GTX 470, each “CUDA core” (or SPs, as they used to call them) can finish 1 MAD per clock. They are just grouped differently in the two architectures.
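Plugging numbers into that formula (core counts from this thread; the shader clocks of roughly 1476 MHz and 1215 MHz are the published figures and should be treated as approximate):

```
// Peak single-precision GFLOPS = 2 (MAD) * CUDA cores * shader clock (GHz).
#include <cstdio>

static double peak_gflops(int cores, double clock_ghz)
{
    return 2.0 * cores * clock_ghz;   // 2 flops per MAD per core per clock
}

int main()
{
    double gtx285 = peak_gflops(240, 1.476);   // ~708 GFLOPS
    double gtx470 = peak_gflops(448, 1.215);   // ~1089 GFLOPS

    printf("GTX 285 peak: %.0f GFLOPS\n", gtx285);
    printf("GTX 470 peak: %.0f GFLOPS\n", gtx470);
    printf("ratio       : %.2f\n", gtx470 / gtx285);   // ~1.54, the factor of 1.5 above
    return 0;
}
```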