What is the speed at which the processors inside the Multiprocessor execute?

Do you mean the clock of the stream processors?

It depends on your graphics card.

The GeForce 8800 Ultra, for example, has a 1500 MHz clock.

Just take a look at the specs.

The deviceQuery sample will show the clock.

Thanks for that info. So this means that each warp is executed at 1500/4 = 375 MHz. Am I right?

So, if there are 10 warps to be executed, the multiprocessor would execute the warps in round-robin fashion and this number drops again to 375/10 = 37.5 MHz. Am I right?

So, this seems to be a very very puny number.

Can this really speed up applications?
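The arithmetic in the question above can be checked with a short Python sketch. It assumes, as the thread does, a 1500 MHz clock and 4 clock cycles per warp instruction (32 threads issued over 8 processors), and that round-robin sharing divides the rate evenly; the function names are made up for illustration. It also prints the combined rate over all warps, which is the number that matters for throughput.

```python
# Figures taken from the thread; the even round-robin split is an assumption.
CLOCK_MHZ = 1500            # clock quoted for the 8800 Ultra
CYCLES_PER_WARP_INSTR = 4   # 32 threads / 8 processors
PROCESSORS = 8              # processors per multiprocessor
WARP_SIZE = 32              # threads per warp


def per_warp_rate_mhz(resident_warps):
    """Instruction rate seen by one warp when the multiprocessor is shared."""
    return CLOCK_MHZ / CYCLES_PER_WARP_INSTR / resident_warps


def aggregate_thread_rate_mhz(resident_warps):
    """Thread-instructions per microsecond summed over all resident warps."""
    return per_warp_rate_mhz(resident_warps) * resident_warps * WARP_SIZE


for warps in (1, 10, 16):
    print(warps, per_warp_rate_mhz(warps), aggregate_thread_rate_mhz(warps))
```

Under these assumptions the per-warp figure shrinks as more warps are added, but the aggregate stays at 8 x 1500 MHz worth of thread-instructions regardless of the warp count.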

I don’t see the point you are making. You are just changing units from instructions per second to warps per second to blocks per second. For comparison, a general purpose CPU using SSE can operate on 4 floats at the same time, with a clock rate that is 2-3x larger than the 8800 GTX (I’m ignoring how many clock cycles each float operation takes, how many can be in flight at once, and all the other stuff). Even with a quad-core CPU, that’s something on the order of 50 GFLOPS if I’m unrealistically generous. The 8800 GTX performs nominally around 300 GFLOPS because it operates on 128 floats at a time at 1/2 or 1/3 the clock rate. In reality, the performance of both is lower due to memory issues, instruction scheduling, branching, etc.

So yes, CUDA does speed up some applications. We’re not just making this up. :)
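To make the comparison above concrete, here is a rough Python sketch of the peak-throughput arithmetic. The specific inputs are assumptions chosen to match the ballpark figures in the post, not measurements: a 3.0 GHz quad-core CPU doing one 4-wide SSE operation per cycle per core, and an 8800 GTX with 128 lanes at a 1.35 GHz shader clock counting a multiply-add as 2 floating-point operations. The helper name `peak_gflops` is made up for illustration.

```python
def peak_gflops(lanes, clock_ghz, flops_per_lane_per_cycle):
    """Nominal peak: lanes x clock x operations per lane per cycle."""
    return lanes * clock_ghz * flops_per_lane_per_cycle


# Assumed CPU: 4 cores x 4 SSE lanes, 3.0 GHz, 1 operation per lane per cycle.
cpu = peak_gflops(lanes=4 * 4, clock_ghz=3.0, flops_per_lane_per_cycle=1)

# Assumed GPU: 128 lanes, 1.35 GHz, multiply-add counted as 2 operations.
gpu = peak_gflops(lanes=128, clock_ghz=1.35, flops_per_lane_per_cycle=2)

print(f"CPU peak: {cpu:.1f} GFLOPS")
print(f"GPU peak: {gpu:.1f} GFLOPS")
print(f"ratio:    {gpu / cpu:.1f}x")
```

These are nominal peaks only. As the post says, real performance on both sides is lower once memory, scheduling and branching come into play.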

I think I missed the multi-processing capability. Let me try once more.

Each CPU in a multiprocessor executes at 1500 MHz and there are 8 CPUs.

Each of these CPUs lends 1500/4 = 375 MHz to a warp. So, you can imagine 32 processors running at 375 MHz simultaneously.

Now again, there are many warps to execute. Let us assume a full quota of 16 warps for the block. Each of these 32 CPUs running at 375 MHz will now have to time-share between 16 warps, i.e. each of these 32 CPUs will be emulating 16 CPUs, each running at 23.4375 MHz (375/16).

So, when a block is running entirely on a GTX 1.5 GHz multiprocessor, one can imagine 512 CPUs running at 23.4375 MHz CONCURRENTLY. Note that this concurrency is NOT a fake one. I mean really concurrent.

And we have 16 such multiprocessors. Assuming each multiprocessor is executing 1 block at a time, we can imagine 8192 CPUs, each running at 23.4375 MHz, really simultaneously.

Is my view right?

I did not see a mistake in your calculations, but then again, I also cannot see the use of it. Theoretical performance is just that: theoretical. Your problem dictates your maximum performance, so trying is knowing.
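For anyone who wants to re-check the numbers in the model above, here is a small Python sketch. All figures come from the posts (8 processors per multiprocessor at 1500 MHz, 4 cycles per warp instruction, 32 threads per warp, 16 warps per block, 16 multiprocessors, one block per multiprocessor); the variable names are only illustrative, and the even time-sharing is the model's own simplification.

```python
CLOCK_MHZ = 1500            # per-processor clock
PROCESSORS_PER_MP = 8       # processors in one multiprocessor
CYCLES_PER_WARP_INSTR = 4   # 32 threads / 8 processors
WARP_SIZE = 32              # threads per warp
WARPS_PER_BLOCK = 16        # "full quota" assumed in the post
MULTIPROCESSORS = 16        # one block per multiprocessor assumed

# Each thread of each warp is treated as one slow virtual CPU.
virtual_cpus_per_mp = WARP_SIZE * WARPS_PER_BLOCK
virtual_cpu_mhz = CLOCK_MHZ / CYCLES_PER_WARP_INSTR / WARPS_PER_BLOCK
virtual_cpus_total = virtual_cpus_per_mp * MULTIPROCESSORS

# Total work rate of the virtual CPUs versus the physical processors.
virtual_total_mhz = virtual_cpus_total * virtual_cpu_mhz
physical_total_mhz = MULTIPROCESSORS * PROCESSORS_PER_MP * CLOCK_MHZ

print(virtual_cpus_per_mp, virtual_cpu_mhz, virtual_cpus_total)
print(virtual_total_mhz, physical_total_mhz)
```

The two totals come out equal, so the 8192 slow virtual CPUs are the same budget as 128 physical processors at 1500 MHz, just expressed in different units.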

I was just trying to visualize a palpable model which is free of multi-tasking etc… Just the real-parallel computational equivalent.

Also, I wanted to get an endorsement of my understanding of warps. Actually, I had a misconception earlier that there were 768 cores inside one multiprocessor, which I got cleared up later. So, I just wanted to verify my understanding with other members of the group.

Glad that people helped with this. Thanks.