Scalability issue: scaling over the processors available

I want to analyze the speedups achieved in CUDA as a function of the number of processors available, e.g. 1, 4, 16, 32, up to 112 (8800 GT), and thereby analyze the scalability. I would also like to test how varying the block size affects scalability, e.g. a block size of 16 on 32 processors gives a speedup of x, and so on.

Is there any way of doing this?

There is an ugly hack to disable a processor (for testing only!). It really is a hack, and it adds overhead, but it does work.

Set up two global variables. Initialize one with the number of work blocks. Initialize the other (call it “waiters”) with 0.

To run your kernel with N processors disabled:
At the start of a block, do a global atomic increment on the “waiters” variable. If the block gets a return value of less than N, that processor switches to “SPIN mode”: it just keeps re-reading the work block counter, and when the counter reaches 0, the waiting block exits.

Blocks which didn’t get a “waiter” flag do an atomic decrement on the block count variable. They use THAT number as their effective block index.
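A minimal sketch of the scheme as I read it, not the poster's actual code. The names (waiters, workLeft, throttledKernel), the placeholder workload, and the choice to launch N extra blocks so that N of them can become waiters are all assumptions. It only "disables" a processor when each block is big enough that one block fills a multiprocessor, and N must stay below the number of blocks the card can hold at once or the waiters spin forever.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical names: the posts describe the two counters only in prose.
__device__ unsigned int waiters;  // how many blocks have checked in so far
__device__ int workLeft;          // work blocks not yet claimed

__global__ void throttledKernel(float *out, unsigned int nDisabled)
{
    __shared__ unsigned int ticket;
    __shared__ int item;

    // One thread per block takes a ticket from the "waiters" counter.
    if (threadIdx.x == 0) ticket = atomicAdd(&waiters, 1u);
    __syncthreads();

    if (ticket < nDisabled) {
        // Spin mode: re-read the work counter until every work block is claimed.
        if (threadIdx.x == 0) {
            volatile int *left = &workLeft;
            while (*left > 0) { }
        }
        return;
    }

    // Worker: claim a work block; the decremented value is the effective block index.
    if (threadIdx.x == 0) item = atomicSub(&workLeft, 1) - 1;
    __syncthreads();

    // Placeholder workload (assumption): some arithmetic per element.
    unsigned int idx = item * blockDim.x + threadIdx.x;
    float x = (float)idx;
    for (int i = 0; i < 10000; ++i) x = x * 0.999f + 1.0f;
    out[idx] = x;
}

int main()
{
    const int numWork = 1024;  // real work blocks (assumed size)
    const int threads = 128;   // assumed block size

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float *out = nullptr;
    cudaMalloc(&out, sizeof(float) * numWork * threads);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Always leave at least one multiprocessor free for real work.
    for (unsigned int n = 0; n < (unsigned int)prop.multiProcessorCount; ++n) {
        unsigned int zero = 0;
        cudaMemcpyToSymbol(waiters, &zero, sizeof(zero));
        cudaMemcpyToSymbol(workLeft, &numWork, sizeof(numWork));

        cudaEventRecord(start);
        throttledKernel<<<numWork + n, threads>>>(out, n);  // n extra blocks become waiters
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%u waiter block(s): %.3f ms\n", n, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}
```

The waiters exit as soon as the last work block is claimed, not when it finishes, so the tail of the run gets the "disabled" processors back. That is part of the overhead and inaccuracy of the hack.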

It’s a hack but it works.
It’d be cooler if you could just tell the graphics card driver to temporarily disable some SPs, but nobody’s figured that out yet. (Have they?)

How does that disable a multiprocessor?? A block is a schedulable entity in an MP. It cannot lock an MP out. The warp scheduler will schedule other blocks as well… On the whole, some blocks will do useful work and others will just spin to death…

I don't think this will give you the effect of shutting down MPs.


I think the whole purpose of shutting down MPs and analyzing performance might be futile!!

If you have enough blocks to saturate 8 multiprocessors, and you shut down 4 and run the kernel on the remaining 4, your performance must drop by 50%. I think that's quite a logical conclusion…

Multiprocessors execute in parallel with absolutely no synchronization. The only shared resource they contend for is global memory! So, if your kernel is extremely sensitive to global memory latencies, your performance might come down by, maybe, 35% instead of 50%, because of the reduced contention for global memory.

However, if your kernel is less sensitive to global memory latency, then it is sure to come down by 50%, assuming that your kernel launch completely saturates the MPs.
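The 50% figure can be checked with a back-of-the-envelope model. This is only a sketch under stated assumptions: blocks run in equal-length "waves", each active MP holds a fixed number of blocks, and global memory contention is ignored entirely (so it cannot reproduce the 35% case). The function name and all the numbers are made up for illustration; it is host-only code.

```cuda
#include <cstdio>

// Idealised model (assumption): one wave fills every active MP, all waves
// take the same time, and global memory contention is ignored.
static int wavesNeeded(int blocks, int activeMPs, int blocksPerMP)
{
    int slots = activeMPs * blocksPerMP;
    return (blocks + slots - 1) / slots;  // ceiling division
}

int main()
{
    const int blocks = 4096;    // hypothetical grid, large enough to saturate 8 MPs
    const int blocksPerMP = 1;  // hypothetical occupancy
    const int fullMPs = 8;

    int base = wavesNeeded(blocks, fullMPs, blocksPerMP);
    for (int mps = fullMPs; mps >= 1; mps /= 2) {
        int w = wavesNeeded(blocks, mps, blocksPerMP);
        printf("%d MPs: %d waves, relative throughput %.2f\n",
               mps, w, (double)base / w);
    }
    return 0;
}
```

With these numbers, going from 8 MPs to 4 doubles the number of waves, which is the halving of throughput described above. Any measured deviation from that line is where the interesting effects (memory contention, partial last waves) would show up.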

You should go find out how CUDA works before designing your benchmark exercise.

Not sure if you are aware of this:

BTW, running on 1 processor and so on may not make any sense in CUDA. CUDA is not just powered by raw processor power. It is also powered by effective warp scheduling, which hides latencies. Hiding latency is the key to CUDA performance!

If you run on 1 processor, 2 processors, etc., no latency will be hidden and you might get abysmally low performance.

You’re right that the hack doesn’t work in the general case… it only works when your blocks use enough resources that no second block can be scheduled on the same MP at the same time. That’s common. In most of my apps, I’m stealing every byte of shared memory possible, so my one-block occupancy holds.
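Whether a given kernel really is limited to one resident block per MP can be checked rather than assumed. The sketch below uses cudaOccupancyMaxActiveBlocksPerMultiprocessor from the runtime API, which was added to CUDA well after this thread was written. The kernel (paddedKernel) and the shared memory sizes are hypothetical, chosen only to show how growing shared memory use lowers the number of resident blocks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the dynamic shared memory is there only to take up room on the MP.
__global__ void paddedKernel(float *out)
{
    extern __shared__ float pad[];
    pad[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = pad[threadIdx.x];
}

// Ask the runtime how many blocks of paddedKernel fit on one MP at once.
static void report(int threads, size_t bytes)
{
    int blocksPerMP = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerMP, paddedKernel,
                                                  threads, bytes);
    printf("%zu bytes of shared memory: %d resident block(s) per MP\n",
           bytes, blocksPerMP);
}

int main()
{
    const int threads = 128;  // assumed block size

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Double the shared memory request until the default per-block limit is reached.
    for (size_t bytes = threads * sizeof(float); bytes < prop.sharedMemPerBlock; bytes *= 2)
        report(threads, bytes);
    report(threads, prop.sharedMemPerBlock);
    return 0;
}
```

If the last line still reports more than one resident block, which can happen on cards whose MPs have more shared memory than a single block may request by default, the spin hack will tie up block slots rather than whole MPs.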

The real answer would be a driver option (or even a CUDA extension). It may also be card firmware… is a GTX 280 turned into a GTX 260 by blowing hardware fuses? By board-level changes? By firmware?

What I’d especially love (but it’s unknown if the hardware supports it) would be multiple simultaneous kernels (kernel 0 gets these 2 SPs, kernel 1 gets those 4 SPs, etc.). That’d be awesome for giving the graphics subsystem absolute possession of some resources to prevent display lag and stutter, freeing your compute threads to be greedy and selfish.

But now I’ve drifted onto a different topic…

Something along those lines was mentioned on a slide at NVISION. At least that is what MrAnderson and I were thinking Adaptive Workload Partitioning meant.

With the stream API in place, this is a logical place to go with the architecture.

BTW, any theories on what “Virtual Pipeline” means?

I’m going to guess that it’s referring to the graphics pipeline. A virtual pipeline would drop any assumptions about processing steps and instead define them all in software. This is the plan for Larrabee as well.

Currently, you have things like object decomposition, vertex setup, tessellation, frustum culling, projection and rasterization, shading, texture mapping, etc.

A virtual pipeline removes all assumptions and restrictions about what that pipeline does, how many stages it has, etc. Much more versatile. The penalty is that software is often slower than fixed hardware, but the versatility gain can offset that, often significantly, giving net speedups because you can optimize (or delete) parts of the pipeline, or use resources more flexibly.

This has already happened once recently: G70 had separate vertex and pixel stages with unique hardware for each. G80 unified the two, making vertex and pixel shading software-defined, allowing better load balancing and therefore faster evaluation.

The above is just a guess, BTW, based on two words on a slide.

Sure, sounds reasonable, but I’m trying to understand the physical manifestation of this. Perhaps a way to send data directly from one multiprocessor to another, bypassing global memory?

Also sounds like a good idea once you have the ability to divide multiprocessors into groups to run multiple kernels simultaneously.