I was wondering, is there any way to control how many multiprocessors are used for a given data set?
I'm looking for confirmation or rejection of this theory:
Background (from CUDA occupancy calculator):
I own a GTX 280, which has 30 multiprocessors
Kernel shared usage limits me to 128 threads per block
Register usage limits me to 8 blocks per multiprocessor
If I run a kernel with just 8 blocks, is it guaranteed to run on a single multiprocessor, or will it be balanced across 8 multiprocessors?
If I run a kernel with 9 blocks, is it going to fully occupy one multiprocessor and put the remaining block on another?
If I run a kernel with 16 blocks, is it going to fully occupy two multiprocessors?
If I run a kernel with 241 (30 * 8 + 1) blocks, is it going to fully occupy all multiprocessors, plus one block on its own?
Basically, what I’m trying to achieve here is to see the speedups from throwing an increasing number of multiprocessors at a constant-size data set. Short of purchasing several different graphics cards, I’m looking for a way to emulate the performance of other cards.
Any fewer than 30 blocks should assign one block to each multiprocessor. With more than 30, up to 240, all blocks will be dispatched at kernel launch, every multiprocessor will get at least one block, and each multiprocessor will time-slice between its blocks. With more than 240, some blocks will have to wait for other blocks to finish before they can be assigned to a multiprocessor.
Simulating the performance of other cards is going to be a challenging task. Memory bandwidth differs substantially between cards, and compute capability 1.3 cards will coalesce memory accesses in some cases where 1.1 cards do not. Good 1.1 cards only cost about $150, so I’d say get at least one and extrapolate to other 1.1 cards from it.