Emulation is for debugging only. It is very, very slow.

To a first-order approximation, everything scales linearly with the number of multiprocessors. This follows from the simple nature of the device, which just runs N independent blocks. If you have M multiprocessors to run the blocks, the running time is simply N / (M * concurrent blocks per multiprocessor) * running time of one block. There is nothing else to it. Of course, that approximation assumes large N. Things are different at small N, but no less understandable.
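To make the first-order estimate concrete, here is a minimal sketch in Python (plain arithmetic, not CUDA). The function name, parameter names, and the numbers in the usage lines are made up for illustration and are not measurements.

```python
def linear_estimate(n_blocks, n_multiprocs, blocks_per_mp, t_block):
    """First-order running time: N / (M * concurrent blocks) * time of one block.

    Only meaningful for large n_blocks; it ignores the stair-step
    behaviour seen at small problem sizes.
    """
    return n_blocks / (n_multiprocs * blocks_per_mp) * t_block


# Hypothetical numbers: 3200 blocks, 2 concurrent blocks per multiprocessor,
# 1.0 time unit per block.
t_16 = linear_estimate(3200, 16, 2, 1.0)  # 16 multiprocessors
t_32 = linear_estimate(3200, 32, 2, 1.0)  # 32 multiprocessors
print(t_16, t_32)  # doubling the multiprocessors halves the estimate
```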

The solution to your problem is to turn it upside down. On one card, study how the timing scales as a function of PROBLEM size. In my results, I see beautiful stair-step patterns in the running time as the problem size increases and more multiprocessors become active. That is, with 1-32 blocks running I see exactly the same running time. Then it steps up at 33 blocks, but from 33-64 it is constant again. After that, the stair-step pattern degrades into a pretty smooth line. This pattern is consistent with each multiprocessor running 2 blocks concurrently (which matches the information from the occupancy analyzer). The problem size gets bigger, but the overall time is limited by the time it takes one multiproc to run 1 (or 2) blocks.
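The stair-step shape falls out of a simple "waves of blocks" model, sketched below in Python. This is an idealized model, not a benchmark: the 16 multiprocessors are an assumption inferred from the step width of 32 blocks at 2 blocks per multiprocessor, and the function name and time unit are hypothetical. It also will not reproduce the smoothing into a line that shows up at larger sizes.

```python
def stair_step_time(n_blocks, n_multiprocs=16, blocks_per_mp=2, t_block=1.0):
    """Idealized running time when blocks execute in full 'waves'.

    All blocks in a wave run concurrently, so the total time only grows
    when one more wave is needed.
    """
    slots = n_multiprocs * blocks_per_mp  # blocks that fit on the device at once
    waves = -(-n_blocks // slots)         # integer ceiling of n_blocks / slots
    return waves * t_block


# Same time for 1..32 blocks, a step at 33, flat again up to 64.
for n in (1, 32, 33, 64, 65):
    print(n, stair_step_time(n))
```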

Once the problem is large enough to be in the linear region, everything should scale perfectly linearly (as long as each of your blocks does a similar amount of work). So, given the number of blocks per multiproc and the point where performance turns over to linear on one device, you could estimate the problem size needed to make full use of a device with, say, 32 multiprocs. Of course, this analysis also assumes that memory bandwidth increases in proportion…
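As a rough sketch of that estimate in Python: take the turnover point measured on one card, express it as blocks per multiprocessor, and scale it to the bigger device. The function name is hypothetical, and the 64-block turnover on 16 multiprocessors is only an assumption read off the stair-step description, not a measured constant. It carries the same caveat about memory bandwidth scaling in proportion.

```python
def scaled_turnover(turnover_blocks, measured_multiprocs, target_multiprocs):
    """Estimate the block count needed to reach the linear region on another device.

    Assumes the turnover happens at the same number of blocks per
    multiprocessor on both devices.
    """
    # Integer ceiling of turnover_blocks * target / measured.
    return -(-turnover_blocks * target_multiprocs // measured_multiprocs)


# Assumed: linear region starts around 64 blocks on a 16-multiprocessor card.
print(scaled_turnover(64, 16, 32))  # estimate for a 32-multiprocessor device
```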

So that’s the theory approach, and I have yet to run a benchmark that contradicts it. If you are the experimental type, then just get an 8600, 8800 GTS, and an 8800 GTX (which all have different numbers of multiprocs) and extrapolate.