emulation for measuring performance scaling


For a grad-level architecture class, my professor will allow us to investigate GPGPU using CUDA, but only on the condition that we test how well algorithms scale with an increasing number of stream processors.

Am I correct in understanding that the current emulation mode does not provide this feature? If so, what other recourse do I have if I can’t get my hands on more advanced cards? Would it be difficult to write a simulator for this?

I would appreciate any feedback, since working with CUDA would be far more interesting to me than the other possible projects.


Emulation is for debugging only. It is very, very slow.
The first-order approximation is that everything scales linearly with the number of multiprocessors. This follows from the simple execution model of the device: it runs N independent blocks. If you have M multiprocessors, each running C blocks concurrently, the running time is simply ceil(N / (M*C)) * (running time of one block). There is nothing else to it. Of course… that approximation assumes large N. Things are different at small N, but no less understandable.
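That first-order model can be sketched in a few lines of Python (not CUDA; the block counts, multiprocessor counts, and block time below are all hypothetical):

```python
import math

def predicted_time(n_blocks, n_multiprocs, concurrent_blocks, block_time):
    # Blocks execute in "waves" of n_multiprocs * concurrent_blocks at a time;
    # total time is the number of waves times the time for one block.
    waves = math.ceil(n_blocks / (n_multiprocs * concurrent_blocks))
    return waves * block_time

# In the large-N regime, doubling the multiprocessor count halves the time:
print(predicted_time(1024, 8, 2, 1.0))   # 64.0
print(predicted_time(1024, 16, 2, 1.0))  # 32.0
```

The ceil is what makes small N behave differently: a launch of one block takes as long as a full wave.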

The solution to your problem is to turn it upside down: on one card, study how the timing scales as a function of PROBLEM size. In my results, I see beautiful stair-step patterns in the running time as the problem size increases and activates more multiprocessors. That is, from 1 to 32 blocks I see exactly the same running time. It steps up at 33, but from 33 to 64 blocks it is constant again. After that, the stair-step pattern degrades into a fairly smooth line. This pattern is consistent with each multiprocessor running 2 blocks concurrently (which matches the occupancy calculator). The problem size gets bigger, but the overall time is limited by the time it takes one multiprocessor to run 1 (or 2) blocks.
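The stair-step pattern falls straight out of the wave model. A quick sketch, assuming 16 multiprocessors (as on an 8800 GTX) and 2 concurrent blocks per multiprocessor:

```python
import math

def waves(n_blocks, n_multiprocs=16, concurrent_blocks=2):
    # Number of full "waves" needed to run n_blocks; running time is
    # proportional to this, producing the stair-step pattern.
    return math.ceil(n_blocks / (n_multiprocs * concurrent_blocks))

# Flat from 1-32 blocks, one step up at 33, flat again through 64:
assert all(waves(n) == 1 for n in range(1, 33))
assert all(waves(n) == 2 for n in range(33, 65))
```

As N grows, each step changes the total by a smaller fraction, which is why the staircase smooths into a line.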

Once you get large enough to be in the linear region, everything should scale essentially perfectly (as long as each of your blocks does a similar amount of work). So, given the number of blocks per multiprocessor and where the turnover to linear performance sits on one device, you could estimate the problem size needed to make full use of a device with, say, 32 multiprocessors. Of course, this analysis also assumes that memory bandwidth increases in proportion…
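One way to make that estimate concrete is to compute what fraction of the last wave is actually filled (a hypothetical sketch, same wave model as above):

```python
import math

def occupancy_fraction(n_blocks, n_multiprocs, concurrent_blocks=2):
    # Fraction of the device's block slots doing useful work over the launch.
    slots = n_multiprocs * concurrent_blocks  # blocks in flight per wave
    waves = math.ceil(n_blocks / slots)
    return n_blocks / (waves * slots)

# 32 blocks saturate the measured 16-multiproc card; a hypothetical
# 32-multiproc part needs 64 blocks to fill even its first wave.
print(occupancy_fraction(32, 16))  # 1.0
print(occupancy_fraction(32, 32))  # 0.5
print(occupancy_fraction(64, 32))  # 1.0
```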

So that’s the theory approach, and I have yet to run a benchmark that contradicts it. If you are the experimental type, just get an 8600, an 8800 GTS, and an 8800 GTX (which all have different numbers of multiprocessors) and extrapolate.

It also depends on whether the problem is memory-bound: if it is, activating more multiprocessors does not necessarily give an equal speed increase. If you’re compute-bound, then you’d expect it to behave as MrAnderson42 says.
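A quick way to check which regime you’re in is to compare your kernel’s effective bandwidth against the card’s peak (the kernel sizes and timing below are made up for illustration):

```python
def effective_bandwidth_gbs(bytes_read, bytes_written, kernel_time_s):
    # Effective bandwidth = total bytes moved / kernel time, in GB/s.
    return (bytes_read + bytes_written) / kernel_time_s / 1e9

# Hypothetical kernel: reads and writes 512 MB each in 15 ms.
bw = effective_bandwidth_gbs(512e6, 512e6, 0.015)
print(round(bw, 1))  # 68.3 -- near the 8800 GTX's ~86 GB/s peak, so likely memory-bound
```

If that number is close to peak, adding multiprocessors without adding bandwidth won’t help much.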