Performance of NVS 160M vs Core 2 Duo

I’m developing a CUDA-based application on a laptop with a 3.06GHz Intel T9900 Core 2 Duo CPU and an NVIDIA NVS 160M GPU. Using a simple test kernel (a loop that terminates identically in all threads and performs simple arithmetic on local variables) with the maximum number of threads, I find that the GPU is only about twice as fast overall as a single CPU core working sequentially (for equivalent throughput). If I use both CPU cores, there’s effectively no speedup from using the GPU.
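For reference, a kernel matching that description might look something like the following. This is a minimal sketch, not the asker’s actual code: the kernel name, loop body, and arithmetic are all illustrative assumptions.

```cuda
// Hypothetical sketch of the kind of test kernel described above: a
// fixed-trip-count loop, identical in every thread, doing simple
// arithmetic on local (register) variables.
__global__ void busyKernel(float *out, int iterations)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float a = idx * 0.5f;
    float b = 1.0f;
    // The loop terminates identically in all threads, so there is
    // no warp divergence.
    for (int i = 0; i < iterations; ++i) {
        a = a * b + 0.25f;   // simple multiply-add work in registers
        b = b + a * 0.001f;
    }
    out[idx] = a + b;  // write the result so the compiler can't discard the loop
}
```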

Are these results expected for this CPU and GPU?

The NVS 160M, about 6 years old at this point, is one of the lowest compute-power CUDA-capable discrete GPUs ever produced (a G98-based part with 8 CUDA cores and compute capability 1.1).

Having said that, questions like yours are extremely difficult to answer. Performance depends very much on the specifics of your code, which you haven’t told us much about. Even a statement like “maximum number of threads” isn’t very clear. The maximum number of threads in a grid that GPU can launch is quite large. On the other hand, if you’ve limited yourself to the maximum number of threads that can be simultaneously executing at any given instant, or to the maximum number of threads in a single block, or made any of a number of other possible coding errors, you may be artificially limiting the performance of that GPU.
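These several different “maximum number of threads” limits can all be queried at runtime. The following host-code sketch (illustrative, not part of the asker’s application) prints the distinct limits so you can see which one your launch configuration is actually hitting:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the different "maximum thread" limits a CUDA device reports.
// Only some of these bound how many threads can be resident at once;
// the maximum grid size is far larger than the resident capacity.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("Max resident threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Multiprocessor (SM) count:      %d\n", prop.multiProcessorCount);
    printf("Max resident threads on device: %d\n",
           prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount);
    printf("Max grid dimension (x):         %d\n", prop.maxGridSize[0]);
    return 0;
}
```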

Having said all that, I wouldn’t expect great things out of that GPU. One of the big benefits of a GPU-enabled laptop is being able to have a complete CUDA development environment (compile, run, debug, profile). If you’re really looking for high CUDA compute performance in a laptop, you should look for one that doesn’t have the lowest-end discrete CUDA GPU ever produced.

And if you want to judge the performance of your CUDA code (probably what you had in mind with this question), I would suggest trying one of the various profilers,

and learning how to do analysis-driven optimization.

Thanks for your reply. I will investigate the links you gave.

By “maximum” I meant the maximum number of threads executable in parallel at one time; I mentioned it to head off suggestions to increase performance by unrolling the loop. My test kernel is just that: only a simple loop, to simulate the final app. The object of my test is to find an upper bound on the performance increase I can expect. To the best of my knowledge I cannot increase performance by increasing parallelism, reducing branching, or reducing memory access, since these are already maximised or minimised, as appropriate, in the test kernel. I will see if the profiler gives a different view.

To get the most performance out of the GPU, you generally want “lots” of threads; GPU threads do not behave like CPU threads in this respect. If you are limiting the number of threads you launch on the GPU to some small number (say, less than a few thousand), you are probably making a GPU programming error. Two basic keys to GPU performance are to launch enough threads to “saturate the machine” (there is generally little penalty for exceeding this), and to organize memory accesses for coalescing.
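Those two keys can be sketched as follows. This is an illustrative example (the kernel name `scale` and the launch parameters are assumptions), not the asker’s code: launch one thread per element, so large inputs give you many thousands of threads, and have adjacent threads touch adjacent memory so accesses coalesce.

```cuda
#include <cuda_runtime.h>

// One thread per element: for large n this launches many thousands
// of threads, which is what the GPU needs to stay busy.
__global__ void scale(const float *in, float *out, float k, int n)
{
    // Adjacent threads in a warp read/write adjacent elements, so
    // these accesses coalesce into wide memory transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];
}

// Typical launch: round the grid size up so every element is covered.
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   scale<<<blocks, threads>>>(d_in, d_out, 2.0f, n);
```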