Performance evaluation

Hi everybody,

I am currently running a program that uses a kernel operating on a worst-case scenario, to evaluate the maximum frame rate we can get with our graphics card, a GeForce 8600.

In our final product we will be using an FX 3600, and I was wondering if there is a way to predict, approximately, what the performance will be on that card.

According to the CUDA Profiler, this specific kernel executes approximately 124 instructions per microsecond. Knowing how many instructions the kernel needs, I can predict how long it will take to run on the GPU of my GeForce 8600, in microseconds, and so far this has been quite accurate.

Knowing that the FX 3600 has 12 multiprocessors and the GeForce 8600 has only 4, can I say that the instruction throughput should be around 3 times that of the GeForce, so about 372 instructions/microsecond, or is that too simplistic a way to look at the card’s performance under CUDA? :unsure:
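(A back-of-the-envelope sketch of that scaling estimate, assuming instruction throughput scales linearly with multiprocessor count; the variable names are just for illustration:)

```python
# Naive scaling estimate: assume instruction throughput scales
# linearly with the number of multiprocessors.
measured_instr_per_us = 124.0   # measured on the GeForce 8600 with the CUDA Profiler
sm_geforce_8600 = 4
sm_fx_3600 = 12

predicted = measured_instr_per_us * sm_fx_3600 / sm_geforce_8600
print(predicted)  # → 372.0
```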

If I’m wrong, what would give me a better idea of my kernel’s performance on an FX 3600?

Thank you in advance !


There are two other factors to consider when predicting performance:

  • Stream processor (aka “shader”) clock rate: I can’t seem to find the shader clock for the FX 3600, but you should also compare it to the 8600’s. Be sure to look at the shader clock, not the core clock.

  • Memory bandwidth: If your kernel is memory bound (many kernels are), then performance will scale with the memory bandwidth, not with the number of stream processors.
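(A sketch of how each factor would enter the estimate. All numbers below are placeholders, not real specs for either card:)

```python
# Compute-bound case: throughput scales roughly with
# (number of stream processors) x (shader clock).
sps_old, shader_clock_mhz_old = 32, 1200   # placeholder values
sps_new, shader_clock_mhz_new = 96, 1200   # placeholder values
compute_speedup = (sps_new * shader_clock_mhz_new) / (sps_old * shader_clock_mhz_old)

# Memory-bound case: throughput scales roughly with global memory bandwidth.
bw_old_gbps, bw_new_gbps = 30.0, 50.0      # placeholder GB/s
memory_speedup = bw_new_gbps / bw_old_gbps

print(compute_speedup)           # → 3.0
print(round(memory_speedup, 2))  # → 1.67
```

The point is that the two cases can give very different predictions for the same pair of cards.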


I’ll look into this.


Here is what I found for the FX 3600:

I can’t find the stream processors’ frequency either.

The memory bandwidth is 51.2 GB/s, compared to 32.0 GB/s for the GeForce 8600.

Just want to make sure: by a memory-bound kernel, do you mean one that uses shared memory to do operations on the device?

So, assuming my kernel is memory bound, the performance gain would be 51.2 GB/s / 32.0 GB/s, and the number of multiprocessors (stream processors) would have no impact on performance?

Thank you once again !


Memory bound means you are limited by the global memory bandwidth (the number you listed). If you do fewer than hundreds of floating-point operations for each global memory read, your kernel is most likely memory bound.
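(That rule of thumb can be sketched as a quick check; the helper and the per-thread counts here are hypothetical, and the threshold is only a rough guide:)

```python
# Rule-of-thumb check: a kernel is likely memory bound unless it performs
# on the order of hundreds of FLOPs per global memory read.
def likely_memory_bound(flops_per_thread, global_reads_per_thread,
                        flops_per_read_threshold=100):
    """Hypothetical helper; the threshold is a rough rule of thumb."""
    arithmetic_intensity = flops_per_thread / global_reads_per_thread
    return arithmetic_intensity < flops_per_read_threshold

print(likely_memory_bound(flops_per_thread=40, global_reads_per_thread=4))    # → True
print(likely_memory_bound(flops_per_thread=2000, global_reads_per_thread=4))  # → False
```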

Dividing the two memory bandwidths to get an effective speedup should give you a very rough estimate of the performance if your kernels are memory bound. For a concrete example: my app is memory bound and performs 18% faster on the 8800 GTX than on the 8800 GTS (G92). The 8800 GTX’s memory is 35% faster than the 8800 GTS (G92)’s: 86.4/64.
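(The arithmetic behind that concrete example, using the bandwidth numbers quoted above:)

```python
# Memory-bound scaling estimate from the 8800 GTS (G92) to the 8800 GTX.
bw_gts_g92 = 64.0   # GB/s
bw_gtx = 86.4       # GB/s

predicted_speedup = bw_gtx / bw_gts_g92
print(round((predicted_speedup - 1) * 100))  # → 35 (% bandwidth advantage)
# Observed in practice: ~18% faster, so bandwidth scaling is only a rough guide.
```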

It seems I won’t really be able to find out the performance gain until I test it.

My current test kernel is not memory bound, but depending on the technological decisions we make and the applications we end up running on the GPU, my CUDA functions will probably be memory bound.

Thanks for all the info; it’s getting clearer, and I’m learning more every day about GPGPU in the CUDA environment :thumbup: