How do I compare performance across different GPU cards, and how do I figure out the GPU clock frequency?

Hi, all, this is my first post.

I ran into a question while trying to compare the performance of my code against the numbers in a published paper. We use different GPUs, and I am wondering if there is any reliable way to compare them side by side.

In the paper, the program runs on a GeForce 8800 GTX, which has 16 multiprocessors and a GPU clock of 570 MHz (according to Wikipedia).

My workstation has a Quadro FX 4600 GPU, which has 12 multiprocessors. I am not exactly sure about the frequency of its GPU clock. Wikipedia says it's 400 MHz, but another website (http://wize.com/graphics-cards/p290992-pny-quadro-fx4600-768-mb) says it's 675 MHz.

My first question is: can I just scale my speed by 16/12 to get an estimate of the speed on a GeForce 8800 GTX? Has anyone done this before, and is it reliable?

Also, is there any way to figure out the frequency of the GPU clock, especially for the Quadro FX 4600?

Thanks a lot in advance!

There is an example in the CUDA SDK called deviceQuery. The prebuilt binary in my SDK targets the FX 4800 and later, but maybe you are lucky and it also runs on the FX 4600.
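If the prebuilt binary refuses to run, a minimal sketch of the same query through the runtime API is easy to compile yourself with nvcc. Note that cudaGetDeviceProperties reports clockRate in kHz, and it is the shader clock, not the "core clock" quoted on spec sheets:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA-capable device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
        /* clockRate is the shader clock, in kHz */
        printf("  Shader clock:    %.2f GHz\n", prop.clockRate / 1e6);
    }
    return 0;
}
```

Build with something like `nvcc query.cu -o query`; it needs a machine with the CUDA toolkit and driver installed.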

My output is:

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce GTX 275"
  CUDA Driver Version:                           3.0
  CUDA Runtime Version:                          3.0
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 1857421312 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.40 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

For verification:

http://www.nvidia.com/object/product_geforce_gtx_275_us.html

The clock there is 1.4 GHz, so the reported value matches the official spec.

There are two clocks of interest for CUDA: the “shader” clock and the memory clock. The shader clock (times the # of SPs or “CUDA cores” or whatever people call them these days) sets the floating point performance. The memory clock (times the width of the bus * 2 for DDR) sets the memory bandwidth. The clocks you are quoting are the “core clocks” and don’t tell you anything about performance. For the 8800 GTX, these clocks are 1.35 GHz (shader) and 0.9 GHz (memory, sometimes quoted as 1.8 GHz because they include the DDR factor of 2).

The reason I mention both clocks is that CUDA programs can be compute bound, or memory bandwidth bound, or somewhere in between. More programs are memory bandwidth bound than you might expect. So to compare two cards, you should look at the ratio of shader clock * SPs and also the ratio of memory clock * memory bus width. Generally, the performance difference will be somewhere between those two ratios.

This scaling can be misleading if you compare GPUs with very different capabilities (like 8800 GTX to GTX 480), but in your case the 8800 GTX and the Quadro FX 4600 are based on the same GPU generation, so the scaling argument should hold up pretty well.

deviceQuery works for all CUDA capable GPUs.

Thank you so much, seibert :rolleyes: This really helps!