FYI - A good Multi-GPU OpenCL benchmark app…

DirectCompute & OpenCL Benchmark. (By Pat.)

http://www.ngohq.com/graphic-cards/16920-d…-benchmark.html

Version v0.45 is special. It has outstanding Multi-GPU workload balance.

Version v0.44 looked like this under load:

First 1/2 of my 295 reporting 61% utilization…

Second 1/2 of my 295 reporting 58% utilization…

280 checking in at a whopping 92% utilization… (Go Dedicated PhysX processor!)

Very light CPU utilization, showing only 2%.

Best score generated on v0.44:

New improved version 0.45, with better workload balancing…

97%, 98%, and 98% GPU utilization… Sweet!

CPU utilization is now up from 2% to 34%.
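
To put a number on the balance improvement, here is a quick Python sketch that averages the utilization readings quoted above. The variable names are mine, and I am assuming the three v0.45 readings are the same three GPUs in the same order, which the readout does not actually confirm.

```python
# GPU utilization readings quoted in this post, in percent.
v044 = {"295 (first half)": 61, "295 (second half)": 58, "280": 92}
# Assumption: the post does not say which v0.45 reading belongs to which GPU.
v045 = [97, 98, 98]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


mean_v044 = mean(v044.values())
mean_v045 = mean(v045)

# Spread = gap between the busiest and the idlest GPU; smaller is better balanced.
spread_v044 = max(v044.values()) - min(v044.values())
spread_v045 = max(v045) - min(v045)

print(f"v0.44: mean {mean_v044:.1f}%, spread {spread_v044} points")
print(f"v0.45: mean {mean_v045:.1f}%, spread {spread_v045} points")
```

The spread is the more telling figure here: in v0.44 the 280 was far busier than either half of the 295, while in v0.45 all three sit within a point of each other.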

New high score running v0.45, with all system settings exactly the same as in the v0.44 test. Score is up from C1786.0:

This is a good OpenCL test to show off Multi-GPU Rigs.

We have some OpenCL scores being posted.

http://www.evga.com/forums/tm.aspx?high=&a…p;mpage=1#89761

So that means:

An 8800 GTS and a single 4850 produce around C453.4

A single 260 produces around C707.3

A single XFX HD 5770 1GB produces around C1042.9

A single 295 produces around C1431 using both sides of the GPU…

A single 4890 produces around C2350

A single 295 and single 280 produce around C2575

Luv’s single 5870 produces around C4405
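
To see how the scores listed above relate to each other, here is a small Python sketch that expresses each one relative to my 295 + 280 setup. The numbers are copied from the list; the dictionary and the choice of baseline are just my own way of laying it out.

```python
# Approximate OpenCL scores from the list above (the "C" prefix dropped).
# Note: the post does not say whether the first entry is a combined score.
scores = {
    "8800 GTS and 4850": 453.4,
    "260": 707.3,
    "XFX HD 5770 1GB": 1042.9,
    "295": 1431.0,
    "4890": 2350.0,
    "295 + 280": 2575.0,
    "5870": 4405.0,
}

baseline = scores["295 + 280"]

# Each score as a multiple of the 295 + 280 result.
relative = {name: score / baseline for name, score in scores.items()}

for name, ratio in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{name:<20} {scores[name]:>7.1f}  {ratio:.2f}x")
```

On these figures the single 5870 comes out at roughly 1.7 times the 295 and 280 combined.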

I noticed Pat had posted on page 2…

"Setting different profiles for CPU and OpenCL does not mean anything so you got almost the same results (it’s hard to get the same results for CPU because of background tasks)

The profile combobox is only enabled in DirectCompute tests and forces the DirectX shader compiler to build the GPU code for a specific shader model.

The score you get is simply the number of mega kernel loops (10^6) per second that your CPU can process (using 12 threads). Higher number = better CPU performance.

The scores for different APIs are comparable, so getting C1000 and M10 means your graphics card can handle 100x more calculations per second than your CPU. That's mainly because the GPU can process thousands of threads at the same time without thread switching, while the CPU can usually process only 2, 4 or 8 threads."
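
As I read Pat's description, the score arithmetic works out roughly like the Python sketch below. This is only my interpretation of the quote, not the benchmark's actual code: the function names and the loop count and timing in the example are made up for illustration.

```python
def score(kernel_loops, seconds):
    """Mega kernel loops (10^6 loops) completed per second."""
    return kernel_loops / 1e6 / seconds


def speedup(gpu_score, cpu_score):
    """How many times more loops per second the GPU gets through."""
    return gpu_score / cpu_score


# Hypothetical run: 2 billion kernel loops finished in 2 seconds.
example_score = score(2_000_000_000, 2.0)

# Pat's example: a GPU score of C1000 against a CPU score of M10.
ratio = speedup(1000, 10)

print(f"example score: {example_score:.1f}")
print(f"GPU vs CPU: {ratio:.0f}x")
```

Note that the score only counts loops per unit of time. It says nothing about how the work inside each loop maps onto a given GPU architecture.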

Question: If scores for both CPUs and GPUs are generated by counting mega kernel loops (10^6) per second…

I know Nvidia shaders do more work in 1 clock cycle than ATI shaders.

I wonder if just counting kernel loops will equate to real world performance, when comparing ATI to Nvidia in OpenCL apps?

I still have a hard time accepting that a single 5870 would actually deliver more performance than a 295 and 280 working together, all with high utilization.

I think the app gives accurate performance info when comparing Nvidia to Nvidia, or ATI to ATI, but am still not sure about comparing Nvidia to ATI.

The ‘counting kernel loops’ thing has me wondering now… :)

Opinions are welcome…