Finding the theoretical FLOPS of an OpenCL device

I’ve been delving into OpenCL and I’m wondering if there is a way to find the theoretical FLOPS of the OpenCL devices on a host. When you’re deploying code, you would want it to run on the fastest device present and be able to choose this dynamically based on the user’s hardware.

I was unable to find a forum post about this.

I thought I would be able to calculate FLOPS from the information returned by the clGetDeviceInfo function. Unfortunately, the number of compute units it reports is not the number of processing elements in all cases. On nVidia hardware, the number of compute units is the number of streaming multiprocessors, not the number of stream processors as I would have hoped.
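For reference, this is roughly the query I have in mind (just a sketch, no error checking):

#include <CL/cl.h>
#include <stdio.h>

/* Print the compute-unit count and clock frequency of a device.
 * On nVidia, CL_DEVICE_MAX_COMPUTE_UNITS is the number of streaming
 * multiprocessors, not the number of stream processors. */
void print_device_caps(cl_device_id device)
{
    char name[256];
    cl_uint compute_units = 0;
    cl_uint clock_mhz = 0;

    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                    sizeof(clock_mhz), &clock_mhz, NULL);

    printf("%s: %u compute units @ %u MHz\n", name, compute_units, clock_mhz);
}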

I was hoping to use an equation like this to calculate FLOPS:
FLOPS = ClockRate * ALUs * FLOpsPerClockCycle
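For example, a device with 448 ALUs clocked at 1.15 GHz, each able to issue one fused multiply-add (2 FLOPs) per cycle, would come out to 448 * 1.15 GHz * 2 ≈ 1.03 TFLOPS.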

But without a reliable way to get the number of ALUs in a compute device, this approach falls apart. I might be able to take the number of compute units and multiply it by 8, but then I would need a special case for Fermi devices and multiply by 32 instead. And that still doesn't account for ATI devices (which some people may use). At that point the heuristic becomes very inelegant and will break whenever a new architecture comes along. A rough sketch of what that would look like is below.
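Something like the following is the best I can come up with; the per-architecture multipliers are hard-coded guesses (8 for pre-Fermi nVidia, 32 for Fermi), is_fermi() is a hypothetical helper I would still have to write, and ATI devices aren't handled at all:

/* Very rough estimate of peak single-precision GFLOPS.
 * The ALUs-per-compute-unit numbers are hard-coded guesses that will
 * break as soon as a new architecture ships. */
double estimate_gflops(cl_device_id device)
{
    cl_uint compute_units = 0;
    cl_uint clock_mhz = 0;

    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                    sizeof(clock_mhz), &clock_mhz, NULL);

    int alus_per_cu = 8;          /* guess for pre-Fermi nVidia */
    if (is_fermi(device))         /* hypothetical helper, not a real API */
        alus_per_cu = 32;         /* guess for Fermi */
    /* ATI/AMD and future architectures: no good answer here */

    int flops_per_cycle = 2;      /* assuming one multiply-add per ALU per cycle */
    return (double)compute_units * alus_per_cu * flops_per_cycle
           * clock_mhz / 1000.0;  /* MHz -> GFLOPS */
}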

Has anyone found a good way to do this? How would you go about choosing a device to attach a context, command queue, kernel, etc. to?

What about running sample kernel(s) on each platform and device and measuring actual performance?

Wouldn’t that take too long? And would that give accurate (consistent) results?

I think running actual kernels will give much more accurate results than comparing theoretical performance, especially if you skip the first run of the sample kernel so that lazy buffer allocations are not counted.

What would you suggest that I run? Something with just a bunch of raw operations, like a matrix dot product? If so, what would an appropriate size be? In CUDA, you don't actually get an advantage on a GPU until you start doing dot products between matrices several hundred elements wide by several hundred elements high.
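For concreteness, what I had in mind as a sample kernel is a naive matrix multiply along these lines (just a sketch, not tuned at all):

/* Naive single-precision matrix multiply, C = A * B, all N x N.
 * Purely a compute straw man; a real benchmark should mirror the
 * compute/memory mix of the actual application. */
__kernel void matmul_naive(__global const float *A,
                           __global const float *B,
                           __global float *C,
                           const int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    if (row >= N || col >= N)
        return;

    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}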

You said, "When you're deploying code, you would want it to run on the fastest device present and be able to choose this dynamically based on the user's hardware." So when you deploy the code, you would run it (or a representative sample of it) on all available devices. You might also re-run the benchmark on demand, for example when a new device is added to the system, when the OpenCL drivers are updated, or for any other reason.

There's another problem that argues for dynamic selection using the actual code rather than theoretical or even measured real-world GFLOPS: the ratio between raw GFLOPS and memory bandwidth changes from generation to generation, and many CUDA and OpenCL programs are memory-bandwidth bound, so GFLOPS alone is not a good metric of a GPU's real potential for a given program.

Give your program a double run: a first run to pay for lazy allocations and to let the GPU ramp up to full clock speed (depending on the platform that can take seconds!), then a second run to measure the real-world efficiency of the device with your own program and a representative test set.
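A minimal sketch of what I mean, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and that you already have a kernel with its arguments set (all names here are placeholders):

#include <CL/cl.h>

/* Time one enqueue of a kernel using OpenCL event profiling.
 * Returns the elapsed device time in seconds. */
double time_kernel(cl_command_queue queue, cl_kernel kernel,
                   size_t global_size)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                           NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    return (end - start) * 1e-9;  /* nanoseconds -> seconds */
}

/* First run pays for lazy allocations and lets the clocks ramp up;
 * only the second run counts as the device's score. */
double benchmark_device(cl_command_queue queue, cl_kernel kernel,
                        size_t global_size)
{
    time_kernel(queue, kernel, global_size);        /* warm-up, discarded */
    return time_kernel(queue, kernel, global_size); /* measured run */
}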