benchmarking GPUs

Is there a program available that measures the GFLOP rating of Nvidia GPUs? I’ve downloaded SDK 2.0 but can’t find a program that does this.

Thanks in advance. :ph34r:

I posted a simple gflops test here a while ago:

Your mileage may vary!

Thanks! I’ll have a look at it. Much appreciated. :ph34r:

Got it working. Ran it on the Tesla C870 and got 319 GFLOPS (benchmarking only one GPU). The theoretical peak is 430 GFLOPS, so I suppose that’s not too bad.
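As a sanity check on that result, the measured-to-peak ratio works out to about 74% (a quick sketch; both figures are just the ones quoted above in this thread):

```python
# Efficiency of the measured rate against the quoted theoretical peak.
measured_gflops = 319.0   # measured on the Tesla C870
peak_gflops = 430.0       # theoretical peak quoted above
efficiency = measured_gflops / peak_gflops
print(f"{efficiency:.1%}")  # -> 74.2%
```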

Cheers :thumbup:

Keep in mind that while the word “GFLOPS” gets used a lot, there are really only two things it is commonly measured and quoted for: theoretical peak performance, and LU matrix decomposition (which is mostly matrix-multiply, plus a bunch of data shuffling), aka Linpack. You can try to apply the measure to any sort of code, but it’s not usually a meaningful exercise. Now, on the one hand, Linpack is only marginally relevant to answering the question of “what’s, really, the achievable performance on a real app”; on the other hand, an artificial MADding-registers-endlessly benchmark answers that question even less effectively (although the exercise has its uses).

So, in summary: the GPU can’t do 319 GFLOPS on Linpack (or even on a single-precision matrix-multiply), so you can’t call that a “sustained” figure and advertise it as such to other HPC people. The number is, really, an “adjusted theoretical” one, and it reveals that NVIDIA has been lying by counting hardware capability that even the most ideal code can’t access.

The numbers quoted in Tesla marketing materials are not “lies” but theoretical peaks. The same is done for other processors as well (for example, 102 GFLOPS for the Intel X5482 Harpertown Xeon). It’s just the peak issue rate.

Claiming that Linpack (or matrix-multiply) is the true measure of the sustained GFLOPS rate is not prudent. No one really cares about the GFLOPS number for apps other than their own, and the GFLOPS rate varies extremely widely from app to app.


I don’t think anyone measures the GFLOPS for their apps. Why would they? When optimizing an application, if you can cut a million operations out of your logic, you won’t care that the average ops per second goes down (e.g. because the instructions-per-memory-access ratio decreased).

GFLOPS gets measured for Linpack, if only because it’s a convenient reference point and there’s an organization that collects results from most of the powerful architectures. What’s interesting is that while Linpack (primarily, matrix-multiply) is not a complex algorithm, it’s nuanced enough to reflect the design of the underlying architecture, the difficulty of optimizing for it, and the time that’s gone into optimizing its libraries. Wouldn’t an impartial observer say that maybe those same factors show up when benchmarking CUBLAS performance?
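For reference, the way a GFLOPS figure is conventionally derived from a timed matrix-multiply (the core of Linpack) is to count 2·n³ flops for an n×n multiply (one multiply and one add per inner-product term) and divide by wall time. A minimal sketch; the one-second timing below is purely illustrative, not a measurement:

```python
def gemm_gflops(n, seconds):
    """GFLOPS rate for an n x n matrix multiply: 2*n^3 flops
    (one mul and one add per inner-product term) over the elapsed time."""
    flops = 2.0 * n ** 3
    return flops / seconds / 1e9

# Illustrative only: a 4096x4096 SGEMM finishing in 1.0 s would be ~137 GFLOPS.
print(round(gemm_gflops(4096, 1.0)))  # -> 137
```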

Re: the un-measurable GFLOPS: what’s the point of dredging up ancient history? But if we must: NVIDIA quoted ops that simply could not be used from CUDA, either via CUDA C or via PTX. Maybe you’re saying NVIDIA is allowed to quote them because a few shaders for several high-profile games, which its internal team rewrote in assembly, had used them? What you guys pulled on the G80 is nothing at all like what Intel does with Xeons; I don’t know why you’d even say that. The Xeon’s 102 is just its 4-wide SSE times 3.2 GHz times four cores times two ops per cycle (a multiply and an add). Intel doesn’t even try to count the old x87 unit (which still works, btw), even though a trivial instruction pump could hit all the SSE units and the FP co-processor too. That’s why I said the code in the other thread gives basically the true theoretical figure.
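The Harpertown arithmetic above does check out; here it is spelled out (a sketch using exactly the factors from the post):

```python
# Theoretical peak for the Harpertown Xeon X5482, per the breakdown above:
# 4-wide SSE, one packed mul + one packed add per cycle, 3.2 GHz, 4 cores.
sse_width = 4
ops_per_cycle = 2        # a packed multiply and a packed add each cycle
clock_ghz = 3.2
cores = 4
peak_gflops = sse_width * ops_per_cycle * clock_ghz * cores
print(peak_gflops)  # -> 102.4
```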

I should also say that the GT200 marketing doesn’t try to pull this stuff anymore. And the G80 episode was much less egregious than when Sony boldly announced its PS3 could do 2 teraflops by counting texture interpolation. Then again, that GPU was also made by NVIDIA :P

Not sure what you’re referring to. All I said in my post was that the numbers you’re unhappy with are purely theoretical peaks - the hardware issue rate. Applications do not reach the hardware issue rate on any architecture. So I’d say it’s fair to compare theoretical peak with theoretical peak, or sustained app rate with sustained app rate, across architectures. I don’t recall us comparing our theoretical peak against another architecture’s app rate.


Couldn’t one just interleave linearly interpolated texture lookups with a barrage of multiply/add instructions to get close to the theoretical peak? Even if the computation weren’t meaningful, it would hit the advertised FLOPS.

Texture interpolation is never counted as flops, even by NVIDIA, because it is not programmable. What was being counted was something else, something that was even less accessible from CUDA.

The thing is, even when counting theoretical FLOPS, people typically use judgement, because even this ‘theoretical’ number is supposed to mean something relevant. paulius, it’s not just an opportunity to put the nicest figure you can find between the couch cushions up on a slide.

I guess what I’m saying is that your theoretical peak can’t be fairly compared to Intel’s theoretical peak. Does that make sense?