CUBLAS performance: many algorithms perform abysmally

Hi everybody,

I am benchmarking the CUBLAS kernels on the Tesla C870. I got about 120 GFLOPS (no I/O) for Sgemm. I got about 30 GFLOPS (again no I/O) for Sgemv, but only for matrix sizes around 8000. Ssymm seems to be slower than Sgemm: I can only get about 80 GFLOPS without I/O, and there is a significant performance drop for matrices larger than 2000, where I only get 70 GFLOPS. Ssymv is the poorest performer of them all; I get only a few GFLOPS even for matrix sizes on the order of 2000.

I calculate the flop counts for matrix-matrix and matrix-vector multiplication as 2n^3 and 2n^2, respectively.

The only numbers that look good to me are those for Sgemv; the rest are much lower than I expected. Am I doing something terribly wrong? Has anybody else got similar numbers?

I’d appreciate some feedback on this.



I’m getting 120 GFLOPS as well on most matrix sizes using an 8800 GTX.

However, I see certain performance peaks in my benchmark, which I want to investigate in the near future.

In my opinion, the CUBLAS library uses a fixed blocking pattern that is not tuned to the number of multiprocessors available or the total shared memory.



From what I keep reading in these forums, you need to place your card in the appropriate PCI-E slot. There’s talk about x16 etc… Also, you may need to enable something in your BIOS to take advantage of certain features.

Check it out. Good luck!

Best Regards,

I was talking about on-board performance without memory copies, though, so whether the card is attached via x8 or x16 should not matter.



I am also referring to throughput without memory transfers.

My board is in fact in an x16 slot. I’ve also heard something about the board having to be the “first” PCIe device, but I’m not sure what this means. I’ll poke around. (BTW, aren’t we supposed to get NVIDIA tech support if we buy a Tesla? I’ll try to bug somebody at NVIDIA about this.)

One possibility is that my board is underpowered, but the good performance of Sgemm contradicts this.

Would you folks be interested in running a set of benchmarks and reporting your numbers if I post some benchmark code over here?
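In the meantime, here is a rough sketch of the kind of benchmark I have in mind, using the legacy CUBLAS 1.x C API, with host-device transfers excluded from the timed region (“no I/O”). Error handling is trimmed, and the matrix size and repeat count are arbitrary choices of mine:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>        /* legacy CUBLAS 1.x API */
#include <cuda_runtime.h>

int main(void)
{
    const int n = 4096;
    const int reps = 10;
    float *hA, *dA, *dB, *dC;

    cublasInit();

    /* Host buffer filled with something non-trivial. */
    hA = (float*)malloc(n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i)
        hA[i] = (float)(i % 100) * 0.01f;

    cublasAlloc(n * n, sizeof(float), (void**)&dA);
    cublasAlloc(n * n, sizeof(float), (void**)&dB);
    cublasAlloc(n * n, sizeof(float), (void**)&dC);
    cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), hA, n, dB, n);

    /* Warm-up call, then time `reps` calls with CUDA events so the
       host<->device transfers above are excluded from the measurement. */
    cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
    cudaThreadSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int r = 0; r < reps; ++r)
        cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double seconds = ms / 1e3 / reps;
    printf("Sgemm n=%d: %.1f GFLOPS\n", n,
           2.0 * (double)n * n * n / seconds / 1e9);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    free(hA);
    cublasShutdown();
    return 0;
}
```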


Motherboards vary quite a bit in how many PCI-Express lanes they have and how they allocate them to the available slots. Adding to the confusion, the PCI-Express standard distinguishes between the “physical” size of a slot (the actual size of the board it can accept) and its “electrical” size (the number of active lanes). There are a number of motherboards with two physical x16 slots which are only x16 electrical in the first slot and x8 in the second. Some motherboards even downgrade both slots to x8 when both are full. This is more common on 4-slot x16 motherboards whose chipset doesn’t actually have enough PCI-Express lanes to drive all 4 slots at x16 bandwidth: if you install 4 cards, all the slots drop to x8.

The important thing is to find out the manufacturer and model # of your motherboard so you can look up its PCI-Express configuration.

(Newer motherboards, like the MSI K9A2, can now operate 4 slots at full x16 bandwidth, though I haven’t seen any test results yet to verify that.)

It means that it has to be device 0 when you run deviceQuery. If not, you must select the appropriate device in your code. So if you have, e.g., an 8800 GT listed first, then programs that do not explicitly select a device will actually run on the 8800 GT.
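For completeness, a sketch of how a program can enumerate the devices and pick one explicitly instead of defaulting to device 0; the selection criterion here (largest global memory) is just one possible choice, not anything CUBLAS requires:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0, best = 0;
    size_t bestMem = 0;
    cudaGetDeviceCount(&count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s (%lu bytes global memory)\n",
               d, prop.name, (unsigned long)prop.totalGlobalMem);
        if (prop.totalGlobalMem > bestMem) {
            bestMem = prop.totalGlobalMem;
            best = d;
        }
    }

    /* Without this call, CUDA (and hence CUBLAS) uses device 0. */
    cudaSetDevice(best);
    return 0;
}
```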

If you post code that runs on Linux, I can see what I get on an 8800 GTX tomorrow. The numbers I get should be the same as for the Tesla.