MatrixMultiplication GPU 1080Ti/CompilationOptimization?

I have recently installed two GTX 1080 Ti cards on an X99 motherboard equipped with a 20-core i7… I want to dispatch matrix multiplication jobs to the CPU and the GPUs, and I expected a very fast calculation on the GPU…
A little program in F90 showed that the computing time is nearly the same on the CPU (using OpenBLAS) and on the GPU (OpenACC or cuBLAS dgemm)… nearly 5 seconds for 10000x10000…
Running the same program on a Mac equipped with a Quadro 4000 is only 3 times slower… (surprising, as the 1080 Ti should be much faster).
I feel that I am not using the 1080 Ti in an optimized way!
Can someone help me?
Thanks a lot
Pierre

Do your benchmarking timings include the transfer times of the matrices via PCIe bus?

Christian

Try sgemm instead of dgemm. Your 1080 Ti’s double-precision throughput is only 1/32 of its single-precision throughput, which puts its double-precision throughput in the same ballpark as your 20-core i7’s.
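
A quick sanity check on that, as a rough sketch: a dense n x n matrix product takes about 2n^3 floating-point operations, so the 5 seconds reported above for n = 10000 can be turned into a sustained rate. The function name below is made up for illustration.

```python
def gemm_gflops(n: int, seconds: float) -> float:
    """Sustained GFLOP/s for multiplying two n x n matrices."""
    flops = 2.0 * n**3  # about n^3 multiplies plus n^3 adds
    return flops / seconds / 1e9


# Figures reported in the original post: 10000x10000 in about 5 seconds
achieved = gemm_gflops(10_000, 5.0)
print(f"{achieved:.0f} GFLOP/s")  # 400 GFLOP/s
```

A few hundred GFLOP/s in double precision is about what a 1/32 rate allows on this class of card, so the result points to a hardware limit rather than a badly built program.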

Thanks Christian, the problem doesn’t come from the data transfer (about 5% of the total computation time).
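
That share is consistent with a back-of-envelope estimate. The sketch below assumes roughly 12 GB/s of effective PCIe 3.0 x16 bandwidth (an assumed figure, not measured on this machine) and that A and B are copied to the GPU while C is copied back.

```python
N = 10_000
BYTES_PER_DOUBLE = 8
PCIE_GB_PER_S = 12.0   # assumed effective bandwidth, not measured
TOTAL_SECONDS = 5.0    # run time reported in this thread

matrix_bytes = N * N * BYTES_PER_DOUBLE   # 800 MB per matrix
moved_bytes = 3 * matrix_bytes            # A and B in, C out
transfer_seconds = moved_bytes / (PCIE_GB_PER_S * 1e9)
share = transfer_seconds / TOTAL_SECONDS

print(f"transfer ~{transfer_seconds:.2f} s, {share:.0%} of the run")
```

With these assumptions the copies take about 0.2 s, so the multiplication itself dominates the run time.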

Thanks to you…
Using sgemm works and the GPU is then faster than the CPU, but
unfortunately we need double precision.
Could it be a problem of compiling for the correct Pascal architecture
in PGI Fortran?
Thanks

If you need double precision, the 1080 Ti isn’t where you are going to find it. Tesla GPUs are going to be your best bet.

For example, the P100 or KX0 series Tesla GPUs.

The difference is substantial: the 1080 Ti has roughly 350 GFlops of DP performance, while the K20, the worst of the Tesla GPUs for DP that I suggested, has close to 1200 GFlops of DP performance.

The P100 has over 4 TFlops of DP performance.

(I neglected the K10 as it does not have good DP performance)
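
To put those figures in terms of the 10000x10000 dgemm from the original post, here is a rough sketch. It assumes the kernel runs at the quoted peak, which real code does not quite reach, so treat the times as best-case estimates.

```python
N = 10_000
FLOPS = 2.0 * N**3  # about 2*n^3 operations for an n x n product

# DP peaks in GFLOP/s as quoted in this thread (approximate)
dp_peak = {"GTX 1080 Ti": 350.0, "Tesla K20": 1200.0, "Tesla P100": 4000.0}

# Best-case run time in seconds if the kernel ran at peak
best_case = {gpu: FLOPS / (gflops * 1e9) for gpu, gflops in dp_peak.items()}

for gpu, seconds in best_case.items():
    print(f"{gpu:12s} {seconds:5.2f} s")
```

The 1080 Ti estimate lands close to the 5 seconds measured, while the Tesla cards would bring the same job down to between half a second and two seconds.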

The older Kepler based GTX Titan and Titan Black 6GB models had unlocked DP throughput. You might find them quite cheap as used models.

However, the modern Pascal-based Tesla offerings such as the P100 would offer higher DP throughput at lower power consumption (but at high cost).

That’s the series I was looking for! I knew older-generation Titans supported DP, but I only looked at Maxwell and forgot to look at Kepler.

Thanks to all (bha4395, cbuchner1, tera, christian) for your prompt and kind responses!
Apparently, if I understand you correctly, I was a little hasty in purchasing the two 1080 Ti cards.
For double-precision scientific computing (at a low price), it seems that Titan Black cards
would have been preferable, if I understand correctly…
Do you know if the drivers for such cards are easy to configure to get the 1/3 FP64 rate, which is nearly equivalent to the K40c?
(See the ArrayFire site, which explains FP64 performance on GPUs.)
Do you know whether the drivers used with ArrayFire or MAGMA can be set to make the most of the compute performance of
the cards (1080 Ti or Titan X)…
Thanks a lot to all for your help to beginners in GPU programming (which seems to be a fabulous world).
Very Friendly greetings from Paris
Pierre

I’m slightly confused by your question.

Are you looking for a GPU that has the FP performance of 1/3 of the K40?

https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units

This wiki page lists single- and double-precision performance for all NVIDIA GPUs.
Maybe this will be helpful for whatever you are looking for.

The 1/3 is referring to the idea that Kepler family products come in 2 categories:

  1. Those whose DP throughput is 1/24 of the SP throughput. An example is K10
  2. Those whose DP throughput is 1/3 of the SP throughput. Examples are K20, K40, K80. Certain Titan family members are in this category as well.
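
As a small illustration of how far apart those two categories are, assume a hypothetical Kepler part with a 4000 GFLOP/s single-precision peak (a round number chosen for the example, not the spec of any particular card):

```python
from fractions import Fraction

SP_PEAK_GFLOPS = 4000  # hypothetical single-precision peak

# DP:SP ratios for the two Kepler categories listed above
dp_ratio = {
    "category 1 (e.g. K10)": Fraction(1, 24),
    "category 2 (e.g. K20, K40, K80)": Fraction(1, 3),
}

dp_peak = {name: float(SP_PEAK_GFLOPS * r) for name, r in dp_ratio.items()}
gap = dp_ratio["category 2 (e.g. K20, K40, K80)"] / dp_ratio["category 1 (e.g. K10)"]

for name, gflops in dp_peak.items():
    print(f"{name}: about {gflops:.0f} GFLOP/s DP")
print(f"category 2 is {gap}x faster in DP at the same SP peak")
```

At equal single-precision peak, a 1/3 part delivers eight times the double-precision throughput of a 1/24 part.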

Thanks txbob. That makes a lot of sense. Would have never known it otherwise.

Off the top of my head, only the first Titan had good DP performance.

The original Titan, Titan Black, and Titan Z all had the option of elevated DP performance.