Fast matrix-matrix multiplication (GEMM) for Fermi

e.ping · August 4, 2010, 7:41pm

We developed improved MAGMA BLAS SGEMM and DGEMM routines for Fermi GPUs. The sources are now available through the MAGMA website. The new routines will be part of the upcoming MAGMA 0.3 library release and will be included in CUBLAS 3.2 as well.

The basic algorithm is described in:
Nath, R., Tomov, S., Dongarra, J., An Improved MAGMA GEMM for Fermi GPUs, University of Tennessee Computer Science Technical Report, UT-CS-10-655 (also LAPACK working note 227), July 29, 2010.

On a C2050 GPU the new DGEMM gets up to 300 GFlop/s (58% of peak) and the SGEMM up to 645 GFlop/s (63% of peak). On a GTX480, DGEMM gets up to 166 GFlop/s and SGEMM up to 844 GFlop/s.
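For readers who want to check the percent-of-peak figures, here is a minimal sketch. The peak values are an assumption taken from the published C2050 specs (448 CUDA cores at 1.15 GHz, 2 flops per fused multiply-add, double precision at half the single-precision rate), not from the report itself, and the helper names `gemm_gflops` and `fraction_of_peak` are made up for illustration.

```python
# Assumed theoretical peaks for a Tesla C2050 (from published specs, not from this post):
# 448 CUDA cores x 1.15 GHz x 2 flops per FMA in single precision, half that in double.
C2050_PEAK_SP = 448 * 1.15 * 2      # about 1030 GFlop/s
C2050_PEAK_DP = C2050_PEAK_SP / 2   # about 515 GFlop/s


def gemm_gflops(m, n, k, seconds):
    """GFlop/s for C = alpha*A*B + beta*C (A is m-by-k, B is k-by-n),
    using the conventional 2*m*n*k flop count."""
    return 2.0 * m * n * k / seconds / 1e9


def fraction_of_peak(achieved_gflops, peak_gflops):
    """Achieved rate as a fraction of the theoretical peak."""
    return achieved_gflops / peak_gflops


print(f"DGEMM: {fraction_of_peak(300, C2050_PEAK_DP):.0%} of peak")  # 58%
print(f"SGEMM: {fraction_of_peak(645, C2050_PEAK_SP):.0%} of peak")  # 63%
```

`gemm_gflops` is the usual way to turn a measured GEMM time into a rate; whether the report counts flops exactly this way is an assumption on my part.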

Stan Tomov

Sarnath · August 5, 2010, 6:44am

Wow! 844 GFlop/s… Nice!

eyalhir74 · August 5, 2010, 7:10am

Hi, Thanks for the post… :)

Page 7 shows one C2050 getting ~240 GFlop/s for a matrix size of 10112, while the Istanbul system gets ~165 GFlop/s.

The Istanbul code ran 48 threads? So one C2050 ran ~30% faster than the WHOLE machine?

So basically I could connect 2 S2050s (a total of 8 Fermis) to that machine and get ~240 * 8 == 1920 GFlop/s,

while the WHOLE CPU code would still give ~165 GFlop/s, right?

thanks

eyal

trudger · August 6, 2010, 2:59pm

It looks like I can’t open the web page for the source. Is it accessible currently?

Thanks for sharing!

Lev · August 7, 2010, 6:37pm

I can download it. Was an Intel 8-core Nehalem system too expensive to buy? Some criticism: what is the point of running MKL on AMD? Was ACML available? Data with ECC on and off would be useful too.

e.ping · August 9, 2010, 5:00pm

Yes, the Istanbul code ran on 48 cores, and 8 Fermis would be expected to give about 2 TFlop/s on LU.
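As a rough sanity check on the "about 2 TFlop/s" estimate, the sketch below just multiplies the per-GPU rate quoted earlier in the thread by the number of GPUs. It assumes perfect linear scaling with no communication or load-balancing cost, so treat the result as an upper bound rather than a prediction. `ideal_aggregate_gflops` is a hypothetical helper name.

```python
def ideal_aggregate_gflops(per_gpu_gflops, num_gpus):
    """Upper bound on the aggregate rate, assuming perfect linear scaling
    (no inter-GPU communication or load imbalance)."""
    return per_gpu_gflops * num_gpus


per_gpu = 240.0  # ~GFlop/s for one C2050 at matrix size 10112, as quoted in the thread
num_gpus = 8     # two S2050 units with four Fermi GPUs each

total = ideal_aggregate_gflops(per_gpu, num_gpus)
print(f"{num_gpus} GPUs, ideal scaling: {total:.0f} GFlop/s ({total / 1000:.2f} TFlop/s)")
```

This comes to 1920 GFlop/s, which is in line with the roughly 2 TFlop/s figure.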

Stan

e.ping · August 9, 2010, 5:21pm

These are all very good points. Regarding the comparison, we just wanted to put the GPU results in the context of a multicore system, and the Istanbul system that we have happens to have the same theoretical peak as the Fermi. Using ACML may indeed be better, although the conclusion would be the same. It looks like packing so many cores into a homogeneous multicore CPU system makes NUMA effects an issue. ECC was turned off. The effect of ECC is about 10%. I assume that in future GPUs the effect of ECC on performance will be reduced.

Stan
