Fast matrix-matrix multiplication (GEMM) for Fermi

We developed improved MAGMA BLAS SGEMM and DGEMM routines for Fermi GPUs. The sources are now available through the MAGMA website. The new routines will be part of the upcoming MAGMA 0.3 library release and will be included in CUBLAS 3.2 as well.

The basic algorithm is described in:
Nath, R., Tomov, S., Dongarra, J., An Improved MAGMA GEMM for Fermi GPUs, University of Tennessee Computer Science Technical Report, UT-CS-10-655 (also LAPACK working note 227), July 29, 2010.

On a C2050 GPU the new DGEMM reaches up to 300 GFlop/s (58% of peak) and the new SGEMM up to 645 GFlop/s (63% of peak). On a GTX480, DGEMM reaches up to 166 GFlop/s and SGEMM up to 844 GFlop/s.
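
As a rough check on those percentages, the sketch below converts a timed GEMM into GFlop/s using the usual 2*m*n*k flop count and compares a measured rate with the theoretical peak. The peak figures (515 GFlop/s double precision, 1030 GFlop/s single precision for the C2050) are assumed from NVIDIA's published specifications rather than taken from this post, and the helper names are hypothetical.

```python
# Peak rates assumed from NVIDIA's published C2050 specs, not from this post.
C2050_PEAK_GFLOPS = {"double": 515.0, "single": 1030.0}

def gemm_gflops(m, n, k, seconds):
    """GFlop/s of C = alpha*A*B + beta*C, counting 2*m*n*k flops."""
    return 2.0 * m * n * k / seconds / 1e9

def percent_of_peak(gflops, precision):
    """Measured rate as a percentage of the assumed theoretical peak."""
    return 100.0 * gflops / C2050_PEAK_GFLOPS[precision]

dgemm_pct = percent_of_peak(300.0, "double")
sgemm_pct = percent_of_peak(645.0, "single")
print(f"DGEMM: {dgemm_pct:.0f}% of peak")
print(f"SGEMM: {sgemm_pct:.0f}% of peak")
```

With those assumed peaks, the rounded results match the 58% and 63% quoted above.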

Stan Tomov

Wow! 844 GFlop/s… Nice!

Hi, Thanks for the post… :)

Page 7 shows one C2050 getting ~240 GFlop/s for a matrix size of 10112, while the Istanbul machine gets to ~165 GFlop/s.

Did the Istanbul code run 48 threads? So one C2050 ran ~45% faster than the WHOLE machine?

So basically I could have connected 2 S2050s (a total of 8 Fermis) to that machine and gotten ~240 * 8 == 1920 GFlop/s

while the WHOLE CPU code would still give ~165GFlops, right?
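
Written out as a short sketch, the arithmetic looks like this. It assumes perfectly linear scaling across GPUs, which ignores PCIe transfers and host-side work, so the result is an upper bound rather than a prediction. The ~240 and ~165 GFlop/s inputs are the approximate values read off the plot; the function name is hypothetical.

```python
GPU_GFLOPS = 240.0  # one C2050, approximate value read off the plot
CPU_GFLOPS = 165.0  # 48-core Istanbul machine, approximate value read off the plot

def ideal_multi_gpu_gflops(per_gpu, n_gpus):
    """Upper bound assuming perfectly linear scaling (an idealization)."""
    return per_gpu * n_gpus

single_speedup = GPU_GFLOPS / CPU_GFLOPS
eight_gpu = ideal_multi_gpu_gflops(GPU_GFLOPS, 8)
print(f"one GPU vs CPU machine: {single_speedup:.2f}x")
print(f"8 GPUs (ideal): {eight_gpu:.0f} GFlop/s, "
      f"{eight_gpu / CPU_GFLOPS:.1f}x the CPU machine")
```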

thanks

eyal

It looks like I can’t open the web page for the source. Is it accessible currently?

Thanks for sharing!

I can download it. Was the Intel 8-core Nehalem system too expensive to buy? Some criticisms: what is the point of running MKL on AMD? Was ACML available? Data with ECC on and off would be useful too.

Yes, the Istanbul code ran on 48 cores, and 8 Fermis would be expected to give about 2 TFlop/s on LU.

Stan

These are all very good points. Regarding the comparison, we just wanted to put the GPU results in the context of a multicore system, and the Istanbul machine that we have happens to have the same theoretical peak as the Fermi. Using ACML may indeed be better, although the conclusion would be the same. It looks like packing so many cores into a homogeneous multicore CPU system makes NUMA effects an issue. ECC was turned off; its effect on performance is about 10%. I assume that in future GPUs the effect of ECC on performance will be reduced.

Stan
