We have developed improved MAGMA BLAS SGEMM and DGEMM routines for Fermi GPUs. The sources are now available through the MAGMA website. The new routines will be part of the upcoming MAGMA 0.3 library release and will also be included in CUBLAS 3.2.
The basic algorithm is described in:
Nath, R., Tomov, S., Dongarra, J., An Improved MAGMA GEMM for Fermi GPUs, University of Tennessee Computer Science Technical Report, UT-CS-10-655 (also LAPACK working note 227), July 29, 2010.
On a C2050 GPU the new DGEMM reaches up to 300 GFlop/s (58% of peak) and the new SGEMM up to 645 GFlop/s (63% of peak). On a GTX480, DGEMM reaches up to 166 GFlop/s and SGEMM up to 844 GFlop/s.
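For readers who want to check the percent-of-peak figures, here is a minimal sketch of the arithmetic. It assumes the usual convention of counting 2*m*n*k floating-point operations for a GEMM, and it assumes theoretical peaks of 515 GFlop/s (double precision) and 1030 GFlop/s (single precision) for the C2050; those peak values are not stated in the post. The function names are made up for illustration and are not part of MAGMA or CUBLAS.

```python
def gemm_flops(m, n, k):
    """Operation count for C = alpha*A*B + beta*C, using the common 2*m*n*k convention."""
    return 2.0 * m * n * k


def gflops(m, n, k, seconds):
    """Achieved rate in GFlop/s for one GEMM call that took `seconds` to run."""
    return gemm_flops(m, n, k) / seconds / 1e9


def percent_of_peak(achieved_gflops, peak_gflops):
    """Achieved rate as a percentage of the theoretical peak."""
    return 100.0 * achieved_gflops / peak_gflops


# Assumed theoretical peaks for the C2050 (not given in the post).
C2050_PEAK_DP = 515.0   # GFlop/s, double precision
C2050_PEAK_SP = 1030.0  # GFlop/s, single precision

# Achieved rates quoted in the post.
dgemm_pct = percent_of_peak(300.0, C2050_PEAK_DP)
sgemm_pct = percent_of_peak(645.0, C2050_PEAK_SP)

print(f"DGEMM: {dgemm_pct:.0f}% of peak")  # DGEMM: 58% of peak
print(f"SGEMM: {sgemm_pct:.0f}% of peak")  # SGEMM: 63% of peak
```

Given a measured runtime, `gflops(m, n, k, seconds)` gives the achieved rate that would go into `percent_of_peak`.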
I can download it. Was the Intel 8-core Nehalem system too expensive to buy? Some criticism: what is the point of running MKL on AMD? Was ACML available? Data with ECC on and off would be useful too.
These are all very good points. Regarding the comparison, we just wanted to put the GPU results in the context of a multicore system, and the Istanbul system that we have happens to have the same theoretical peak as the Fermi. Using ACML may indeed be better, although the conclusion would be the same. It looks like packing so many cores into a homogeneous multicore CPU system makes NUMA effects an issue. ECC was turned off. The effect of ECC on performance is about 10%. I assume that in future GPUs this effect will be reduced.