CULA's Initial Fermi (Tesla C2050) Benchmarks: Plug-and-play double precision performance gains

We think the CUDA community will be pleased to know that the CULA (aka CUDA LAPACK) team has released a few of our double precision Fermi benchmarks on our developer’s blog. We believe these are the first publicly available Tesla C2050 benchmarks for a real-world product.

It’s important to note that these are only “plug and play” benchmarks: we simply installed the new card, updated the drivers, and made a few code changes for compatibility with the new version of NVCC. We have done no additional Fermi tuning, such as making use of the increased shared memory pool, the increased number of threads, explicit cache management, or the ability to launch concurrent kernels. We are sure that our performance will increase even more once we use these new features.

If you have any questions, feel free to ask here or on the CULA forums.

Do you plan to post Fermi 4xx benchmarks too?

I imagine the first bulk of our benchmarks will be on the C2050, as our software is designed specifically for HPC, where the Tesla line of GPUs is more prevalent. I’m sure we’ll have 4XX numbers at some point down the line, but right now we only have C2050 hardware.

Cool, in which bar did you say you found this Tesla card in?

Just kidding, of course. I suppose you heard about the iPhone 4G incident in Redwood City… ;)

Thanks for posting this. Could you add a link to the CULA benchmarks themselves?

Is it possible to do the same for single precision? I know that the big gain is in double precision, but single precision is still very interesting too.

There is a known bug in CUBLAS’s single precision GEMM that is hurting performance by about 30%. Once this is resolved, we’ll post single precision benchmarks.

Okay, I’m not a LAPACK expert, but I can imagine that GEMM is used a lot in LAPACK functions ;)

This is true. It’s why GEMM is by FAR the most important BLAS routine. In fact, a large number of BLAS routines can be implemented entirely (or about 95%) through creative use of GEMM.
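To make that concrete, here is a minimal pure-Python sketch of the idea. It is not CULA or CUBLAS code; the helper names (gemm, gemv_via_gemm, syrk_via_gemm) are made up for illustration, and a real SYRK would only update one triangle of C rather than the full matrix.

```python
def gemm(alpha, A, B, beta, C):
    """Reference GEMM: return alpha*A*B + beta*C for row-major lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    return [
        [alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def transpose(A):
    """Return the transpose of a row-major matrix."""
    return [list(row) for row in zip(*A)]


def gemv_via_gemm(alpha, A, x, beta, y):
    """GEMV (y := alpha*A*x + beta*y) expressed as a GEMM with n-by-1 matrices."""
    X = [[v] for v in x]  # column vector as an n-by-1 matrix
    Y = [[v] for v in y]
    return [row[0] for row in gemm(alpha, A, X, beta, Y)]


def syrk_via_gemm(alpha, A, beta, C):
    """SYRK (C := alpha*A*A^T + beta*C) expressed as a GEMM with B = A^T.

    Note: this computes the full matrix, not just one triangle.
    """
    return gemm(alpha, A, transpose(A), beta, C)


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    zero = [[0, 0], [0, 0]]
    print(gemv_via_gemm(1, A, [1, 1], 0, [0, 0]))
    print(syrk_via_gemm(1, A, 0, zero))
```

The point is only that both routines reduce to a single GEMM call, so any speedup (or slowdown) in GEMM carries straight through to them.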

Sorry for asking the same question - I see on your Developers’ Blog that these numbers are from the “publicly available benchmark suite”, but I don’t see the link.

We have a benchmark tool included with our installer. Check in the “examples/benchmark” folder. There is a pre-compiled version to test against Intel’s MKL, and also source that you can compile and link against any other LAPACK library if you’d like.

There are several performance optimizations for Fermi-based GPUs that we will introduce in cuBLAS and cuFFT in CUDA 3.1 and future releases.

We are still doing a lot of performance tuning in the compiler, driver, and libraries. These changes will be rolled in over the next few CUDA releases.

Sumit


Sumit Gupta
Sr. Manager - Tesla GPU HPC Group
NVIDIA
