Recently I have been testing LINPACK on an HP xw8600 with an NVIDIA Tesla S1070. I use the CUBLAS library, but performance is very low: double-precision floating-point performance is only about 10 GFLOPS, much lower than the peak of 4 TFLOPS. I think using the CUBLAS library alone is not enough, because LINPACK involves many functions beyond the BLAS routines. I am not sure how to run LINPACK properly on the S1070. Please give me some advice.
Well, first of all, CUBLAS runs on only one GPU, so it is using 1/4 of the hardware. Second, double precision is much slower (around 1/10 of single precision).
Yes, you are right. But ~10 GFLOPS is still much lower than 345 × 1/4 ≈ 86 GFLOPS.
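To make the comparison concrete, here is a small sketch of the arithmetic (assumptions: the S1070's double-precision peak of ~345 GFLOPS from this thread, split across its 4 GPUs, and the standard LINPACK flop count of 2/3·n³ + 2·n² for an n×n solve; the function name is mine):

```python
def linpack_gflops(n, seconds):
    """Achieved GFLOPS for solving an n x n dense system in `seconds`,
    using the standard LINPACK operation count 2/3*n^3 + 2*n^2."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

# S1070 double-precision peak is ~345 GFLOPS across 4 GPUs,
# so a single-GPU CUBLAS run should be compared against:
per_gpu_peak = 345.0 / 4          # ~86.25 GFLOPS
achieved = 10.0                   # the result reported above
efficiency = achieved / per_gpu_peak  # only ~12% of one GPU's peak
```

So even accounting for the single-GPU limitation, the reported 10 GFLOPS is roughly an eighth of what one GPU should deliver, which points at something other than raw compute as the bottleneck.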
Well, it might be that you are bandwidth bound, either by on-device memory bandwidth (~100 GB/s) or by CPU→GPU→CPU transfers (probably around 3–6 GB/s for the S1070, depending on your motherboard).
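A quick roofline-style estimate shows how severe that can be (a sketch only; it uses the bandwidth figures quoted above, and assumes a memory-bound double-precision kernel doing roughly 2 flops per 8-byte word, i.e. 0.25 flop/byte):

```python
def bandwidth_ceiling_gflops(intensity_flops_per_byte, bandwidth_gb_s):
    """Maximum achievable GFLOPS for a kernel limited by the given
    bandwidth, at the given arithmetic intensity (roofline model)."""
    return intensity_flops_per_byte * bandwidth_gb_s

# Double-precision kernel at ~0.25 flop/byte:
pcie_limit = bandwidth_ceiling_gflops(0.25, 3.0)      # 0.75 GFLOPS over PCIe
device_limit = bandwidth_ceiling_gflops(0.25, 100.0)  # 25 GFLOPS on-device
```

If the matrices are shipped across PCIe for every BLAS call instead of staying resident on the device, a ceiling in the single-digit GFLOPS range is exactly what you would expect; large blocked DGEMM avoids this because its arithmetic intensity grows with the block size.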
I don’t know what LINPACK does internally; if it is LAPACK functions calling CUBLAS functions, then it could improve by using a parallel LAPACK for the GPU (there is work on CULAPACK, as far as I know).
There is also a lot of work going on in mixed-precision algorithms, which exploit the fast single precision and follow it with double-precision refinement.
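The idea behind those mixed-precision schemes is classic iterative refinement: factorize/solve in cheap single precision, then compute residuals and corrections in double precision. A minimal sketch, on a deliberately tiny 1-variable system a·x = b so it runs anywhere (the `to_f32` rounding helper simulates single precision; function names are mine, not from any library):

```python
import struct

def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def refine(a, b, iters=3):
    """Mixed-precision iterative refinement for the scalar system a*x = b:
    the 'factorization' (here just 1/a) is held in single precision,
    residuals and updates are computed in double precision."""
    inv_a = to_f32(1.0 / to_f32(a))   # cheap single-precision "factorize"
    x = to_f32(b) * inv_a             # initial single-precision solve
    for _ in range(iters):
        r = b - a * x                 # residual in double precision
        x = x + inv_a * r             # correct using the single-precision factor
    return x
```

Each pass multiplies the error by roughly the single-precision rounding error (~6e-8), so a couple of iterations recover full double accuracy while the expensive part ran in fast single precision; the real GPU versions do the same with an SGETRF factorization and double-precision residuals.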
What is your problem size and block size?
I have done some work on Linpack (the results will be presented at the GPGPU2 conference in March): a single workstation with a quad-core CPU and a C1060 can achieve 90 GFlops, a server connected to an S1070 about 250 GFlops, and a cluster with 16 GPUs well above 1 TFlops.
How many of those 90 GFLOPS come from the C1060? It must be running very close to its 78 GFLOPS peak for the system to reach 90!