Recently I have been testing LINPACK on an HP xw8600 with an NVIDIA Tesla S1070. I use the CUBLAS library, but performance is very low: double-precision floating-point performance is only about 10 GFLOPS, much lower than the peak of 4 TFLOPS. I think using the CUBLAS library alone is not enough, because LINPACK involves many functions beyond the BLAS routines. I am not sure how to run LINPACK properly on the S1070. Please give me some advice.
Well, first of all, CUBLAS runs on only one GPU, so it is using 1/4 of the hardware. Second, double precision is much slower (around 1/10 of single precision).
Yes, you are right. But ~10 GFLOPS is still much lower than 345 × 1/4 ≈ 86 GFLOPS.
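To make the comparison concrete, here is a small sketch of the arithmetic (assumptions: the S1070's double-precision peak of ~345 GFLOPS from this thread, split across its 4 GPUs, and the standard LINPACK flop count of 2/3·n³ + 2·n² for an n×n solve; the function name is mine):

```python
def linpack_gflops(n, seconds):
    """Achieved GFLOPS for solving an n x n dense system in `seconds`,
    using the standard LINPACK operation count 2/3*n^3 + 2*n^2."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

# S1070 double-precision peak is ~345 GFLOPS across 4 GPUs,
# so a single-GPU CUBLAS run should be compared against:
per_gpu_peak = 345.0 / 4          # ~86.25 GFLOPS
achieved = 10.0                   # the result reported above
efficiency = achieved / per_gpu_peak  # only ~12% of one GPU's peak
```

So even accounting for the single-GPU limitation, the reported 10 GFLOPS is roughly an eighth of what one GPU should deliver, which points at something other than raw compute as the bottleneck.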
Well, it might be that you are bandwidth bound, either by on-device memory bandwidth (~100 GB/s) or by CPU→GPU→CPU transfers (probably around 3–6 GB/s for the S1070, depending on your motherboard).
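A quick roofline-style estimate shows how severe that can be (a sketch only; it uses the bandwidth figures quoted above, and assumes a memory-bound double-precision kernel doing roughly 2 flops per 8-byte word, i.e. 0.25 flop/byte):

```python
def bandwidth_ceiling_gflops(intensity_flops_per_byte, bandwidth_gb_s):
    """Maximum achievable GFLOPS for a kernel limited by the given
    bandwidth, at the given arithmetic intensity (roofline model)."""
    return intensity_flops_per_byte * bandwidth_gb_s

# Double-precision kernel at ~0.25 flop/byte:
pcie_limit = bandwidth_ceiling_gflops(0.25, 3.0)      # 0.75 GFLOPS over PCIe
device_limit = bandwidth_ceiling_gflops(0.25, 100.0)  # 25 GFLOPS on-device
```

If the matrices are shipped across PCIe for every BLAS call instead of staying resident on the device, a ceiling in the single-digit GFLOPS range is exactly what you would expect; large blocked DGEMM avoids this because its arithmetic intensity grows with the block size.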
I don’t know what LINPACK does internally; if it is LAPACK functions calling CUBLAS functions, then it could improve by using a parallel LAPACK for the GPU (there is work on CULAPACK, as far as I know).
There is also a lot of work going on in mixed-precision algorithms, which exploit the fast single precision and follow it with double-precision refinement.
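The idea behind those mixed-precision schemes is classic iterative refinement: factorize/solve in cheap single precision, then compute residuals and corrections in double precision. A minimal sketch, on a deliberately tiny 1-variable system a·x = b so it runs anywhere (the `to_f32` rounding helper simulates single precision; function names are mine, not from any library):

```python
import struct

def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def refine(a, b, iters=3):
    """Mixed-precision iterative refinement for the scalar system a*x = b:
    the 'factorization' (here just 1/a) is held in single precision,
    residuals and updates are computed in double precision."""
    inv_a = to_f32(1.0 / to_f32(a))   # cheap single-precision "factorize"
    x = to_f32(b) * inv_a             # initial single-precision solve
    for _ in range(iters):
        r = b - a * x                 # residual in double precision
        x = x + inv_a * r             # correct using the single-precision factor
    return x
```

Each pass multiplies the error by roughly the single-precision rounding error (~6e-8), so a couple of iterations recover full double accuracy while the expensive part ran in fast single precision; the real GPU versions do the same with an SGETRF factorization and double-precision residuals.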
What is your problem size and block size?
I have done some work on Linpack (the results will be presented at the GPGPU2 conference in March): a single workstation with a quad-core CPU and a C1060 can achieve 90 GFlops, a server connected to an S1070 about 250 GFlops, and a cluster with 16 GPUs well above 1 TFlops.
How many of those 90 GFLOPS come from the C1060? It must be running very close to its 78 GFLOPS peak for the system to reach 90!