BLAS Operations: Performance of CUBLAS Operations

sicb0161 · August 20, 2007, 11:36am

NVIDIA GPU: G80 GTS; Intel CPU: Core 2 Duo E6600, 2.4 GHz, 2 GB RAM

Windows XP, latest NVIDIA drivers, Matlab 2007a

I would like someone to confirm the maximum performance I am measuring. These figures exclude transfer time and cover computation time only:

Matrix dimension N = 5000

            NVIDIA        Matlab
BLAS 1      50 MFlops     2 GFlops
BLAS 2      12 GFlops     9 GFlops
BLAS 3      65 GFlops     10 GFlops

Why are the BLAS 1 operations so slow?
Can we achieve higher performance if we implement them ourselves using the CUDA Runtime API?

thx, cem

mfatica · August 20, 2007, 8:01pm

Using CUDA 1.0, 8800 GTX, and the CUDA profiler, we get the following data for BLAS1:

n = 5000    aligned                      unaligned

SAXPY       4.352 usec → 2.30 GFLOPS     12.352 usec → 0.81 GFLOPS
SDOT        6.432 usec → 1.55 GFLOPS
SSCAL       3.520 usec → 1.42 GFLOPS
ISAMAX      8.960 usec → 0.56 GFLOPS
SCOPY       3.904 usec → n/a
SSWAP       4.768 usec → n/a
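
For anyone who wants to check the arithmetic, here is a minimal Python sketch of how the GFLOPS column can be derived from the profiler times. The flop counts per element are my assumption (2 for SAXPY and SDOT, 1 for SSCAL and ISAMAX); they are not stated in the post, but they reproduce the figures above. The names `FLOPS_PER_ELEMENT` and `gflops` are just illustrative.

```python
# Assumed flop counts per vector element (not stated in the post):
# SAXPY and SDOT do one multiply and one add, SSCAL does one multiply,
# and ISAMAX is counted as one operation per element.
FLOPS_PER_ELEMENT = {"SAXPY": 2, "SDOT": 2, "SSCAL": 1, "ISAMAX": 1}


def gflops(routine, n, usec):
    """Return GFLOPS for `routine` on n elements that ran for `usec` microseconds."""
    flops = FLOPS_PER_ELEMENT[routine] * n
    seconds = usec * 1e-6
    return flops / seconds / 1e9


# Aligned timings for n = 5000, copied from the table above.
timings_usec = {"SAXPY": 4.352, "SDOT": 6.432, "SSCAL": 3.520, "ISAMAX": 8.960}

for name, usec in timings_usec.items():
    print(f"{name:7s} {gflops(name, 5000, usec):.2f} GFLOPS")
```

SCOPY and SSWAP only move data, so no flop rate is given for them.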

For short vectors, the G80 cannot hide memory latency.
For the CPU case, with short vectors like these, the data is likely to be in the cache.

The G80 needs very long vectors (e.g. 5M elements) to saturate the available memory bandwidth.
At 5M elements, SAXPY reaches 10.74 GFLOPS, 64.47 GB/sec on the 8800 GTX.
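
To see how the GFLOPS and GB/sec figures relate, here is a rough Python sketch. It assumes single-precision SAXPY reads x, reads y, and writes y, i.e. 12 bytes of traffic and 2 flops per element; that traffic model is my assumption, not something stated above. The function names are illustrative only.

```python
BYTES_PER_FLOAT = 4  # single precision


def saxpy_bandwidth_from_time(n, seconds):
    """Effective GB/sec for SAXPY on n elements: read x, read y, write y."""
    bytes_moved = 3 * BYTES_PER_FLOAT * n
    return bytes_moved / seconds / 1e9


def saxpy_bandwidth_from_gflops(gflops):
    """Effective GB/sec implied by a SAXPY flop rate (12 bytes per 2 flops)."""
    return gflops * 3 * BYTES_PER_FLOAT / 2


# Short vector: n = 5000 in 4.352 usec (aligned case).
print(f"n=5000: {saxpy_bandwidth_from_time(5000, 4.352e-6):.2f} GB/sec")

# Long vector: 10.74 GFLOPS reported at 5M elements.
print(f"n=5M:   {saxpy_bandwidth_from_gflops(10.74):.2f} GB/sec")
```

Under this model the short-vector run moves data at roughly 14 GB/sec, while the 5M-element run works out to about 64.4 GB/sec, which agrees with the quoted 64.47 GB/sec up to rounding of the GFLOPS figure. In other words, BLAS 1 on the GPU is bound by memory bandwidth, not by arithmetic.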
