Performance of CUBLAS Operations

NVIDIA GPU: G80 GTS; Intel CPU: Core 2 Duo E6600, 2.4 GHz, 2 GB RAM

Windows XP, latest NVIDIA drivers, MATLAB 2007a

I would like to have my maximum performance figures confirmed. They exclude transfer time and cover computation time only:

Matrix dimension N = 5000

          NVIDIA        MATLAB
BLAS 1    50 MFLOPS     2 GFLOPS
BLAS 2    12 GFLOPS     9 GFLOPS
BLAS 3    65 GFLOPS     10 GFLOPS

Why are the BLAS 1 operations so slow?
Can we achieve higher performance by implementing them ourselves with the CUDA Runtime API?

thx, cem

Using CUDA 1.0, 8800 GTX, and the CUDA profiler, we get the following data for BLAS1:

n = 5000    aligned                        unaligned

SAXPY       4.352 usec -> 2.30 GFLOPS      12.352 usec -> 0.81 GFLOPS
SDOT        6.432 usec -> 1.55 GFLOPS
SSCAL       3.520 usec -> 1.42 GFLOPS
ISAMAX      8.960 usec -> 0.56 GFLOPS
SCOPY       3.904 usec -> n/a
SSWAP       4.768 usec -> n/a
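
For anyone who wants to check the arithmetic, here is a small Python sketch of how the GFLOPS column follows from the profiler times. The per-element operation counts are an assumption inferred from the figures above (2 for SAXPY and SDOT, 1 for SSCAL and ISAMAX); the post does not state how they were counted. SCOPY and SSWAP do no floating-point arithmetic, which is why they are listed as n/a.

```python
# Floating-point operations per vector element.
# Assumption: these counts are inferred from the posted numbers,
# not stated in the original measurements.
FLOPS_PER_ELEMENT = {"SAXPY": 2, "SDOT": 2, "SSCAL": 1, "ISAMAX": 1}


def gflops(routine, n, usec):
    """Convert a kernel time in microseconds into GFLOPS for a vector of length n."""
    flops = FLOPS_PER_ELEMENT[routine] * n
    seconds = usec * 1e-6
    return flops / seconds / 1e9


# Aligned timings from the table, n = 5000.
timings_usec = {"SAXPY": 4.352, "SDOT": 6.432, "SSCAL": 3.520, "ISAMAX": 8.960}

for name, t in timings_usec.items():
    print(f"{name}: {gflops(name, 5000, t):.2f} GFLOPS")
```

The same function applied to the unaligned SAXPY time, gflops("SAXPY", 5000, 12.352), gives roughly 0.81 GFLOPS.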

For short vectors, the G80 cannot hide memory latency.
For the CPU case, with short vectors like these, the data is likely to be in the cache.

G80 needs very long vectors (e.g. 5M elements) to exhaust the available bandwidth.
At 5M elements, SAXPY reaches 10.74 GFLOPS, 64.47 GB/sec on the 8800GTX.
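
To make the bandwidth point concrete, here is a rough Python sketch comparing the effective memory bandwidth of SAXPY at the two sizes. It assumes single-precision data (4 bytes per element), three memory accesses per element (read x, read y, write y), and 1 GB = 1e9 bytes. The 5M-element run time is not given above, so it is back-computed from the 10.74 GFLOPS figure.

```python
BYTES_PER_FLOAT = 4  # single precision, as in SAXPY


def saxpy_bytes(n):
    """Bytes moved by y = a*x + y: read x, read y, write y (assumed traffic model)."""
    return 3 * BYTES_PER_FLOAT * n


def effective_bandwidth_gb_s(n, seconds):
    """Effective bandwidth in GB/s, taking 1 GB = 1e9 bytes (an assumption)."""
    return saxpy_bytes(n) / seconds / 1e9


# Short vector: n = 5000, measured 4.352 usec (aligned case).
bw_small = effective_bandwidth_gb_s(5000, 4.352e-6)

# Long vector: 5M elements at 10.74 GFLOPS; time inferred as 2n flops / rate.
n_large = 5_000_000
t_large = 2 * n_large / 10.74e9
bw_large = effective_bandwidth_gb_s(n_large, t_large)

print(f"n = 5000: {bw_small:.2f} GB/s")
print(f"n = 5M:   {bw_large:.2f} GB/s")
```

Under these assumptions the short vector reaches only about 13.8 GB/s, while the long vector comes out near 64.4 GB/s, in line with the 64.47 GB/sec quoted above (the small gap is presumably rounding in the GFLOPS figure). Since SAXPY moves 6 bytes per floating-point operation in this model, its GFLOPS rate is capped at one sixth of the achieved bandwidth, which is why BLAS1 cannot approach the BLAS3 numbers.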