CUDA SDK examples benchmarks

I don't have an 8800 yet, and I can't find timing results for the SDK examples.
Can somebody post a link to them, or the results themselves?


I need to evaluate GPU speed versus a CPU implementation.

Thank you.

Hi. I was interested in sdot and saxpy, so I did some benchmarking; see the attached files. The benchmark covers a Core 2 Duo 2.4GHz in a desktop with 2GB of PC2-6400, a P4 (Prescott 2M) 3GHz also with 2GB of PC2-6400, and a Core Duo (Yonah) 1.66GHz in a laptop with 2GB of PC2-5300. The actual memory bandwidth is around 5GB/s for the Core 2 Duo and P4, and below 4GB/s for the Core Duo. The sdot and saxpy runs use CUBLAS on the GPU and Intel Compiler 9.1 + MKL 9.0 on the CPU. Note that for sdot and saxpy, the number of elements in the vectors is a multiple of 2560. For other values there is a VERY HIGH drop in performance (up to 75%). I also have an EVGA 768-P2-N835-AR, which is slightly factory overclocked.
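If the multiple-of-2560 behaviour holds on your card too, one possible workaround is to pad the vectors up to the next multiple before calling the BLAS routine. The sketch below is only an illustration in plain C: `round_up` and `pad_copy` are hypothetical helper names, and 2560 is just the value observed in these runs, not a documented CUBLAS constant. Zero padding does not change a dot product, and for saxpy the extra output elements can simply be ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Length granularity observed in the benchmark above (an assumption,
   not a documented CUBLAS property). */
#define SWEET_SPOT 2560

/* Round n up to the next multiple of `multiple` (hypothetical helper). */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    size_t rem = n % multiple;
    return rem == 0 ? n : n + (multiple - rem);
}

/* Copy src[0..n) into dst[0..padded_n) and zero-fill the tail.
   dst must have room for padded_n floats. */
void pad_copy(float *dst, const float *src, size_t n, size_t padded_n)
{
    assert(padded_n >= n);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
    for (size_t i = n; i < padded_n; ++i)
        dst[i] = 0.0f;
}
```

For example, a vector of 100000 elements would be padded to `round_up(100000, SWEET_SPOT)`, which is 102400 elements, before being handed to the library.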

Anyway, for BLAS 1 and 2 functions, which are memory-bound, performance is directly proportional to memory bandwidth. Since the EVGA has a peak of 96GB/sec while the CPU has 6.4GB/sec …
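To put rough numbers on the bandwidth argument, here is a back-of-the-envelope sketch in C. It assumes single-precision data and that every element is moved to or from memory exactly once per call (saxpy: read x, read y, write y, so 12 bytes and 2 flops per element; sdot: read x and y, so 8 bytes and 2 flops). The function names are made up for illustration, and the results are upper bounds from peak bandwidth, not measurements.

```c
#include <assert.h>

/* Upper bound on GFLOP/s for a streaming kernel:
   bandwidth (GB/s) * flops per element / bytes moved per element. */
double bound_gflops(double bandwidth_gbs, double flops_per_elem,
                    double bytes_per_elem)
{
    assert(bytes_per_elem > 0.0);
    return bandwidth_gbs * flops_per_elem / bytes_per_elem;
}

/* saxpy, y = a*x + y: 2 loads + 1 store of 4-byte floats, 2 flops. */
double saxpy_bound_gflops(double bandwidth_gbs)
{
    return bound_gflops(bandwidth_gbs, 2.0, 12.0);
}

/* sdot, sum of x[i]*y[i]: 2 loads of 4-byte floats, 2 flops. */
double sdot_bound_gflops(double bandwidth_gbs)
{
    return bound_gflops(bandwidth_gbs, 2.0, 8.0);
}
```

With the peak figures quoted above, this gives at most 16 GFLOP/s for saxpy and 24 GFLOP/s for sdot at 96GB/sec, against roughly 1.07 and 1.6 GFLOP/s at 6.4GB/sec. Either way the ratio is just the bandwidth ratio, 96 / 6.4 = 15.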

PS. Question for others on the board: why is peak performance for sdot and saxpy reached only at multiples of 2560? Are there any such “sweet spots” for other CUBLAS functions?