Hi people!

I tried to measure the speedup of matrixMul from the CUDA SDK samples on a Tesla 1060, wrapping an additional timer around the computeGold function.

I get a 7.2x speedup vs the CPU, which is not enough.

What about all the tales of 10x-100x speedups?

If you want to explore higher-performance matrix multiplication, try using CUBLAS. The SDK sample is not intended to be an optimal implementation; it is a programming/learning example.

If you have come into using CUDA thinking everything must be 100x faster, be prepared to be disappointed.

I guess that algorithms better than O(n^3) exist :rolleyes:, and I believe that the matrix multiplication from CUBLAS will be faster. But what about the speedup of CUBLAS vs BLAS?
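For reference on the "better than O(n^3)" guess: sub-cubic algorithms do exist, the classic one being Strassen's at O(n^log2 7) ≈ O(n^2.807). A rough flop-count comparison (illustrative only, ignoring constant factors and the extra O(n^2) additions that Strassen needs) looks like this:

```python
import math

def classical_flops(n):
    # Classical matrix multiply: n^3 multiplications plus n^3 additions.
    return 2 * n**3

def strassen_flops(n):
    # Strassen recursion does 7 multiplies per halving step,
    # giving roughly n^log2(7) multiplications (constants omitted).
    return n ** math.log2(7)

for n in (1024, 4096):
    print(n, classical_flops(n) / strassen_flops(n))
```

In practice CUBLAS (like most tuned BLAS libraries of that era) uses the classical algorithm, because its regular memory access pattern maps far better onto the hardware.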

If I apply any optimization like loop unrolling in the GPU program, I must do the same in the CPU program…

I have seen some presentations from NVIDIA claiming that GPU programs are 10-100x faster than CPU programs on some tasks. Are they comparing powerful libraries like CUBLAS against naive CPU programs compiled with -O0 :rolleyes: ?

I don’t expect great profit from CUDA on every task; I’m only asking for a simple example, like a vector sum or a scalar product, where I can see a 10-100x speed increase.

What processor, what BLAS, and what precision? On the single-socket, quad-core systems I use, the GT200 with CUBLAS is about twice as fast at DGEMM and 5 times as fast at SGEMM compared to the fastest host BLAS I have access to.

You can’t reach a great speedup on those sorts of tasks - they are not computationally intensive enough and are memory-bandwidth limited. The speedup ends up being close to the ratio of GPU memory bandwidth to host CPU memory bandwidth, which is normally less than 10x.
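The back-of-envelope model behind that claim can be sketched as follows (the bandwidth figures are my own illustrative assumptions, not measurements from this thread): a vector sum or dot product touches each element only once, so both the CPU and the GPU spend essentially all their time streaming data from memory.

```python
# For a purely bandwidth-limited kernel, runtime ~ bytes_moved / bandwidth,
# so the best achievable speedup is just the bandwidth ratio.

def streaming_speedup(gpu_bw_gbs, cpu_bw_gbs):
    # Both devices move the same number of bytes; compute cost is negligible.
    return gpu_bw_gbs / cpu_bw_gbs

# Assumed figures (illustrative only): ~100 GB/s for a GT200-class GPU,
# ~12 GB/s for a contemporary host system.
print(streaming_speedup(100.0, 12.0))  # roughly 8x, well short of 100x
```

GEMM escapes this limit because it performs O(n^3) arithmetic on O(n^2) data, so with blocking each loaded element is reused many times and the kernel becomes compute-bound rather than bandwidth-bound.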

OK, thanks, I now understand the memory bandwidth limit; it really is so :((

I have tried comparing CUBLAS SGEMM vs BLAS and reached a 22x speedup on a GeForce 9600! :rolleyes:

Matrix multiplication is computationally intensive, so CUBLAS is faster than BLAS.

However, you may need to pay attention to the dimensions of the matrices you use.

The performance of CUBLAS is not uniform; it depends on the matrix dimensions.
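One practical consequence of that non-uniformity: CUBLAS SGEMM on hardware of this era tends to run fastest when the matrix dimensions are a multiple of the internal blocking size (commonly 16 - an assumption here, check the CUBLAS documentation for your version). A hypothetical helper for rounding a dimension up before zero-padding could look like:

```python
def pad_dim(n, multiple=16):
    # Round n up to the next multiple of `multiple`, e.g. to zero-pad
    # matrices so SGEMM hits its fast path (hypothetical helper, not
    # part of CUBLAS itself).
    return ((n + multiple - 1) // multiple) * multiple

print(pad_dim(1000))   # -> 1008
print(pad_dim(1024))   # -> 1024
```

Benchmarking a sweep of sizes around your target dimension is the reliable way to see where the fast and slow cases fall on your particular GPU.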

Yeah, but which BLAS, what problem size, run on what processor, with what kind and how much memory, and how many host threads?