Why is matrixMul from the SDK samples so slow?

Hi people!
I tried to measure the speedup of matrixMul from the CUDA SDK samples on a Tesla C1060, wrapping an additional timer around the computeGold function.
I get a 7.2x speedup vs. the CPU, which is not enough. :confused:
What about those tales of 10x-100x speedups?

If you want to explore higher-performance matrix multiplication, try using CUBLAS. The SDK sample is not intended to represent an optimal implementation; it is a programming/learning example.
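For reference, here is a minimal sketch of calling CUBLAS SGEMM. I'm showing the modern cublas_v2 API; the SDK-era library used the older cublasInit-style interface, but the idea is the same. Error checking is omitted and the size is just illustrative:

#include <cstdlib>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Compute C = A * B for N x N single-precision matrices with CUBLAS.
// CUBLAS uses column-major storage, like Fortran BLAS.
int main(void)
{
    const int N = 1024;
    const float alpha = 1.0f, beta = 0.0f;
    size_t bytes = (size_t)N * N * sizeof(float);

    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < N * N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // C = alpha * A * B + beta * C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}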

If you have come into using CUDA thinking everything must be 100x faster, be prepared to be disappointed.

I guess there exist algorithms better than O(n^3) :rolleyes:, and I believe the matrix multiplication from CUBLAS will be faster. But what about the speedup of CUBLAS vs. a CPU BLAS?
If I apply an optimization like loop unrolling to the GPU program, I have to do the same to the CPU program for a fair comparison…
I have seen some NVIDIA presentations claiming GPU programs are 10-100x faster than CPU programs on some tasks. Are those comparisons pitting powerful libraries like CUBLAS against naive CPU programs compiled with -O0 :rolleyes:?
I don't expect a great payoff from CUDA on every task; I am only asking for a simple example, like a vector sum or a scalar product, where I can see a 10-100x speed increase.

What processor, what BLAS, and what precision? On the single-socket, quad-core systems I use, the GT200 with CUBLAS is about twice as fast at DGEMM and five times as fast at SGEMM compared to the fastest host BLAS I have access to.
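If you want to make that comparison yourself, the usual approach is to time just the GEMM call with CUDA events and convert to GFLOP/s. A sketch (the function and parameter names here are my own, not from any SDK sample):

#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Time one SGEMM call with CUDA events and report GFLOP/s.
// dA, dB, dC are N x N device buffers that are already allocated.
void time_sgemm(cublasHandle_t handle, const float *dA, const float *dB,
                float *dC, int N)
{
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so one-time setup cost is not measured.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);

    cudaEventRecord(start, 0);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait for the GPU to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // An N x N GEMM performs 2*N^3 floating-point operations.
    double gflops = 2.0 * N * N * N / (ms * 1.0e6);
    printf("N=%d: %.2f ms, %.1f GFLOP/s\n", N, ms, gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}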

You can’t reach a great speedup on those sorts of tasks - they are not computationally intensive enough and are memory bandwidth limited. The speedup ratio ends up being close to the ratio of GPU memory bandwidth to host CPU memory bandwidth in those cases, which is normally less than 10 times.
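To see why, count the memory traffic: a vector add does one flop per element but moves 12 bytes (two 4-byte loads and one 4-byte store), so the kernel below is limited purely by bandwidth. With roughly 100 GB/s on a GT200 board versus roughly 10 GB/s on a typical host of that era (ballpark figures, not measurements), about 10x is the ceiling no matter how you write the kernel:

#include <cuda_runtime.h>

// Bandwidth-bound kernel: one addition per element, but 12 bytes of
// memory traffic (two loads + one store), so memory bandwidth, not
// arithmetic throughput, sets the speed.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Typical launch: vecAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);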

OK, thanks, I understand now - it really is the memory bandwidth limit. :((

I have tried to compare CUBLAS SGEMM vs. a CPU BLAS and reached a 22x speedup on a GeForce 9600! :rolleyes:

Matrix multiplication is compute-intensive, so CUBLAS is faster than a CPU BLAS.

However, you should pay attention to the matrix dimensions you use.

The performance of CUBLAS is not uniform; it depends on the matrix dimensions.
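You can see this by sweeping the size and timing each case. A sketch that reuses the hypothetical time_sgemm helper from the earlier post (the specific sizes are only illustrative; dimensions that are exact multiples of the internal tile size often run noticeably faster than odd ones):

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical timing helper sketched in an earlier post.
void time_sgemm(cublasHandle_t, const float *, const float *, float *, int);

int main(void)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Compare neighbouring sizes: "round" dimensions vs. odd ones.
    const int sizes[] = { 512, 513, 1024, 1025, 2048, 2049 };
    for (int s = 0; s < 6; ++s) {
        int N = sizes[s];
        size_t bytes = (size_t)N * N * sizeof(float);
        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes);
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);
        cudaMemset(dA, 0, bytes);   // contents don't matter for timing
        cudaMemset(dB, 0, bytes);
        time_sgemm(handle, dA, dB, dC, N);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }

    cublasDestroy(handle);
    return 0;
}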

Yeah, but what BLAS, what problem size, run on what processor, with what kind and how much memory, and how many host threads?