CUBLAS - low performance on matrix multiplication

Hi,

I installed CUDA with the latest drivers & SDK and tried the matrix multiplication examples (I am on WinXP x32). I obtained low performance with CUBLAS (see matrixMulDrv.exe): it showed the same performance as the matrix multiplication example without the driver API (see matrixMul.exe). I tried different sizes; for a 2k x 2k matrix, both of them give approximately the same performance of 48 GFLOP/s (only about 17% of peak). Am I doing anything wrong, or is this the expected result?

Thanks,

Artur


Info about my device:

[codebox]CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: “GeForce 9600 GT”

CUDA Driver Version: 3.0

CUDA Runtime Version: 2.30

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 1

Total amount of global memory: 1073414144 bytes

Number of multiprocessors: 8

Number of cores: 64

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 8192

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 262144 bytes

Texture alignment: 256 bytes

Clock rate: 1.50 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: Yes

Integrated: No

Support host page-locked memory mapping: No

Compute mode: Default (multiple host threads can use this device simultaneously)

[/codebox]

If you look at various published results, you will see that at 2k x 2k the GPU is only slightly faster than the host CPU. Try using 10k x 10k to see a difference.

There might be faster solutions than CUBLAS; see for example http://forums.nvidia.com/index.php?showtop…59033&st=20 (but I think this is mainly an improvement for complex numbers).

Sorry for reopening this thread, but I found this post intriguing:

If you try 10k x 10k, you have to allocate 10000*10000 elements * 4 bytes (assuming we are working with float) * 3 (the number of matrices you must allocate). All things considered, you have to allocate 1.2 GB; how is that possible? I am dealing with the same kind of situation, so I would like to know whether what I am saying is right or wrong.

True

Using a card with more than 1.2 GB of memory, perhaps. There have been/are plenty of CUDA-capable cards on the market with 2, 3, 4 or 6 GB of memory. There are also some pretty straightforward ways to do dense matrix multiplication when the “B” and “C” matrices of the standard gemm

C <- alpha*A.B + beta*C

are too large to fit in the available GPU memory. I do gemm operations where the total size of the problem is 6 GB on a 1280 MB consumer GeForce all the time.

I even wrote a library to do it. :)

Does it work with non square matrices?

Of course, although I wouldn’t guarantee that the slicing is always optimal. It’s been a while since I last ran it.