CUBLAS - low performance on matrix multiplication


I installed CUDA with the latest drivers and SDK and tried the matrix multiplication examples (I am on Windows XP, 32-bit). I got low performance with CUBLAS (see matrixMulDrv.exe): it performed the same as the matrix multiplication example without the driver API (see matrixMul.exe). I tried different sizes; for a 2k x 2k matrix, for example, both give approximately the same performance, about 48 GFLOP/s (only 17% of peak). Am I doing anything wrong, or is this the expected result?



Info about my device:

[codebox]CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce 9600 GT"
CUDA Driver Version: 3.0
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 1073414144 bytes
Number of multiprocessors: 8
Number of cores: 64
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.50 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)[/codebox]
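
For anyone checking the numbers, here is a rough sketch of how a percent-of-peak figure can be worked out from the device query values (64 cores, 1.50 GHz) and the measured 48 GFLOP/s. The function names are made up for illustration, and the flops-per-core-per-cycle factor is an assumption: 2 counts one multiply-add per cycle, while 3 also counts the extra multiply often quoted for this GPU generation. The 17% in the question is consistent with the factor of 3, but that is a guess at how it was derived.

```python
def peak_gflops(cores, clock_ghz, flops_per_core_per_cycle):
    """Theoretical single-precision peak in GFLOP/s (hypothetical helper)."""
    return cores * clock_ghz * flops_per_core_per_cycle


def gemm_gflops(n, seconds):
    """Achieved GFLOP/s for an n x n by n x n product: 2*n^3 flops over the run time."""
    return 2.0 * n**3 / seconds / 1e9


measured = 48.0  # GFLOP/s reported in the question
for flops_per_cycle in (2, 3):  # assumed values, see the note above
    peak = peak_gflops(64, 1.5, flops_per_cycle)
    print(f"{flops_per_cycle} flops/cycle: peak {peak:.0f} GFLOP/s, "
          f"efficiency {100 * measured / peak:.1f}%")
```

With the factor of 2 the same measurement works out to 25% of peak, so the efficiency figure depends a lot on which peak is used.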


If you look at various published results, you will see that for a 2k x 2k matrix the GPU is only slightly faster than the host CPU. Try 10k x 10k to see a difference.

There might be faster solutions than CUBLAS; see for example…59033&st=20 (but I think this is mainly an improvement for complex numbers).

Sorry for reopening this thread, but I found this post intriguing:

If you try 10k x 10k, you have to allocate 10000*10000 elements * 4 bytes (assuming we are working with float) * 3 (the number of matrices you must allocate). All told, that is 1.2 GB; how is that possible? I am dealing with the same kind of situation, so I would like to know whether what I am saying is right or wrong.
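
The arithmetic in this question can be checked with a short sketch. The helper names are hypothetical; the capacity is the "Total amount of global memory" value from the device query earlier in the thread. Treat the result as an upper bound, since in practice not all of the global memory is available to the application.

```python
import math

BYTES_PER_FLOAT = 4  # single precision


def gemm_footprint_bytes(n, matrices=3, elem_bytes=BYTES_PER_FLOAT):
    """Bytes needed to hold `matrices` dense n x n matrices."""
    return matrices * n * n * elem_bytes


def largest_square_n(capacity_bytes, matrices=3, elem_bytes=BYTES_PER_FLOAT):
    """Largest n such that `matrices` n x n matrices fit in capacity_bytes."""
    return math.isqrt(capacity_bytes // (matrices * elem_bytes))


device_bytes = 1073414144  # from the device query output in this thread
need = gemm_footprint_bytes(10_000)

print(need)                            # 1200000000 bytes, i.e. 1.2 GB
print(need > device_bytes)             # True: three 10k x 10k matrices do not fit
print(largest_square_n(device_bytes))  # largest square size that fits all at once
```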


Using a card with more than 1.2 GB of memory, perhaps. There have been, and still are, plenty of CUDA-capable cards on the market with 2, 3, 4 or 6 GB of memory. There are also some pretty straightforward ways to do dense matrix multiplication when the “B” and “C” matrices of the standard gemm

C <- alpha*A.B + beta*C

are too large to fit in the available GPU memory. I do gemm operations where the total size of the problem is 6 GB on a 1280 MB consumer GeForce all the time.
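
One way such slicing can work, as a minimal sketch rather than the library mentioned in this thread: with the standard BLAS update C <- alpha*A*B + beta*C, each column of C depends only on A and the matching column of B, so B and C can be cut into column panels and processed one panel at a time. The version below runs on the CPU in plain Python just to show the bookkeeping. The function names and the `panel_cols` parameter are made up for illustration, and the comments mark where a GPU version would copy data and call something like `cublasSgemm`.

```python
def gemm(alpha, A, B, beta, C):
    """Reference C <- alpha*A*B + beta*C on lists of rows; updates C in place."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * acc + beta * C[i][j]


def sliced_gemm(alpha, A, B, beta, C, panel_cols):
    """Same update, done one column panel of B and C at a time.

    Only A plus one panel of B and one panel of C are needed per step,
    which is what would have to be resident on the device in a GPU version.
    """
    n = len(B[0])
    for start in range(0, n, panel_cols):
        stop = min(start + panel_cols, n)
        B_panel = [row[start:stop] for row in B]  # stand-in for host-to-device copy
        C_panel = [row[start:stop] for row in C]  # stand-in for host-to-device copy
        gemm(alpha, A, B_panel, beta, C_panel)    # stand-in for one gemm call on the GPU
        for row, panel_row in zip(C, C_panel):    # stand-in for device-to-host copy
            row[start:stop] = panel_row


# Non-square example: A is 2 x 3, B is 3 x 5, C is 2 x 5.
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[1.0, 0.0, 2.0, 1.0, 3.0],
     [0.0, 1.0, 1.0, 2.0, 0.0],
     [2.0, 1.0, 0.0, 1.0, 1.0]]
C_full = [[1.0] * 5 for _ in range(2)]
C_sliced = [[1.0] * 5 for _ in range(2)]

gemm(2.0, A, B, 0.5, C_full)
sliced_gemm(2.0, A, B, 0.5, C_sliced, panel_cols=2)
print(C_sliced == C_full)  # True: both paths do the same arithmetic per element
```

This sketch assumes A fits in memory on its own. If it does not, the rows of A and C can be sliced the same way.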

I even wrote a library to do it. :)

Does it work with non-square matrices?

Of course. Although I wouldn’t guarantee that the slicing is always optimal. It’s been a while since I tried running it.