Hi,
I installed CUDA with the latest drivers and SDK and tried the matrix multiplication examples (I have Windows XP 32-bit). I got low performance with CUBLAS (see matrixMulDrv.exe). It showed the same performance as the matrix multiplication example without the driver API (see matrixMul.exe). I tried different sizes; for example, with 2k x 2k matrices both give approximately the same performance, about 48 GFLOP/s, which is only about 17% of the theoretical peak. Am I doing anything wrong, or is this the expected result?
Thanks,
Artur
Info about my device:
[codebox]CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce 9600 GT"
CUDA Driver Version: 3.0
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 1073414144 bytes
Number of multiprocessors: 8
Number of cores: 64
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.50 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)
[/codebox]
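As a sanity check on the 17% figure, here is a rough sketch of the arithmetic using the values from the device query above (64 cores, 1.50 GHz). The count of 3 flops per core per cycle (dual-issue MAD + MUL) is an assumption; counting only the MAD (2 flops) gives a lower peak and a higher percentage. The function names are made up for illustration, and the 2*n^3 flop count is the usual convention for a dense n x n matrix multiply.

```python
def gemm_gflops(n, seconds):
    """Achieved GFLOP/s for an n x n by n x n multiply.

    Uses the conventional count of 2*n^3 floating-point operations
    (one multiply and one add per inner-product term).
    """
    return 2.0 * n**3 / seconds / 1e9


def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak: cores * clock (GHz) * flops per core per cycle."""
    return cores * clock_ghz * flops_per_cycle


# Values taken from the device query output.
CORES = 64
CLOCK_GHZ = 1.50

# Assumption: 3 flops/cycle (MAD + MUL) vs. 2 flops/cycle (MAD only).
peak_mad_mul = peak_gflops(CORES, CLOCK_GHZ, 3)
peak_mad = peak_gflops(CORES, CLOCK_GHZ, 2)

measured = 48.0  # GFLOP/s reported by both samples

eff_mad_mul = measured / peak_mad_mul
eff_mad = measured / peak_mad

print(f"peak (MAD+MUL): {peak_mad_mul:.0f} GFLOP/s, efficiency {eff_mad_mul:.1%}")
print(f"peak (MAD only): {peak_mad:.0f} GFLOP/s, efficiency {eff_mad:.1%}")
```

With the MAD + MUL assumption the peak comes out at 288 GFLOP/s, so 48 GFLOP/s is roughly 17% of it; with MAD only the peak is 192 GFLOP/s and the same measurement is 25%.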