GeForce 9500 with cuda BLAS

Hi all,
Can I use the CUBLAS library on a GeForce 9500 GPU?

Thank you!


I am using 2 GPUs (Tesla C1060 and GeForce 9500).

I want to use both with CUBLAS.

Is it possible?

Thank you!

If you are asking whether CUBLAS can automatically use both GPUs within a single BLAS function call, the answer is no. You can, however, use both GPUs with CUBLAS, but only within separate CUDA contexts. So if you want to do multi-GPU work with CUBLAS, you need to write the host-side code to do it yourself.

Thank you!

Here is what I did:


dgemm(…) // CUBLAS on device 0 (Tesla)

It gave me the correct result.

Then I tried changing the 0 to 1:


dgemm(…) // CUBLAS on device 1 (GeForce 9500)

It did not calculate anything at all,

and no error was reported.

What is wrong?

Thank you!

Your 9500GT doesn’t support double precision floating point arithmetic. If you try calling sgemm instead, it should work.

Thank you for your instant reply!

I will try it.

Thank you so much!

I tried sgemm on the 9500 GT; it ran and gave correct results.

What is the single-precision peak performance of the 9500 GT?

I ask because my CPU (Intel Xeon W3520) with 4 threads is faster than the 9500 GT

when I run sgemm on both.

Would you please show me how to calculate the peak performance of the 9500 GT?

Thank you in advance!


hey Axida,

theoretical GFLOPS = shader core clock (GHz) * (# of cores) * 2
(the “2” is because a MAD counts as two floating-point operations; technically the gpu can issue a MAD & a MUL together, but in practice that doesn’t really happen)

e.g. for the 9500 GT, maybe yours is like
1.4 GHz * 32 cores * 2 = 89.6 GFLOPS

i don’t know what the theoretical single-precision GFLOPS of your CPU is, or how close to theoretical your CPU BLAS gets (cuBLAS 2.0 reaches ~60% of GPU theoretical)
i think it would seem reasonable for your quad core to beat the 9500 GT… (but i’m pretty new to gpus too :) )
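The peak formula above can be sketched in a few lines of Python. This is only an illustration: the function name `peak_gflops` is made up here, the 1.4 GHz clock and 32 cores are the guessed 9500 GT figures from the post, and the "3 flops per cycle" case assumes the MAD + MUL dual issue mentioned above actually happens every cycle, which it usually does not.

```python
def peak_gflops(clock_ghz, cores, flops_per_core_per_cycle=2):
    """Theoretical peak in GFLOPS: clock (GHz) * cores * flops per core per cycle."""
    return clock_ghz * cores * flops_per_core_per_cycle


# 9500 GT figures guessed in the post above (not measured values).
gt9500_mad = peak_gflops(1.4, 32)         # MAD only: 2 flops per core per cycle
gt9500_mad_mul = peak_gflops(1.4, 32, 3)  # assumed MAD + MUL dual issue: 3 flops

print(f"MAD only:  {gt9500_mad:.1f} GFLOPS")
print(f"MAD + MUL: {gt9500_mad_mul:.1f} GFLOPS")
```

Check the shader clock and core count of your own card before trusting the number; board vendors ship different clocks.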

Axida, how fast is the dgemm on your cpu vs the tesla card? (also which blas are you using on cpu)

I am using dgemm from ATLAS on my CPU. The execution time for matrix sizes M=10000, N=10000, K=10000 was 48.889449 sec.

2 x M x N x K / 48.889449 sec = 40.9 GFlops (right?)

The peak performance of the CPU is 42.72 GFlops (4 flops per cycle x 4 cores x 2.67 GHz).

Does that mean I got about 96% of peak performance?

The execution time on the Tesla with the same matrix sizes is 31.802184 sec,

which gives 2 x M x N x K / 31.802184 sec = 62.9 GFlops.

The peak performance of the Tesla is 78 GFlops (30 double-precision units x 2 flops per MAD x 1.30 GHz).

Does that mean I got about 81% of peak performance?

Is that right?

Thank you!
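For anyone checking the arithmetic, here is a small Python sketch that recomputes the achieved GFLOPS and the fraction of peak from the timings quoted above. The helper names `gemm_gflops` and `efficiency` are invented for this example; the peak values are the ones given in the post, and the 2*M*N*K flop count is the usual convention for a matrix multiply.

```python
def gemm_gflops(m, n, k, seconds):
    """Achieved GFLOPS for an M x N x K matrix multiply (counted as 2*M*N*K flops)."""
    return 2.0 * m * n * k / seconds / 1e9


def efficiency(achieved_gflops, peak_gflops):
    """Fraction of theoretical peak that was actually reached."""
    return achieved_gflops / peak_gflops


M = N = K = 10000

# Timings and peak figures as quoted in the post above.
cpu_gflops = gemm_gflops(M, N, K, 48.889449)    # ATLAS dgemm on the Xeon W3520
tesla_gflops = gemm_gflops(M, N, K, 31.802184)  # CUBLAS dgemm on the Tesla C1060
cpu_peak = 4 * 4 * 2.67      # flops per cycle * cores * GHz
tesla_peak = 30 * 2 * 1.30   # DP units * flops per MAD * GHz

print(f"CPU:   {cpu_gflops:.1f} GFLOPS, {efficiency(cpu_gflops, cpu_peak):.1%} of peak")
print(f"Tesla: {tesla_gflops:.1f} GFLOPS, {efficiency(tesla_gflops, tesla_peak):.1%} of peak")
```

Note that a GPU timing may or may not include host-to-device transfers depending on where the timer is placed, so the two numbers are only comparable if both were measured the same way.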

…well to me it looks right (but like i said i’m new)

hmmm, according to Volkov, V., and Demmel, J. W., “Benchmarking GPUs to tune dense linear algebra”,
Volkov was able to achieve 97% of peak with dgemm… which i’m assuming later became part of cuBLAS 2.0
but it seems you only achieved ~81%…

I still do not know how to calculate the single precision peak performance of my CPU (Xeon W3520).
Any help?


It should be double the double precision value, i.e. 8 flops per core per cycle * 2.67 GHz * 4 cores = 85.44 GFLOP/s single-precision peak.
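A rough way to see where the factor of two comes from, sketched in Python. The assumption here (not stated in the thread) is that each core can issue one 128-bit SSE add and one 128-bit SSE multiply per cycle, so the flops per cycle depend only on how many elements fit in a vector register. The function names are made up for the example.

```python
SSE_REGISTER_BITS = 128  # assumption: 128-bit SSE vector registers


def flops_per_cycle(element_bits, register_bits=SSE_REGISTER_BITS, vector_units=2):
    """Flops per core per cycle: elements per vector * vector units (add + multiply)."""
    return (register_bits // element_bits) * vector_units


def cpu_peak_gflops(element_bits, cores=4, clock_ghz=2.67):
    """Theoretical CPU peak in GFLOPS for the given element width in bits."""
    return flops_per_cycle(element_bits) * cores * clock_ghz


dp_peak = cpu_peak_gflops(64)  # double precision: 2 elements per vector
sp_peak = cpu_peak_gflops(32)  # single precision: 4 elements per vector

print(f"double precision: {flops_per_cycle(64)} flops/cycle, {dp_peak:.2f} GFLOPS")
print(f"single precision: {flops_per_cycle(32)} flops/cycle, {sp_peak:.2f} GFLOPS")
```

Halving the element width doubles the elements per register, which is why the single-precision peak is twice the double-precision one under this model.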

I see.

Thank you so much!