GeForce 9500 with cuda BLAS

Axida · December 11, 2009, 5:30am

Hi all,
Can I use the CUDA BLAS library on a GeForce 9500 GPU?

Thank you!

avidday · December 11, 2009, 8:12am

Yes.

Axida · December 11, 2009, 9:43am

I am using 2 GPUs (Tesla C1060 and GeForce 9500).

I want to use both with cudaBLAS.

Is it possible?

Thank you!

avidday · December 11, 2009, 10:12am

If you are asking whether CUBLAS can automatically use both GPUs within a single BLAS function call, the answer is no. You can, however, use both GPUs with CUBLAS, but only within separate CUDA contexts. So if you want to do multi-gpu with CUBLAS, you need to write host side code to do it yourself.
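One way to do the host-side split mentioned above is to give each GPU a contiguous block of columns of B and C, since C[:, j0:j1] = A * B[:, j0:j1] can be computed independently of the other columns. The sketch below shows only the partitioning arithmetic. The names `ColRange` and `split_columns` and the weights are made up for illustration and are not part of CUBLAS. Each range would still need its own CUDA context, its own copy of A, and its own gemm call.

```c
#include <assert.h>

/* Half-open column range [begin, end) of B and C assigned to one device. */
typedef struct {
    int begin;
    int end;
} ColRange;

/* Split n columns between two devices in proportion weight0 : weight1.
   The weights are hypothetical, e.g. each device's measured GFLOP/s. */
void split_columns(int n, double weight0, double weight1,
                   ColRange *r0, ColRange *r1)
{
    assert(n >= 0 && weight0 >= 0.0 && weight1 >= 0.0);
    assert(weight0 + weight1 > 0.0);

    /* Round device 0's share to the nearest whole column. */
    int n0 = (int)(n * weight0 / (weight0 + weight1) + 0.5);

    r0->begin = 0;
    r0->end = n0;
    r1->begin = n0;   /* device 1 takes whatever is left */
    r1->end = n;
}
```

For example, with n = 10000 and weights 3 : 1, device 0 would get columns [0, 7500) and device 1 would get [7500, 10000).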

Axida · December 11, 2009, 1:52pm

Thank you!

What I did is:

cudaSetDevice(0)

dgemm(…) // cuda BLAS on device 0 Tesla

It gave me the correct result.

Then I changed the 0 to 1:

cudaSetDevice(1)

dgemm(…) //cuda BLAS on device 1 Geforce 9500

It did not calculate anything at all, and there was no error message.

What is wrong?

Thank you!

avidday · December 11, 2009, 1:55pm

Your 9500GT doesn’t support double precision floating point arithmetic. If you try calling sgemm instead, it should work.
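To catch this up front, a program can check the compute capability of the selected device before calling a double precision routine. Double precision hardware arrived with compute capability 1.3 (the Tesla C1060), while the 9500 GT is a 1.1 part. The helper below is a minimal sketch in plain C. `supports_double` is a made-up name, and it assumes `major` and `minor` were read from the `cudaDeviceProp` structure filled in by `cudaGetDeviceProperties`.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if a device of compute capability major.minor has double
   precision hardware. Assumes major/minor come from cudaDeviceProp. */
bool supports_double(int major, int minor)
{
    assert(major >= 1 && minor >= 0);
    return major > 1 || (major == 1 && minor >= 3);
}
```

With this check, `supports_double(1, 1)` is false for the 9500 GT and `supports_double(1, 3)` is true for the C1060. The missing error message may also be explained by the CUBLAS API of that era reporting failures through `cublasGetError()` rather than through return values, so an error goes unnoticed unless it is queried explicitly.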

Axida · December 11, 2009, 2:15pm

Thank you for your instant reply!

I will try it.

Thank you so much!

Axida · December 13, 2009, 2:45am

I tried sgemm on the 9500 GT; it ran and gave correct results.

What is the peak single precision performance of the 9500 GT?

I ask because my CPU (Intel Xeon W3520) with 4 threads is faster than the 9500 GT when I run sgemm on both.

Could you please show me how to calculate the peak performance of the 9500 GT?

Thank you in advance!

Axida

Nikolai · December 13, 2009, 6:14am

hey Axida,

theoretical gflops = shader core clock (GHz) * (# of cores) * 2
(the “2” is for MAD, technically the gpu can do MAD & MUL but in practice it doesn’t really happen)

e.g. for 9500 GT, maybe yours is like
1.4 GHz * 32 cores * 2 = 89.6 Gflops
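The same formula as a small C helper, in case it is handy for comparing cards. This is only a sketch of the arithmetic above. `peak_gflops` is an illustrative name, and the clock and core count have to come from the card's spec sheet.

```c
#include <assert.h>

/* Theoretical peak in GFLOP/s:
   clock (GHz) * number of cores * FLOPs per core per cycle.
   A multiply-add (MAD) counts as 2 FLOPs. */
double peak_gflops(double clock_ghz, int cores, int flops_per_cycle)
{
    assert(clock_ghz > 0.0 && cores > 0 && flops_per_cycle > 0);
    return clock_ghz * cores * flops_per_cycle;
}
```

Calling `peak_gflops(1.4, 32, 2)` reproduces the 89.6 Gflops estimate for the 9500 GT.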

i don’t know what your theoretical single precision gflops for CPU is, or how close to theoretical your CPU blas reaches (cuBLAS 2.0 reaches ~60% of GPU theoretical)
i think it would seem reasonable for your quad core to beat the 9500 GT… (but i’m pretty new to gpus too :) )

Axida, how fast is the dgemm on your cpu vs the tesla card? (also which blas are you using on cpu)

Axida · December 13, 2009, 7:13am

I am using dgemm from ATLAS on my cpu. I got 48.889449 sec. execution time for M=10000, N=10000, K=10000 matrix sizes.

2xMxNxK/48.889449=40.9GFlops (right?)

The peak performance of the CPU is 42.72GFlops (4 FLOPs/cycle x 4 cores x 2.67GHz).

Does that mean I got 95% of peak performance?

The execution time of Tesla with the same matrix sizes is 31.802184 Sec.

2xMxNxK/31.802184=62.9GFlops

The peak performance of Tesla is 78GFlops (30 x 2 (MAD) x 1.30GHz).

Does that mean I got 80% of the peak performance?

Is that right?

Thank you!
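The arithmetic in this post can be checked with two small C helpers. A GEMM of size M x N x K does 2*M*N*K floating point operations (one multiply and one add per inner-product term), so dividing by the run time gives the achieved rate. The function names are illustrative only, and the peak values are the ones quoted in the post.

```c
#include <assert.h>

/* Achieved GFLOP/s of an M x N x K GEMM that took `seconds` to run. */
double gemm_gflops(double m, double n, double k, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * m * n * k / seconds / 1e9;
}

/* Percentage of the theoretical peak that was actually reached. */
double efficiency_percent(double achieved_gflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * achieved_gflops / peak;
}
```

With the timings above this gives roughly 40.9 GFlops for ATLAS on the CPU (about 95.8% of 42.72) and roughly 62.9 GFlops for the Tesla (about 80.6% of 78).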

Nikolai · December 13, 2009, 7:30am

…well to me it looks right (but like i said i’m new)

hmmm, according to Volkov, V., and Demmel, J. W. Benchmarking GPUs to tune dense linear algebra
volkov was able to achieve 97% w/ dgemm… which i’m assuming later became part of cuBLAS 2.0
but it seems you only achieved ~81%…

Axida · December 14, 2009, 12:34am

I still do not know how to calculate the single precision peak performance of my CPU (Xeon W3520).
Any help?

Axida

avidday · December 14, 2009, 12:42am

It should be double the double precision value, i.e. 8 floating point operations per core per cycle * 2.67 Gcycles/s * 4 cores = 85.44 GFLOP/s single precision peak.

Axida · December 14, 2009, 12:53am

I see.

Thank you so much!