CUBLAS question: a question about performance of CUBLAS

bluestorm · November 9, 2009, 8:21pm

Hey!

I was wondering if anyone can answer a small question about CUBLAS.
I am using a GTX275 board and have 2 complex matrices of 1024*1024 and 1024*1000.
This complex multiplication takes about 47 milliseconds on average (using the cublasCgemm() function), and I was wondering if there is any way I could speed it up. I need the speed-up because I am hoping to build a real-time system, so I should get this operation below 20 ms if possible.
I am new to CUDA and just read about constant memory. The first matrix (1024*1024) will not change and hence it can be set by the CPU at start up. If I use texture or constant memory for this, would I see an improvement?
Are the CUBLAS functions open source? Is there a link to them somewhere?
Thanks a lot for your help!

avidday · November 9, 2009, 9:28pm

I understand the CUBLAS source is open to registered developers. I also understand it is pretty well optimized. Your multiplication target requires a throughput of something like 367 Gflop/s, which I suspect is going to be very difficult to achieve on a GT200-based card.

bluestorm · November 9, 2009, 10:52pm

Thanks a lot for your response.

As mentioned, I am pretty new to this, so it would be great if you could clarify how you obtained the 367 Gflop/s. I get about 4.91 Gflop/s by doing (4 flops per operation * 1024*1024*1000)…but I have no idea if this is how you calculate Gflop/s.

Thanks! I will take a look at CUBLAS!

Is there a board that will provide this much? I understand that the GTX275 can do up to around 1 TFLOP (http://en.wikipedia.org/wiki/GeForce_200_Series), but I also understand that using all cores all the time is unlikely. How much should I expect from this board? Is there a way to measure this?

Sorry if I am asking silly questions, but this is really new to me. Any help will be appreciated!

Thanks,

avidday · November 9, 2009, 11:10pm

Your calculation is only an operation count; you also need to consider time, which is 20 milliseconds in this case. 4.91 Gflop in 20 ms = 248 Gflop/s. But there should be 6 flops per complex multiply, plus the summation over each vector inner product. I “guesstimated” 7MN^2 total ops, but I am no expert and I don’t know whether it is correct or not. My point was that you are expecting almost a factor of 2 performance gain over the CUBLAS implementation, and I very much doubt such a big performance win is still to be had on current hardware…
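
The arithmetic above can be sketched in a few lines of Python. This is only a rough illustration: the function and variable names are made up for the example, the 7 flops per complex multiply-add is the guesstimate from this post, and the 8 flops figure (6 for the complex multiply plus 2 for the complex add) is a commonly used convention added here for comparison. The 47 ms and 20 ms timings are the ones quoted in the thread.

```python
# Dimensions from the thread: a 1024x1024 matrix times a 1024x1000 matrix.
M, K, N = 1024, 1024, 1000


def gemm_flops(m, n, k, flops_per_mac):
    """Real flops for an (m x k) by (k x n) product: m*n*k multiply-adds."""
    return flops_per_mac * m * n * k


def gflops_per_second(flops, seconds):
    """Throughput is the operation count divided by the elapsed time."""
    return flops / seconds / 1e9


# Operation counts only (no time involved yet).
flops_guess = gemm_flops(M, N, K, 7)  # the 7*M*N^2 guesstimate (assumption)
flops_conv = gemm_flops(M, N, K, 8)   # 8 real flops per complex multiply-add

# Throughput needed for the 20 ms target, and what the 47 ms run achieves.
required = gflops_per_second(flops_guess, 0.020)
achieved = gflops_per_second(flops_guess, 0.047)

print(f"required at 20 ms: {required:.0f} Gflop/s")  # 367 Gflop/s
print(f"achieved at 47 ms: {achieved:.0f} Gflop/s")  # 156 Gflop/s
print(f"speed-up needed:   {0.047 / 0.020:.2f}x")
print(f"8-flop convention at 20 ms: {gflops_per_second(flops_conv, 0.020):.0f} Gflop/s")
```

Whichever flop convention is used, the ratio between the measured time and the target time stays the same, so the required speed-up over the current cublasCgemm() call does not depend on how the operations are counted.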

bluestorm · November 11, 2009, 12:02am

Thanks for your help! This has helped me understand things a lot better!
