CUBLAS question: a question about performance of CUBLAS

Hey!

I was wondering if anyone can answer a small question about CUBLAS.
I am using a GTX275 board and multiplying 2 complex matrices of sizes 1024*1024 and 1024*1000.
This complex multiplication takes about 47 milliseconds on average (using the cublasCgemm() function), and I was wondering if there is any way I could speed it up. I need this speedup because I am hoping to build a real-time system, so I should get this operation below 20 ms if possible.
I am new to CUDA and just read about constant memory. The first matrix (1024*1024) will not change and hence it can be set by the CPU at start up. If I use texture or constant memory for this, would I see an improvement?
Are the CUBLAS functions open source? Is there a link to them somewhere?
Thanks a lot for your help!
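
On the constant memory question, a quick size check may help. The sketch below is only back-of-the-envelope arithmetic and rests on two assumptions that are not stated in the post: that the matrix holds single-precision complex values (8 bytes each, the type cublasCgemm() works on) and that the device offers 64 KB of constant memory. The names in it are made up for illustration.

```python
# Does the fixed 1024 x 1024 matrix fit in CUDA constant memory?
# Assumption: single-precision complex elements, two 4-byte floats each.
BYTES_PER_ELEMENT = 8
# Assumption: 64 KB of constant memory on the device.
CONSTANT_MEMORY_BYTES = 64 * 1024


def matrix_bytes(rows, cols, bytes_per_element=BYTES_PER_ELEMENT):
    """Storage needed for a dense rows x cols matrix, in bytes."""
    return rows * cols * bytes_per_element


size = matrix_bytes(1024, 1024)
print(f"matrix size: {size / 2**20:.0f} MiB")
print(f"fits in constant memory: {size <= CONSTANT_MEMORY_BYTES}")
```

Under those assumptions the fixed matrix is far larger than constant memory, so it would have to stay in global memory (or be read through textures) in any case.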

I understand the CUBLAS source is open to registered developers. I also understand it is pretty well optimized. Your multiplication target requires a throughput of something like 367 Gflop/s, which I suspect is going to be very difficult to achieve on a GT200-based card.

Thanks a lot for your response.

As mentioned, I am pretty new to this, so it would be great if you could clarify how you obtained the 367 Gflop/s. I get about 4.91 Gflop/s by doing (4 flops per operation * 1024*1024*1000), but I have no idea if this is how you calculate Gflop/s.

Thanks! I will take a look at CUBLAS!

Is there a board that will provide this much? I understand that the GTX275 can do up to around 1 TFLOP (http://en.wikipedia.org/wiki/GeForce_200_Series), but I also understand that using all cores all the time is unlikely. How much should I expect from this board? Is there a way to measure this?

Sorry if I am asking silly questions, but this is really new to me. Any help will be appreciated!

Thanks,

Your calculation is only an operation count; you also need to consider time, 20 milliseconds in this case. 4.91 Gflop in 20 ms = 248 Gflop/s. But there should be 6 flops per complex multiply, plus the summation over each vector inner product. I “guesstimated” 7MN^2 total ops, but I am no expert and I don’t know whether it is correct or not. My point was that you are expecting almost a factor of 2 performance gain over the CUBLAS implementation, and I very much doubt such a big performance win is still to be had on current hardware…
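
The arithmetic above can be sketched as a short script. It assumes N = 1024 and M = 1000 for the matrix shapes, and it treats the number of real flops per complex multiply-accumulate as a counting convention: 4 as in the original question, 7 as guesstimated in the reply, and 8 as a commonly used count (6 for the complex multiply plus 2 for the complex add). The function and constant names are illustrative only. By this arithmetic, 4 * 1024 * 1024 * 1000 comes to roughly 4.19 Gflop, and the 7MN^2 count over 20 ms gives the 367 Gflop/s quoted earlier.

```python
# Rough throughput estimate for the complex matrix product discussed above.
# Assumed shapes: A is N x N and B is N x M.
N = 1024
M = 1000

TARGET_S = 0.020    # real-time budget mentioned in the thread
MEASURED_S = 0.047  # reported average cublasCgemm() time


def gflops_per_s(flops_per_cmac, seconds):
    """Sustained Gflop/s implied by finishing the product in `seconds`.

    flops_per_cmac is the number of real flops charged per complex
    multiply-accumulate; it is a counting convention, not a measurement.
    """
    total_flops = flops_per_cmac * M * N * N
    return total_flops / seconds / 1e9


conventions = [
    ("4 flops (count used in the question)", 4),
    ("7 flops (guesstimate in the reply)", 7),
    ("8 flops (6 for the multiply, 2 for the add)", 8),
]

for label, k in conventions:
    achieved = gflops_per_s(k, MEASURED_S)
    needed = gflops_per_s(k, TARGET_S)
    print(f"{label}: {achieved:.0f} Gflop/s at 47 ms, {needed:.0f} Gflop/s needed for 20 ms")

print(f"required speedup: {MEASURED_S / TARGET_S:.2f}x")
```

Whichever convention is used, the ratio between the measured and target times is the same, so the required speedup over the current cublasCgemm() call does not depend on how the flops are counted.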

Thanks for your help! This has helped me understand things a lot better!