A few questions related to CUDA and CUBLAS

I have been playing with CUDA for about 6 weeks and have a few questions for the pros out there:

  1. I am still learning how to use CUBLAS, and was wondering what the best method is to raise a matrix to a large power. I know I can use the usual multiplication kernels to achieve the result, but CPU implementations of matrix power can use ‘exponentiating by squaring’ for a nice speedup. Is there any built-in CUBLAS functionality that uses a similar idea to cut down the number of multiplications?

  2. With a single GTX 680 I get a reported throughput of as much as 996 GFLOPS (using cublasSgemm()), which seems really high. I know how this number is calculated, but is there anything misleading about it?

  3. Given the huge speed advantage of using CUDA with GPUs, why do so few people use GPUs for calculations where they make a big difference? One can get a nice speedup from just a $150 GTX 460, but instead people use OpenMP or parallel CPU clusters, which are a pain to manage, generally not as fast, less reliable, and more expensive. Am I missing something?
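
To make question 1 concrete, here is a minimal CPU-only sketch of the ‘exponentiating by squaring’ idea I have in mind. The names (Matrix, multiply, matrix_power, gemm_calls) are made up for illustration and are not part of CUBLAS; the naive multiply() is only a stand-in for where a cublasSgemm() call on device buffers would go. I am assuming square matrices stored column-major.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: square n x n matrix stored column-major
// (the CUBLAS convention), so element (row, col) is data[col * n + row].
struct Matrix {
    std::size_t n;
    std::vector<float> data;
    explicit Matrix(std::size_t n_) : n(n_), data(n_ * n_, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[c * n + r]; }
    float at(std::size_t r, std::size_t c) const { return data[c * n + r]; }
};

Matrix identity(std::size_t n) {
    Matrix m(n);
    for (std::size_t i = 0; i < n; ++i) m.at(i, i) = 1.0f;
    return m;
}

// Naive CPU product, used here only as a stand-in for a GEMM call.
// gemm_calls counts how many products were performed.
Matrix multiply(const Matrix& a, const Matrix& b, int& gemm_calls) {
    assert(a.n == b.n);
    ++gemm_calls;
    Matrix c(a.n);
    for (std::size_t col = 0; col < a.n; ++col)
        for (std::size_t k = 0; k < a.n; ++k)
            for (std::size_t row = 0; row < a.n; ++row)
                c.at(row, col) += a.at(row, k) * b.at(k, col);
    return c;
}

// Exponentiation by squaring: walks the bits of the exponent from least
// to most significant, squaring the base once per bit and folding it into
// the result whenever the current bit is set. This needs O(log p)
// products instead of p - 1.
Matrix matrix_power(Matrix base, unsigned long long exponent, int& gemm_calls) {
    Matrix result = identity(base.n);
    while (exponent > 0) {
        if (exponent & 1ULL) result = multiply(result, base, gemm_calls);
        exponent >>= 1;
        if (exponent > 0) base = multiply(base, base, gemm_calls);
    }
    return result;
}
```

With this scheme, raising a matrix to the 10th power takes 5 products (3 squarings plus 2 folds into the result) rather than 9. On the GPU I would presumably keep the base and result in device memory and ping-pong between buffers, since as far as I understand a GEMM output buffer should not alias its inputs.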

In general, most GPU versions of the algorithms I have experimented with have been anywhere from 2x as fast as the equivalent CPU implementation (simple BFS on undirected graphs, for example) to 1000x faster (CUBLAS matrix multiplication). Even sorting large arrays (with Thrust) can be many times faster. GPU kernels also tend to scale well, and the time taken to allocate memory and copy it back and forth between the host and device has been really quick, and less of an issue than I thought. The only class of algorithm that seems slower is some of the dynamic programming problems (recurrences), but even then there are a few that test at 10x faster than the CPU versions. Also, atomics on Kepler are way faster than advertised.

CUBLAS matrix multiplication is quick, but the whole ‘column-major’ aspect is really annoying, since every other time I have had to work with matrices they have been row-major.
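
One workaround I have come across for the column-major issue, sketched below on the CPU under my own assumptions: a row-major M x N buffer, read as column-major, is just the N x M transpose. Since (A*B)^T = B^T * A^T, asking a column-major GEMM for "B times A" with swapped dimensions leaves the row-major product A*B in the output buffer, with no explicit transposes. The function colmajor_gemm below is a hypothetical reference routine that only mimics the argument order of a column-major GEMM (m, n, k, A, lda, B, ldb, C, ldc); it is not the CUBLAS API, and row_major_multiply is likewise a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference GEMM with column-major semantics:
// C (m x n) = A (m x k) * B (k x n), where element (i, j) of a matrix
// with leading dimension ld is stored at buffer[i + j * ld].
void colmajor_gemm(std::size_t m, std::size_t n, std::size_t k,
                   const float* A, std::size_t lda,
                   const float* B, std::size_t ldb,
                   float* C, std::size_t ldc) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                sum += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = sum;
        }
    }
}

// Row-major product C (M x N) = A (M x K) * B (K x N) computed with the
// column-major routine above. The operands are passed in swapped order:
// the column-major routine sees B as B^T (N x K) and A as A^T (K x M),
// and writes B^T * A^T = (A * B)^T, which in memory is exactly the
// row-major layout of A * B.
std::vector<float> row_major_multiply(std::size_t M, std::size_t N, std::size_t K,
                                      const std::vector<float>& A,
                                      const std::vector<float>& B) {
    assert(A.size() == M * K);
    assert(B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    colmajor_gemm(N, M, K, B.data(), N, A.data(), K, C.data(), N);
    return C;
}
```

If I understand the CUBLAS interface correctly, the same swap should carry over to cublasSgemm() by passing the B pointer first and the A pointer second with no transpose flags, using N, M, K as the dimensions and N, K, N as the leading dimensions, but I would welcome confirmation from anyone who has done this in practice.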

Ultimately I am just trying to learn more about CUDA, and am surprised at how little sample code is out there. I really appreciate this forum and hope I do not annoy anyone with these questions and observations.