I've been playing with CUDA for about 6 weeks and have a few questions for the pros out there:
-
I am still learning how to use CUBLAS, and was wondering what the best method is to raise a matrix to a large power. I know I can use the usual multiplication kernels to achieve the result, but CPU implementations of matrix power can use ‘exponentiation by squaring’ for a nice speedup. Is there any built-in CUBLAS functionality that uses a similar idea to cut down the number of multiplications?
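For reference, this is the kind of thing I had in mind: a rough sketch of exponentiation by squaring built out of cublasSgemm calls, so A^p only needs about log2(p) multiplications instead of p-1. The function and buffer names are my own, and it assumes an n x n matrix already on the device:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Computes d_result = d_A^p for an n x n matrix by repeated squaring.
void matrix_pow(cublasHandle_t handle, int n, const float *d_A,
                float *d_result, unsigned p)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_base, *d_tmp;
    cudaMalloc(&d_base, bytes);
    cudaMalloc(&d_tmp,  bytes);
    cudaMemcpy(d_base, d_A, bytes, cudaMemcpyDeviceToDevice);

    // Start the accumulator as the identity matrix.
    std::vector<float> I((size_t)n * n, 0.0f);
    for (int i = 0; i < n; ++i) I[(size_t)i * n + i] = 1.0f;
    cudaMemcpy(d_result, I.data(), bytes, cudaMemcpyHostToDevice);

    const float one = 1.0f, zero = 0.0f;
    while (p > 0) {
        if (p & 1) {   // result = result * base
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &one, d_result, n, d_base, n, &zero, d_tmp, n);
            cudaMemcpy(d_result, d_tmp, bytes, cudaMemcpyDeviceToDevice);
        }
        p >>= 1;
        if (p > 0) {   // base = base * base
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &one, d_base, n, d_base, n, &zero, d_tmp, n);
            cudaMemcpy(d_base, d_tmp, bytes, cudaMemcpyDeviceToDevice);
        }
    }
    cudaFree(d_base);
    cudaFree(d_tmp);
}
```

The two extra device-to-device copies could be avoided by just swapping pointers, but that is the basic idea. It just feels like something CUBLAS might already do for me.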
-
With a single GTX 680 I get a reported rate of as much as 996 GFLOPS (using cublasSgemm()), which seems really high. I know how this number is calculated, but is there anything misleading about it?
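For context, this is how I am arriving at that figure: counting roughly 2*n^3 floating-point operations for an n x n sgemm and timing with CUDA events. The harness below is just a rough sketch of that setup (the names are my own):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Times a single n x n cublasSgemm and returns the achieved GFLOPS.
double sgemm_gflops(cublasHandle_t handle, int n,
                    const float *d_A, const float *d_B, float *d_C)
{
    const float one = 1.0f, zero = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_A, n, d_B, n, &zero, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    double flops = 2.0 * (double)n * n * n;     // multiply + add counted as 2 ops
    return flops / (ms * 1.0e6);                // ms * 1e6 converts to GFLOPS
}
```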
-
Given the huge speed advantage of using CUDA with GPUs, why do so few people use GPUs for calculations where they make a big difference? One can get a nice speedup from just a $150 GTX 460, but instead people use OpenMP or parallel CPU clusters, which are a pain to manage, generally not as fast, less reliable, and more expensive. Am I missing something?
In general, most GPU versions of algorithms I have experimented with have been anywhere from 2x as fast as the equivalent CPU implementation (simple BFS on undirected graphs, for example) to 1000x faster (CUBLAS matrix multiplication). Even sorting large arrays (Thrust) can be many times faster. GPU kernels also tend to scale well, and the time taken to allocate memory and copy back and forth between host and device has been really quick, and less of an issue than I thought. The only class of algorithms that seems slower is some of the dynamic programming problems (recurrences), but even then there are a few that test at 10x faster than CPU versions. Also, atomics in Kepler are way faster than advertised.
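To be clear, the Thrust sort I am talking about is nothing fancier than this (the array size and contents here are made up):

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main()
{
    const size_t N = 1 << 24;                      // ~16M ints, just an example size
    thrust::host_vector<int> h(N);
    for (size_t i = 0; i < N; ++i) h[i] = rand();

    thrust::device_vector<int> d = h;              // copy to the GPU
    thrust::sort(d.begin(), d.end());              // sorted entirely on the device
    thrust::copy(d.begin(), d.end(), h.begin());   // copy the result back
    return 0;
}
```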
CUBLAS matrix multiplication is quick, but the whole ‘column major’ aspect is really annoying, since every other time I have had to work with matrices they have been row-major.
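For what it is worth, here is a sketch of the operand-swap trick for row-major data: because CUBLAS reading a row-major array as column-major effectively sees its transpose, and (AB)^T = B^T A^T, asking for the product in reverse order gives a result that reads back correctly as row-major. The dimension names are my own (A is m x k, B is k x n, C is m x n, all row-major):

```cpp
#include <cublas_v2.h>

// C = A * B with all three matrices stored row-major, no explicit transposes.
void sgemm_row_major(cublasHandle_t handle, int m, int n, int k,
                     const float *A, const float *B, float *C)
{
    const float one = 1.0f, zero = 0.0f;
    // Swap the operands and dimensions; the column-major result CUBLAS
    // writes is exactly C = A * B when read back as row-major.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, m, k,
                &one,
                B, n,      // seen by CUBLAS as B^T (n x k)
                A, k,      // seen by CUBLAS as A^T (k x m)
                &zero,
                C, n);     // seen by CUBLAS as C^T (n x m)
}
```

It works, but I would still rather not have to think about it every time.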
Ultimately I am just trying to learn more about CUDA, and am surprised at how little sample code is out there. I really appreciate this forum and hope I do not annoy anyone with these questions and observations.
thanks!