cuBLAS-related question

I would like to ask a question about the internals of cuBLAS that, in my opinion, should be covered in the manual.
I am dealing with extremely large matrices, and memory is a limiting factor. So the natural question is: when you apply CUBLAS_OP_T to a matrix in sgemm, dgemm, or whatever other function accepts it, does cuBLAS create a new matrix internally and then multiply, or does it use some other technique? If it does not create a new matrix, I can redesign my algorithm to increase the amount of memory I can allocate. This should be mentioned in the manual; it is a very important issue.

My guess is that it does not materialize the whole transposed matrix but instead does some reordering of the element accesses. There are also no statistics on the cost of using CUBLAS_OP_T: how much does it degrade the performance of the multiplication? I am talking about a matrix of 540000x400 elements; needless to say, I work with a Tesla M2075 card to do such big calculations. In my current algorithm I avoid CUBLAS_OP_T entirely, at the expense of memory. The problem is that there are no statistics showing how much CUBLAS_OP_T degrades performance on matrices this large. Should I go ahead and use it, with some assurance that the slowdown will be insignificant compared to using plain CUBLAS_OP_N?
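For concreteness, here is a minimal sketch of the two alternatives I am weighing (names and dimensions are illustrative, error checking omitted): passing CUBLAS_OP_T directly to dgemm, versus materializing the transpose in a separate buffer (here via cublasDgeam) and then multiplying with CUBLAS_OP_N. The second variant is the one that costs me the extra memory.

/* Two ways to compute C = A^T * B, where A is p x m and B is p x n,
 * both column-major and already resident on the device. */
#include <cublas_v2.h>

void atb_with_op_t(cublasHandle_t h, int m, int n, int p,
                   const double *dA, const double *dB, double *dC)
{
    const double alpha = 1.0, beta = 0.0;
    /* No extra allocation: the transpose is expressed via CUBLAS_OP_T */
    cublasDgemm(h, CUBLAS_OP_T, CUBLAS_OP_N, m, n, p,
                &alpha, dA, p, dB, p, &beta, dC, m);
}

void atb_with_explicit_transpose(cublasHandle_t h, int m, int n, int p,
                                 const double *dA, const double *dB,
                                 double *dAt,   /* extra m*p element buffer */
                                 double *dC)
{
    const double one = 1.0, zero = 0.0, alpha = 1.0, beta = 0.0;
    /* Materialize A^T in its own buffer, then multiply with CUBLAS_OP_N */
    cublasDgeam(h, CUBLAS_OP_T, CUBLAS_OP_N, m, p,
                &one, dA, p, &zero, dAt, m, dAt, m);
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, p,
                &alpha, dAt, m, dB, p, &beta, dC, m);
}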

I would recommend timing the GEMM variants you would like to use. In general, GEMM handles all transposition modes at “full speed”. However, the various combinations of the transposition of A and B matrices lend themselves to different optimization approaches and thus some performance variation is inevitable. In the cases involving large matrices that I have measured, those differences have been less than 10%. By the way: the transpose=(N,N) combination wasn’t always the fastest.

Here is a quick example of DGEMM performance on a K20c, for all four transpose mode combinations.
The difference between the slowest and the fastest case is 10.8%. Here the (N,T) combination is the fastest variant. As I recall from my own past work on CUBLAS this transpose combination lends itself most easily to effective optimization. All four modes run at full speed, around 1 TFLOPS.

args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13280010 sec GFLOPS=1034.93

args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13872910 sec GFLOPS=990.7

args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.12521601 sec GFLOPS=1097.61

args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13652611 sec GFLOPS=1006.69

Note that GEMMs on matrices with extreme aspect ratios can have lower performance than GEMMs on square-ish matrices. I mention this because your matrices have an aspect ratio in excess of 1000:1. In any event, you can easily measure the performance of any transpose mode combination and size you plan to use, along the lines of the sketch below.
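Here is a minimal timing sketch (not the exact harness used for the numbers above): it assumes square matrices already resident on the device and a valid cuBLAS handle, uses CUDA events for timing, and omits error checking for brevity.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

static const char *opname(cublasOperation_t op)
{
    return (op == CUBLAS_OP_N) ? "N" : "T";
}

void time_dgemm_variants(cublasHandle_t handle, const double *dA,
                         const double *dB, double *dC, int dim)
{
    const double alpha = -1.0, beta = 2.0;   /* same alpha/beta as the runs above */
    cublasOperation_t ops[2] = { CUBLAS_OP_N, CUBLAS_OP_T };
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            /* warm-up call so the timed run does not pay one-time overheads */
            cublasDgemm(handle, ops[i], ops[j], dim, dim, dim,
                        &alpha, dA, dim, dB, dim, &beta, dC, dim);
            cudaEventRecord(start);
            cublasDgemm(handle, ops[i], ops[j], dim, dim, dim,
                        &alpha, dA, dim, dB, dim, &beta, dC, dim);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            double gflops = 2.0 * dim * dim * dim / (ms * 1e-3) / 1e9;
            printf("ta=%s tb=%s elapsed=%.6f sec GFLOPS=%.2f\n",
                   opname(ops[i]), opname(ops[j]), ms * 1e-3, gflops);
        }
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}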

Thanks for the timings. I carefully examined the structure of my matrices and managed to squeeze things into a matrix 20 times smaller. I will also use cuSPARSE for my computations. I will bring the Tesla card to its limits… I have to squeeze a hyperspectral image into the GPU, and it is a non-linear optimization with linear inequality and equality constraints. The matrix of the inequality constraints can get really huge. Luckily it has redundancy and it is not random; I know the indices of the non-zero elements. So, all in all, I am pushing the envelope, turning what many researchers have claimed impossible into something possible. Anyway, thanks for the input.

I assume you are aware that sparse matrix computation is typically bound by memory bandwidth, rather than being bound by computational throughput as in dense matrix operations like GEMM.

Yes, but the number of memory accesses is much, much lower than for the full matrix: we are talking about 1000000 x 400 accesses for the dense representation compared to 50000 x 400 accesses in the sparse representation, a factor of 20. So the use of cuSPARSE is justified, and I will not overdo its use.

Yep, I can proudly say that I made the impossible possible…

Good to hear it all worked out well. Maybe you could share a pointer to your work once it is published.

Yes, it worked out pretty well: 50 times faster than the serial implementation, with 10 times less memory.
And yes, I will share a pointer once it is published.

Also, about cuSPARSE: csrmv is the only function I am calling, since I have eliminated the need for any sparse matrix x matrix multiplication. Does anybody know what kind of optimization it uses? Does it exploit the structure of the matrix to achieve coalesced memory accesses?
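For reference, this is the kind of call I am making: a minimal sketch using the cusparseDcsrmv entry point, with illustrative array names and no error checking. The matrix is held in CSR form (values, row pointers, column indices), which is where I exploit knowing the indices of the non-zero elements.

#include <cusparse_v2.h>

/* y = alpha * A * x + beta * y, with the m x n matrix A given in CSR format
 * (d_val, d_rowPtr, d_colInd), all arrays already on the device. */
void csr_matvec(cusparseHandle_t handle, int m, int n, int nnz,
                const double *d_val, const int *d_rowPtr, const int *d_colInd,
                const double *d_x, double *d_y)
{
    const double alpha = 1.0, beta = 0.0;
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz,
                   &alpha, descr, d_val, d_rowPtr, d_colInd, d_x, &beta, d_y);

    cusparseDestroyMatDescr(descr);
}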

Needless to say, there is no option for sparse x sparse matrix multiplication, only sparse x dense, so I am lucky that I do not need it.

Anyway, as a conclusion: in my research, the libraries provided with the CUDA SDK (cuBLAS, cuSPARSE, Thrust) were sufficient for the very difficult problem I got involved in, and I will mention that in my paper.

It is good to hear that the effort invested into libraries is paying off for our customers. The library team is always interested in getting feedback on what functionality works well, and where additional functionality or performance could be helpful.

Unfortunately, I can't share more of my experience on this project, for reasons you may understand. When it is done and published, I might request a place in the CUDA Spotlight, where I can describe my experience with CUDA and how it helped me achieve what was claimed to be impossible. I will get in contact with you in a few months. Thanks for your time, help, and feedback.

I would like to make a point by quoting again the phrase in my signature:

“A famous researcher in parallelism said: Parallelism is a great problem. The basic question is, though: are you part of the solution or part of the problem?”

What is the key to success in every serious parallel application? Know your maths. You need to exhaust things serially using maths first, then look at the bottlenecks and parallelize them. Do not put everything into the GPU grinder and call it a day. The specific problem I got involved in was extremely complex and required my best mathematical and programming skills. So, to any student following this thread, I want to pass on that maths is an absolutely necessary condition for success in scientific HPC programming.

This is why I regard the creation of compilers that automatically translate a problem into a parallel one as much tougher than the translation of one language into another. It would require capturing the tremendous mathematical abstraction that the human brain can achieve; impossible for the next 20 years or more. With this I conclude. :-)