cuBLAS related question

AlexanderAgathos · January 28, 2013, 9:09pm

I would like to ask a question on the internals of cuBLAS which should have been mentioned in the Manual.
I am dealing with extremely big matrices and memory limitation is an issue. So a normal question would be when you apply CUBLAS_OP_T to a matrix in sgemm, dgemm or whatever function accepts it does cuBLAS create a new matrix internally and multiply or does it use some other technique. If it does not create a new Matrix then I can redesign my algorithm to increase the memory I can allocate. This should have been mentioned in the Manual. Its a very important issue.

AlexanderAgathos · January 28, 2013, 9:35pm

Anyway my guess is that it does not create the whole transpose matrix but it does some reordering of elements. There are also no statistics of using CUBLAS_OP_T. How much does it degrade performance of the multiplication. I am talking about a matrix of 540000x400 elements. Of course needless to say I work with a Tesla card M2075 to do such big calculations. In my algorithm I totally avoid CUBLAS_OP_T at the expense of memory. The problem is that there are NO STATISTICS showing how CUBLAS_OP_T degrades performance on such big matrices. Should I go ahead and use it with a guarantee that the degrade will be insignificant compared to using just CUBLAS_OP_N?

njuffa · January 28, 2013, 10:39pm

I would recommend timing the GEMM variants you would like to use. In general, GEMM handles all transposition modes at “full speed”. However, the various combinations of the transposition of A and B matrices lend themselves to different optimization approaches and thus some performance variation is inevitable. In the cases involving large matrices that I have measured, those differences have been less than 10%. By the way: the transpose=(N,N) combination wasn’t always the fastest.

njuffa · January 29, 2013, 8:09am

Here is a quick example of DGEMM performance on a K20c, for all four transpose mode combinations.
The difference between the slowest and the fastest case is 10.8%. Here the (N,T) combination is the fastest variant. As I recall from my own past work on CUBLAS this transpose combination lends itself most easily to effective optimization. All four modes run at full speed, around 1 TFLOPS.

args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13280010 sec GFLOPS=1034.93

args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13872910 sec GFLOPS=990.7

args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.12521601 sec GFLOPS=1097.61

args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13652611 sec GFLOPS=1006.69

Note that GEMMs on matrices with extreme aspect ratio could have lower performance than GEMMs on square-ish matrices. I am mentioning this because your matrices have an aspect ratio in excess of 1000:1. In any event, you can easily measure the performance of any transpose mode combination and size you plan to use.

AlexanderAgathos · January 29, 2013, 7:19pm

njuffa:

Here is a quick example of DGEMM performance on a K20c, for all four transpose mode combinations.
The difference between the slowest and the fastest case is 10.8%. Here the (N,T) combination is the fastest variant. As I recall from my own past work on CUBLAS this transpose combination lends itself most easily to effective optimization. All four modes run at full speed, around 1 TFLOPS.

args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13280010 sec GFLOPS=1034.93

args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13872910 sec GFLOPS=990.7

args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.12521601 sec GFLOPS=1097.61

args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13652611 sec GFLOPS=1006.69

Note that GEMMs on matrices with extreme aspect ratio could have lower performance than GEMMs on square-ish matrices. I am mentioning this because your matrices have an aspect ratio in excess of 1000:1. In any event, you can easily measure the performance of any transpose mode combination and size you plan to use.

Thanks for the timings. I carefully examined the structure of my matrices and managed to squeeze things to a 20 times smaller matrix. I will also use cuSparse for my computations. I will bring the Tesla card to its limits…I have to squeeze a Hyperspectral Image to the GPU and its a non-linear optimization with linear inequality and equality constraints. The matrix of the inequality constraints can get really huge. Luckily it has reundancy and its non random, I know the indices of the non zero elements. So all in all I am pushing the envelope making the impossible claimed by many Researchers to possible. Anyway thanks for the input.

njuffa · January 29, 2013, 7:47pm

I assume you are aware that sparse matrix computation is typically bound by memory bandwidth, rather than being bound by computational throughput as in dense matrix operations like GEMM.

AlexanderAgathos · January 29, 2013, 8:02pm

Yes but the size reduction in memory accesses is much much lower than the full matrix. we are talking about 1000000x400 accesses compared to 50000 x 400 accesses in the sparse representation…so the use of cuSparse is justified and I will not overdo it using it.

AlexanderAgathos · February 1, 2013, 7:16am

Yep I can prowdly say that I made the impossible possible …

njuffa · February 1, 2013, 7:19am

Good to hear it all worked out well. Maybe you could share a pointer to your work once it is published.

AlexanderAgathos · February 1, 2013, 8:44am

Yes it worked out prestty well 50 times faster than the serial implementation and 10 times less memory.
Yes I will share a pointer once published.

AlexanderAgathos · February 4, 2013, 10:44am

Also in cuSparse in csrmv, this is the only function I am calling. I have eliminated the need for any sparse matrix x matrix multiplication. Does anybody know what kind of optimization is followed? Does it exploit the structure of the matrix to achieve coalescing?

AlexanderAgathos · February 4, 2013, 10:46am

Needless to say there is no option for sparse x sparse matrix multiplication only sparse x dense. So I am lucky that I do not need any.

AlexanderAgathos · February 4, 2013, 10:50am

Anyway as a conclusion in my Research the libraries provided by CUDA SDK were sufficient for this very difficult I got involved and I will mention it in my paper. cuBLAS, cuSparse, Thrust.

njuffa · February 4, 2013, 3:19pm

It is good to hear that the effort invested into libraries is paying off for our customers. The library team is always interested in getting feedback on what functionality works well, and where additional functionality or performance could be helpful.

AlexanderAgathos · February 4, 2013, 3:48pm

Unfortunately I cant share more of my experience on this project due to reasons you may understand. When its done and published I might request a place in the CUDA spotlight. There I can describe my experience with CUDA and how it helped me to achieve the so claimed impossible. So I will come in contact with you in a few months. Thanks for your time, help and feedback.

AlexanderAgathos · February 6, 2013, 10:17am

I would like to make a point quoting again the phrase in my signature :

“A famous researcher in parallelism said : Parallelism is a great problem. The basic question is though are you part of the solution or part of the problem?”

What is the key of success in every serious Parallel application? Know your Maths. You need to exhaust things serially using Maths and then see the bottlenecks to parallelize them. DO not put everything in the GPU grinder and call it a day. The specific problem I got involved was extremely complex and involved my best Mathematical and programming skills. So to a student attending this thread I need to pass him that Maths is an extremely necessary condition for his/her success in Scientific HPC programming.

AlexanderAgathos · February 6, 2013, 10:27am

This is why I regard the creation of compilers to automatically translate a problem into a parallel problem much more tougher than the translation of a language to another. There is a need to know the tremendous Mathematical abstraction that the human brain can achieve. Impossible for the next 20 years or more. With this I conclude. :-)

Topic		Replies	Views
CuBLAS Showing Poor Performance CUDA Programming and Performance	6	1275	December 20, 2013
CUBLAS and coalesced operations Access patterns inside CUBLAS code? CUDA Programming and Performance	7	3662	September 21, 2008
benchmark CUDA CuBLas and OpenCL CUDA Programming and Performance	13	28196	February 1, 2011
Alternatives to cublas matrix multiplication (row major) CUDA Programming and Performance	9	17516	April 7, 2010
cusparseLtMatmul is slower than cublasGemmEx GPU-Accelerated Libraries cublas , cusparse	0	651	April 21, 2023
cusparseSpMM fp32 is slower than cublas cublasSgemm GPU-Accelerated Libraries cublas , cusparse	3	642	April 10, 2023
How to speed-up matrix multiplication using CUBLAS? CUDA Programming and Performance	6	7602	September 1, 2010
Low CuBLAS performance CUDA Programming and Performance	3	509	January 15, 2019
Slow CUDA SGEMM CUDA Programming and Performance	5	752	September 15, 2022
Cublas and Matrix Multiplication of a transpose matrix with CUBLAS CUDA Programming and Performance	10	13493	June 14, 2010

cuBLAS related question

Related topics