Disappointing CuBlas performance


I have been working on a visualization algorithm lately, which involves computing the product of three dense matrices for every “frame”. The matrix product is written as: A = MPN.

P is actually a small tri-tensor of size p x p x p, where p is in the range 4 to 20. M and N are much larger, ie. 1000 x p and p x 1000 respectively. However, only P is updated for every frame.

Since the result of this computation was to be used by GPU shaders, I implemented a crude GPGPU-style blocked matrix multiply using OpenGL shaders (very much based on the memory model described by Gavindaraju et. al. in their SC06 paper), but was planning to move to cublas as soon as it was released on Linux.

The big day was finally here yesterday, and porting my code to cublas was a breeze. Basically all I do is:

float* d_M; float* d_N; float* d_P; float* d_tmp; float* d_A; // Handles for GPU memory

cublasAlloc( m*p, sizeof(float), reinterpret_cast<void**>( d_M ) ); // Ditto for the rest of the handles

cublasSetVector( ...); // For M and N

while (rendering) (


  cublasSetVector( p*p*p, sizeof(float), P.data(), 1, d_P, 1 );

  // For every tensor slice

  for ( uint i = 0; i < p; ++i ) (

      // tmp = P*N

      cublasSgemm( 'n', 'n', p, n, p, 1.0f, d_T + i*p*p, p, d_N, p, 0.0f, d_tmp, p );

      // A = M*tmp     

      cublasSgemm( 'n', 'n', m, n, p, 1.0f, d_M,  m, d_tmp, p, 0.0f, d_A + i*m*n, m );


   // Use the result...


The numerics are correct and yield the correct answer. However performance is about a third(!) of my own GPU matrix multiply. (210 “FPS” versus 640 “FPS” when I only do the P calculation and the matrix multiply and p =4, m=n=1024)

It should be said that in my own GPGPU matrix multiply I do 4-slices at a time in a texture lookup for the P-tensor. I had the impression that the G80 was a scalar architechture, so there was no advantage to doing 4-component arithmetic. Of course, I get away with 1/4 of the number of readings of M and N. Would this explain the extreme drop off performance?

I have very little experience with the BLAS interface, is there a better way to do tensor products than my slice-by-slice approach?

Thanks in advande for any advice, comments etc.

Thanks for the feedback. We’re always interested in any performance results people are obtaining.

The main issue here appears to be the size of your matrices. cuBLAS is optimized for operating on large matrices. Matrices should not be narrower than 16 in any dimension, preferably not less than 32.

The CPU overhead for launching a kernel is quite high in the current CUDA release, so where possible you should try and do as much work as possible per cuBLAS call.

I’m not familiar with tensor products, but is it possible you could combine some of your vectors together into a larger matrix?