 # Disappointing CuBlas performance

Hello.

I have been working on a visualization algorithm lately, which involves computing the product of three dense matrices for every “frame”. The matrix product is written as: A = MPN.

P is actually a small tri-tensor of size p x p x p, where p is in the range 4 to 20. M and N are much larger, ie. 1000 x p and p x 1000 respectively. However, only P is updated for every frame.

Since the result of this computation was to be used by GPU shaders, I implemented a crude GPGPU-style blocked matrix multiply using OpenGL shaders (very much based on the memory model described by Gavindaraju et. al. in their SC06 paper), but was planning to move to cublas as soon as it was released on Linux.

The big day was finally here yesterday, and porting my code to cublas was a breeze. Basically all I do is:

``````float* d_M; float* d_N; float* d_P; float* d_tmp; float* d_A; // Handles for GPU memory

cublasAlloc( m*p, sizeof(float), reinterpret_cast<void**>( d_M ) ); // Ditto for the rest of the handles

cublasSetVector( ...); // For M and N

while (rendering) (

calculateP();

cublasSetVector( p*p*p, sizeof(float), P.data(), 1, d_P, 1 );

// For every tensor slice

for ( uint i = 0; i < p; ++i ) (

// tmp = P*N

cublasSgemm( 'n', 'n', p, n, p, 1.0f, d_T + i*p*p, p, d_N, p, 0.0f, d_tmp, p );

// A = M*tmp

cublasSgemm( 'n', 'n', m, n, p, 1.0f, d_M,  m, d_tmp, p, 0.0f, d_A + i*m*n, m );

)

// Use the result...

)
``````

The numerics are correct and yield the correct answer. However performance is about a third(!) of my own GPU matrix multiply. (210 “FPS” versus 640 “FPS” when I only do the P calculation and the matrix multiply and p =4, m=n=1024)

It should be said that in my own GPGPU matrix multiply I do 4-slices at a time in a texture lookup for the P-tensor. I had the impression that the G80 was a scalar architechture, so there was no advantage to doing 4-component arithmetic. Of course, I get away with 1/4 of the number of readings of M and N. Would this explain the extreme drop off performance?

I have very little experience with the BLAS interface, is there a better way to do tensor products than my slice-by-slice approach?