cublasSgemv & Transfer Time


I was comparing the total time (allocation, transfer, computation) of a MATLAB matrix-vector multiplication against a CUDA implementation, using MATLAB's MEX interface as suggested by NVIDIA. In contrast to the matrix-matrix multiplication cublasSgemm, cublasSgemv performs poorly: up to a 6x speed-up for cublasSgemm, but at least a 2.5x slow-down for cublasSgemv!

Both programs are written the same way; the only differences are the set/get functions used (cublasSetMatrix/SetVector and cublasGetMatrix/GetVector) and the computation function (cublasSgemm vs. cublasSgemv).

See the results:



Where is the problem?

To find the bottleneck, I wrote the following test program. I get strange results when the matrix dimension is 1000x1000: 97% of the overall time is spent transferring data! Is there a problem with cublasSetVector/cublasGetVector, or with the timing functions?


//! Run a simple timing test for CUBLAS Sgemv
void runTest(int argc, char** argv)
{
    unsigned int timer_allocation = 0;
    unsigned int timer_transfer = 0;
    unsigned int timer_computation = 0;
    float total_time = 0;

    float* device_A = NULL;
    float* device_x = NULL;
    float* device_y = NULL;

    float* host_A = NULL;
    float* host_x = NULL;
    float* host_y = NULL;

    int m = 0, n = 0;
    int mem_size_matrix, mem_size_x, mem_size_y;

    printf("Enter matrix dimensions!\n");
    printf("Enter number of rows: ");
    scanf("%d", &m);
    printf("Enter number of cols: ");
    scanf("%d", &n);

    mem_size_matrix = sizeof(float) * m * n;
    mem_size_x = sizeof(float) * n;  // x has n entries
    mem_size_y = sizeof(float) * m;  // y = A*x has m entries

    // allocate host memory
    host_A = (float*) malloc(mem_size_matrix);
    host_x = (float*) malloc(mem_size_x);
    host_y = (float*) malloc(mem_size_y);

    // initialize host data
    randomInit(host_A, m * n);
    randomInit(host_x, n);
    randomInit(host_y, m);

    cublasInit();

    cutCreateTimer( &timer_allocation);
    cutCreateTimer( &timer_transfer);
    cutCreateTimer( &timer_computation);

    // allocate device memory
    cutStartTimer( timer_allocation);
    cudaMalloc( (void**) &device_A, mem_size_matrix);
    cudaMalloc( (void**) &device_x, mem_size_x);
    cudaMalloc( (void**) &device_y, mem_size_y);
    cutStopTimer( timer_allocation);

    // copy host memory to device
    cutStartTimer( timer_transfer);
    cublasSetMatrix(m, n, sizeof(float), host_A, m, device_A, m);
    cublasSetVector(n, sizeof(float), host_x, 1, device_x, 1);
    cublasSetVector(m, sizeof(float), host_y, 1, device_y, 1);
    cutStopTimer( timer_transfer);

    // computation: y = 1.0 * A * x + 1.0 * y
    cutStartTimer( timer_computation);
    cublasSgemv('n', m, n, 1.0f, device_A, m, device_x, 1, 1.0f, device_y, 1);
    cudaThreadSynchronize();  // Sgemv returns asynchronously; wait before stopping the timer
    cutStopTimer( timer_computation);

    // copy result from device to host
    cutStartTimer( timer_transfer);
    cublasGetVector(m, sizeof(float), device_y, 1, host_y, 1);
    cutStopTimer( timer_transfer);

    total_time = cutGetTimerValue(timer_allocation)
               + cutGetTimerValue(timer_transfer)
               + cutGetTimerValue(timer_computation);

    printf( "Allocation : %f (ms) : %3.2f %%\n", cutGetTimerValue(timer_allocation), 100 * cutGetTimerValue(timer_allocation) / total_time);
    printf( "Transfer   : %f (ms) : %3.2f %%\n", cutGetTimerValue(timer_transfer), 100 * cutGetTimerValue(timer_transfer) / total_time);
    printf( "Computation: %f (ms) : %3.2f %%\n", cutGetTimerValue(timer_computation), 100 * cutGetTimerValue(timer_computation) / total_time);
    printf( "Overall    : %f (ms)\n", total_time);

    cutDeleteTimer( timer_allocation);
    cutDeleteTimer( timer_transfer);
    cutDeleteTimer( timer_computation);

    // cleanup memory
    free( host_A);
    free( host_x);
    free( host_y);
    cudaFree( device_A);
    cudaFree( device_x);
    cudaFree( device_y);
    cublasShutdown();
}

// Fills an array with random float entries in [0, 1].
void randomInit(float* data, int size)
{
    for (int i = 0; i < size; ++i)
        data[i] = rand() / (float)RAND_MAX;
}


Thanks for the help. Cem

You are comparing a level-3 BLAS call (Sgemm) with a level-2 BLAS call (Sgemv).
For Sgemm you move O(N^2) data and perform O(N^3) flops, while for Sgemv you move O(N^2) data and perform only O(N^2) flops. So the transfer time weighs much more heavily for the level-2 call.

The best way to use CUBLAS is to group several calls together: move the data to the device and apply several BLAS functions before transferring the results back.
You could also try pinned memory on the host (this is not possible from MATLAB right now, but in a standalone program you can use cudaMallocHost instead of malloc to allocate the memory and get faster transfers).
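A minimal sketch of the pinned-memory variant: only the host allocation and its matching free change, the copy itself is identical. The sizes here are arbitrary example values.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1000 * 1000;
    const size_t bytes = n * sizeof(float);

    float *host_A = NULL, *device_A = NULL;

    // Page-locked (pinned) allocation instead of malloc(); the driver can
    // then DMA directly, which typically raises host<->device bandwidth.
    cudaMallocHost((void**)&host_A, bytes);
    cudaMalloc((void**)&device_A, bytes);

    // The transfer call is unchanged; only the host allocation differs.
    cudaMemcpy(device_A, host_A, bytes, cudaMemcpyHostToDevice);

    cudaFree(device_A);
    cudaFreeHost(host_A);  // pinned memory needs cudaFreeHost, not free()
    return 0;
}
```

Note that pinned memory is a limited resource; allocating too much of it can degrade overall system performance, so use it for the transfer buffers only.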

Ahh yes :-) That is true. However, I thought the transfer time would be small enough to compete with MATLAB's built-in function.

I am now computing the number of level-2 operations needed to get a better transfer-to-computation time ratio.

Thanks for your help.

After a deeper investigation, I found that it is not only the amount of data transferred, but also the poor transfer data rate and the measured performance of the matrix-vector multiplication itself. Two observations:


1. The effective transfer rate increases roughly linearly with the data size in the range 0 MB - 20 MB, approximately y = 20e3 * x, where x is the size in MB and y is the transfer rate. Above a transfer size of about 20 MB the rate is roughly constant at about 2 GB/s.


2. Considering the computation time alone, the measured performance also increases steadily for matrix sizes between 1 and 1500, and is more or less constant beyond size 2000, with a peak performance of 11.5 GFlop/s.

After analyzing both factors that contribute to the total computation time, we can say:

With a single matrix-vector multiplication, cublasSgemv has no chance of beating MATLAB's built-in function or the Intel Math Kernel Library, no matter what matrix size is used. This is primarily caused by the channel characteristics of the PCI-Express bus mentioned in point 1.

When we increase the number of iterations so that the transfer time becomes negligible, we can still only expect a speed-up of at most 1.5x. This is explained by point 2.

So I hope the performance of the level-2 operations improves; otherwise it only makes sense to use them when the matrix dimension is greater than 1500 AND the number of iterations is greater than 5. This is what I found.