cublasSgemv & TransferTime

Hi,

I was comparing the total time (allocation, transfer, computation) of a MATLAB matrix-vector multiplication against a CUDA matrix-vector multiplication called through a mex file, as suggested by NVIDIA. In contrast to the matrix-matrix multiplication cublasSgemm, cublasSgemv performs poorly: at best a 6x speed-up for cublasSgemm, but at least a 2.5x slow-down for cublasSgemv!

Both programs are written in the same way; the only differences are the set/get functions used (cublasSetMatrix/cublasGetMatrix vs. cublasSetVector/cublasGetVector) and the computation routines (cublasSgemm vs. cublasSgemv).
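For reference, a minimal mex wrapper for the Sgemv case looks roughly like the sketch below. This is a simplified illustration, not my exact code: it assumes the inputs are already single precision (single() in MATLAB) and omits all error checking.

#include "mex.h"
#include "cublas.h"

/* y = A*x computed on the GPU via cublasSgemv (sketch, no status checks) */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int m = (int) mxGetM(prhs[0]);
    int n = (int) mxGetN(prhs[0]);
    float *A = (float*) mxGetData(prhs[0]);
    float *x = (float*) mxGetData(prhs[1]);
    float *dA, *dx, *dy;

    plhs[0] = mxCreateNumericMatrix(m, 1, mxSINGLE_CLASS, mxREAL);

    cublasInit();
    cublasAlloc(m * n, sizeof(float), (void**) &dA);
    cublasAlloc(n, sizeof(float), (void**) &dx);
    cublasAlloc(m, sizeof(float), (void**) &dy);

    cublasSetMatrix(m, n, sizeof(float), A, m, dA, m);
    cublasSetVector(n, sizeof(float), x, 1, dx, 1);

    cublasSgemv('n', m, n, 1.0f, dA, m, dx, 1, 0.0f, dy, 1);

    cublasGetVector(m, sizeof(float), dy, 1, (float*) mxGetData(plhs[0]), 1);

    cublasFree(dA);
    cublasFree(dx);
    cublasFree(dy);
    cublasShutdown();
}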

See the results:

SGEMM: [timing results image]

SGEMV: [timing results image]

Where is the problem?

To see where the bottleneck is, I wrote the following test program. I get strange results when the matrix dimension is 1000x1000: 97% of the overall time is spent transferring data?! Is this a problem with cublasSetVector/cublasGetVector, or with the timing functions?

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas.h>
#include <cutil.h>

void randomInit(float* data, int size);

void
runTest(int argc, char** argv)
{
    unsigned int timer_allocation = 0;
    unsigned int timer_transfer = 0;
    unsigned int timer_computation = 0;
    float total_time = 0;

    float* device_A = NULL;
    float* device_x = NULL;
    float* device_y = NULL;

    float* host_A = NULL;
    float* host_x = NULL;
    float* host_y = NULL;

    int m = 0, n = 0;
    int mem_size_matrix, mem_size_x, mem_size_y;

    printf("Enter matrix dimensions!\n");
    printf("Enter number of rows: ");
    scanf("%d", &m);
    printf("Enter number of cols: ");
    scanf("%d", &n);

    mem_size_matrix = sizeof(float) * m * n;
    mem_size_x = sizeof(float) * n;   // x has n elements
    mem_size_y = sizeof(float) * m;   // y has m elements for op = 'n'

    // allocate host memory
    host_A = (float*) malloc(mem_size_matrix);
    host_x = (float*) malloc(mem_size_x);
    host_y = (float*) malloc(mem_size_y);

    // initialize host data
    randomInit(host_A, m * n);
    randomInit(host_x, n);
    randomInit(host_y, m);

    cublasInit();

    cutCreateTimer(&timer_allocation);
    cutCreateTimer(&timer_transfer);
    cutCreateTimer(&timer_computation);

    // allocate device memory
    cutStartTimer(timer_allocation);
    cudaMalloc((void**) &device_A, mem_size_matrix);
    cudaMalloc((void**) &device_x, mem_size_x);
    cudaMalloc((void**) &device_y, mem_size_y);
    cutStopTimer(timer_allocation);

    // copy host memory to device
    cutStartTimer(timer_transfer);
    cublasSetMatrix(m, n, sizeof(float), host_A, m, device_A, m);
    cublasSetVector(n, sizeof(float), host_x, 1, device_x, 1);
    cublasSetVector(m, sizeof(float), host_y, 1, device_y, 1);
    cutStopTimer(timer_transfer);

    // computation: y = A * x + y
    cutStartTimer(timer_computation);
    cublasSgemv('n', m, n, 1.0f, device_A, m, device_x, 1, 1.0f, device_y, 1);
    cudaThreadSynchronize();   // the launch is asynchronous; wait before stopping the timer
    cutStopTimer(timer_computation);

    // copy result from device to host
    cutStartTimer(timer_transfer);
    cublasGetVector(m, sizeof(float), device_y, 1, host_y, 1);
    cutStopTimer(timer_transfer);

    total_time = cutGetTimerValue(timer_allocation) + cutGetTimerValue(timer_transfer) + cutGetTimerValue(timer_computation);

    printf("Allocation : %f (ms) : %3.2f %% \n", cutGetTimerValue(timer_allocation), 100 * cutGetTimerValue(timer_allocation) / total_time);
    printf("Transfer   : %f (ms) : %3.2f %% \n", cutGetTimerValue(timer_transfer), 100 * cutGetTimerValue(timer_transfer) / total_time);
    printf("Computation: %f (ms) : %3.2f %% \n", cutGetTimerValue(timer_computation), 100 * cutGetTimerValue(timer_computation) / total_time);
    printf("Overall    : %f (ms)\n", total_time);

    cutDeleteTimer(timer_allocation);
    cutDeleteTimer(timer_transfer);
    cutDeleteTimer(timer_computation);

    // cleanup memory
    free(host_A);
    free(host_x);
    free(host_y);
    cublasFree(device_A);
    cublasFree(device_x);
    cublasFree(device_y);
    cublasShutdown();
}

// Fills an array with random float entries in [0, 1].
void randomInit(float* data, int size)
{
    for (int i = 0; i < size; ++i)
        data[i] = rand() / (float)RAND_MAX;
}

Thanks for the help. Cem

You are comparing a level-3 BLAS call (Sgemm) with a level-2 BLAS call (Sgemv).
For Sgemm you are moving O(N^2) data and performing O(N^3) flops, while for Sgemv you are moving O(N^2) data and performing only O(N^2) flops. So the transfer time weighs much more heavily for the level-2 call.
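A quick back-of-the-envelope example makes this concrete (my own illustrative numbers for N = 1000, single precision, not measurements from this thread):

#include <stdio.h>

/* Rough data-moved vs. flops comparison for square N x N operands. */
int main(void)
{
    double N = 1000.0;

    double gemm_bytes = 3.0 * N * N * sizeof(float);       /* A, B, C */
    double gemm_flops = 2.0 * N * N * N;                   /* O(N^3)  */

    double gemv_bytes = (N * N + 2.0 * N) * sizeof(float); /* A, x, y */
    double gemv_flops = 2.0 * N * N;                       /* O(N^2)  */

    printf("Sgemm: %6.1f MB moved, %8.3f GFLOP, %8.2f flop/byte\n",
           gemm_bytes / 1e6, gemm_flops / 1e9, gemm_flops / gemm_bytes);
    printf("Sgemv: %6.1f MB moved, %8.3f GFLOP, %8.2f flop/byte\n",
           gemv_bytes / 1e6, gemv_flops / 1e9, gemv_flops / gemv_bytes);
    return 0;
}

For N = 1000 this gives roughly 167 flops per transferred byte for Sgemm but only about 0.5 flops per byte for Sgemv, so the PCI Express transfer dominates the level-2 call.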

The best way to use CUBLAS is to group several calls together (move the data to the device and apply several BLAS functions before you transfer the results back).
You could also try to use pinned memory on the host (this is not possible from MATLAB right now, but if you are writing a standalone program you can use cudaMallocHost instead of malloc to allocate the memory and get faster transfers).
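A rough standalone sketch of both suggestions (pinned host allocations plus several gemv calls per transfer; sizes, iteration count and the missing error checking are for illustration only):

#include <stdio.h>
#include <cuda_runtime.h>
#include "cublas.h"

int main(void)
{
    int m = 1000, n = 1000, iters = 10, i;
    float *hA, *hx, *hy;
    float *dA, *dx, *dy;

    /* pinned (page-locked) host memory instead of malloc */
    cudaMallocHost((void**) &hA, sizeof(float) * m * n);
    cudaMallocHost((void**) &hx, sizeof(float) * n);
    cudaMallocHost((void**) &hy, sizeof(float) * m);
    /* ... fill hA and hx here ... */

    cublasInit();
    cublasAlloc(m * n, sizeof(float), (void**) &dA);
    cublasAlloc(n, sizeof(float), (void**) &dx);
    cublasAlloc(m, sizeof(float), (void**) &dy);

    /* pay the host->device transfer once ... */
    cublasSetMatrix(m, n, sizeof(float), hA, m, dA, m);
    cublasSetVector(n, sizeof(float), hx, 1, dx, 1);

    /* ... and amortise it over several level-2 calls on the device */
    for (i = 0; i < iters; ++i)
        cublasSgemv('n', m, n, 1.0f, dA, m, dx, 1, 0.0f, dy, 1);

    /* single transfer back */
    cublasGetVector(m, sizeof(float), dy, 1, hy, 1);

    cublasFree(dA);
    cublasFree(dx);
    cublasFree(dy);
    cudaFreeHost(hA);
    cudaFreeHost(hx);
    cudaFreeHost(hy);
    cublasShutdown();
    return 0;
}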

Ahh yes :-) That is true. However, I thought the transfer time would be small enough to compete with MATLAB's built-in function.

I am now working out how many level-2 operations have to be grouped together to get a better transfer-to-computation time ratio.
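The break-even point can be estimated like this (the timings below are made-up placeholders that have to be replaced by measured values):

#include <stdio.h>

/* The GPU pays the transfer once and then has to save enough time per
   call. Break-even condition: k * t_cpu > t_transfer + k * t_gpu.     */
int main(void)
{
    double t_transfer = 12.0;  /* ms, one-off transfer of A and x (placeholder) */
    double t_gpu      = 0.4;   /* ms per cublasSgemv call (placeholder)         */
    double t_cpu      = 0.6;   /* ms per MATLAB / MKL sgemv call (placeholder)  */

    double k = t_transfer / (t_cpu - t_gpu);
    printf("GPU starts to pay off after about %.0f gemv calls\n", k);
    return 0;
}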

Thanks for your help.

After a deeper investigation, I found that it is not only the amount of data to transfer, but also the poor transfer rate and the measured performance of the matrix-vector multiplication itself.

1.)

I measured a roughly linear increase of the transfer rate with the size of the data in the range 0 MB - 20 MB, approximately y = 20e3 * x, where x is the transfer size in MB and y is the transfer rate. Above a transfer size of about 20 MB the rate stays roughly constant at about 2 GB/s.
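A transfer-rate measurement for a single size can be done roughly like this (cudaEvent timing, synchronous copy from pageable host memory; the 4 MB size is just an example value):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t bytes = 4 << 20;                /* 4 MB; vary this value */
    float *h = (float*) malloc(bytes);
    float *d = NULL;
    cudaEvent_t start, stop;
    float ms = 0.0f;

    memset(h, 0, bytes);
    cudaMalloc((void**) &d, bytes);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);

    printf("%.1f MB in %.3f ms -> %.2f GB/s\n",
           bytes / 1e6, ms, (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    free(h);
    return 0;
}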

2.)

When only the computation time is considered, the measured performance also increases steadily for matrix dimensions between 1 and 1500 and is more or less constant above a dimension of about 2000, with a peak performance of 11.5 GFlop/s.
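The GFlop/s number is simply 2*m*n flops divided by the measured kernel time, for example (the time value below is made up, not one of my measurements):

#include <stdio.h>

/* Effective Sgemv performance from a measured kernel time. */
int main(void)
{
    double m = 2000.0, n = 2000.0;
    double time_ms = 0.7;                 /* measured Sgemv time in ms (placeholder) */
    double flops = 2.0 * m * n;           /* one multiply and one add per A entry    */
    printf("%.2f GFlop/s\n", flops / (time_ms * 1e6));
    return 0;
}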

After analyzing both factors that contribute to the total time, we can say:

With a single matrix-vector multiplication, cublasSgemv has no chance of beating the built-in MATLAB function or the Intel Math Kernel Library, no matter which matrix size is chosen. In the first instance this is caused by the characteristics of the PCI Express bus described in point 1.

When we increase the number of iterations so that the transfer time can be neglected, we can still only expect a speed-up of about 1.5 at most. This follows from point 2.

So I hope that the performance of the level-2 routines will improve; otherwise it does not make sense to use them unless the matrix dimension is greater than 1500 AND the number of iterations is greater than 5. This is what I found out.