cuBlas performance dramatically drops after some iterations

kimran · January 18, 2015, 3:14pm

Dear All,

I run cublasSgemv(…) on the same inputs n times in a loop and print the performance of each run of cublasSgemv(…).

The performance is relatively stable per run for the first 1036 runs.
Then the performance becomes around 100x worse, suddenly and stays stabil around the new, worse value from there on.

Nothing else is done in the loop!
The iteration # at which the performance drops is always the same, but different if I change other parameters, e.g. the dimensions of the problem.

So the main code is:

clock_t start,end;
 for(i = 0; i < steps; ++i)
 {
  start = clock();
  cublasSgemv(handle,trans,N,M,&alpha,dW,N,dx,1,&beta,dx,1);
  end = clock();
  double dur = 1000.0*(end-start)/CLOCKS_PER_SEC;
  printf("%d: %f\n",i,dur);
 }

Before that I create the cublasHandle, allocate the memmory, after that I free it and destroy…i.e. standard stuff.

This was tested for a 5000x5000 Matrix, Ubuntu 14.04.1 LTS, NVIDIA Corporation GK104GL [GRID K520], code built using nvcc without any options.

Full code:

#include <stdio.h>
#include <stdlib.h>
#include "cublas_v2.h"
#include <time.h>

// Run cublasSgemv in loop
// Performance drops 100x after approx. 1000 iterations
// Using NVIDIA Corporation GK104GL [GRID K520]
// Ubuntu 14.04.1 LTS
int main(int _argc,char **_argv)
{
 // how often to repeat cublas operation
 int steps = atoi(_argv[1]);
 
 // looping variables
 int i,j;
 
 // dimensions of problem: matrix is NxM
 int N = 5000;
 int M = 5000;
 
 // Alloc host memory, declare all pointers
 float *W = 0;
 float *x = 0;
 float *dW;
 float *dx;
 W = (float*)malloc(N*M*sizeof(*W));
 x = (float*)malloc(M*sizeof(*x));

 // Set W and x to zero
 for(i = 0; i < M; ++i)
 {
  for(j = 0; j < N; ++j)
  {
   W[i*N + j] = 0.0f;
  }
  x[i] = 0.0f;
 }
 
 // Create CUDA context
 cublasHandle_t handle;
 cublasCreate(&handle);

 // Alloc device memory
 cudaMalloc((void**)&dW,N*M*sizeof(*W));
 cudaMalloc((void**)&dx,M*sizeof(*x));
 
 // Set W and x on device
 cublasSetMatrix(N,M,sizeof(*W),W,N,dW,N);
 cublasSetVector(M,sizeof(*x),x,1,dx,1);

 // We will calculate x=alpha*W*x + beta*x
 float alpha = 1.0f;
 float beta = 1.0f;
 cublasOperation_t trans = CUBLAS_OP_N;
 
 clock_t start,end;
 for(i = 0; i < steps; ++i)
 {
  start = clock();
  cublasSgemv(handle,trans,N,M,&alpha,dW,N,dx,1,&beta,dx,1);
  end = clock();
  double dur = 1000.0*(end-start)/CLOCKS_PER_SEC;
  printf("%d: %f\n",i,dur);
 }
 
 // Outnit
 cudaFree(dW);
 cudaFree(dx);
 cublasDestroy(handle);
 
 free(W);
 free(x);
 
 return 0;
 
}

Result on my machine (actually aws gpu instance):
until iteration 1035: ca. 0.006 [ms]
iteration 1036: ca. 0.52 [ms]
from iteration 1037 on: ca. 0.98 [ms]

Please help, whats going on? Is this expected? Losing 100x performance suddenly is not desirable.

Thanks,
kimran

Vectorizer · January 18, 2015, 3:42pm

How can one copy the code without the line numbers?

kimran · January 18, 2015, 3:53pm

Sorry, line #'s were added when using the

block…here’s the code without format:

#include <stdio.h>
#include <stdlib.h>
#include “cublas_v2.h”
#include <time.h>

// Run cublasSgemv in loop
// Performance drops 100x after approx. 1000 iterations
// Using NVIDIA Corporation GK104GL [GRID K520]
// Ubuntu 14.04.1 LTS
int main(int _argc,char **_argv)
{
// how often to repeat cublas operation
int steps = atoi(_argv[1]);

// looping variables
int i,j;

// dimensions of problem: matrix is NxM
int N = 5000;
int M = 5000;

// Alloc host memory, declare all pointers
float W = 0;
float x = 0;
float dW;
float dx;
W = (float)malloc(NMsizeof(W));
x = (float)malloc(Msizeof(*x));

// Set W and x to zero
for(i = 0; i < M; ++i)
{
for(j = 0; j < N; ++j)
{
W[i*N + j] = 0.0f;
}
x[i] = 0.0f;
}

// Create CUDA context
cublasHandle_t handle;
cublasCreate(&handle);

// Alloc device memory
cudaMalloc((void**)&dW,NMsizeof(W));
cudaMalloc((void*)&dx,M*sizeof(*x));

// Set W and x on device
cublasSetMatrix(N,M,sizeof(*W),W,N,dW,N);
cublasSetVector(M,sizeof(*x),x,1,dx,1);

// We will calculate x=alphaWx + beta*x
float alpha = 1.0f;
float beta = 1.0f;
cublasOperation_t trans = CUBLAS_OP_N;

clock_t start,end;
for(i = 0; i < steps; ++i)
{
start = clock();
cublasSgemv(handle,trans,N,M,&alpha,dW,N,dx,1,&beta,dx,1);
end = clock();
double dur = 1000.0*(end-start)/CLOCKS_PER_SEC;
printf(“%d: %f\n”,i,dur);
}

// Outnit
cudaFree(dW);
cudaFree(dx);
cublasDestroy(handle);

free(W);
free(x);

return 0;

}

njuffa · January 18, 2015, 4:59pm

Most of the work of cublasSgemv() is done by a kernel that runs on the GPU, asynchronously. What the code in its present form is measuring is the time it takes for the host code to issue the kernel launch to the GPU. Kernel launches are placed in a queue of finite depth, for example 1024 entries. Once the queue is full, the host code will stall until the next kernel launch can be inserted into the queue. This is why you observe slowdown after approximately 1000 iterations.

If you place a call to cudaDeviceSynchronize() directly after the call to cublasSgemv() the host code will wait until cublasSgemv() completes, and you will get realistic measurements. Since the call to cudaDeviceSynchronize() itself also takes some time, you may want to calibrate your measurements accordingly.

In general, use of the CUDA profiler is the easiest way to get accurate measurements for execution times of kernels. For your purposes, the minimal profiling capability built into the CUDA driver may be sufficient. You can turn it on by exporting the environment variable CUDA_PROFILE=1. This will cause a simple log file to be written. Don’t forget to unset CUDA_PROFILE when you are done, as profiling can cause some slowdown.

kimran · January 18, 2015, 6:48pm

Thanks!

Your answer sounds plausible and I just verified using cudaDeviceSynchronize()

kimran

Topic		Replies	Views
CUBLAS VS CBLAS sgemv Benchmarking matrix-vector operations on GPU and CPU CUDA Programming and Performance	5	9994	March 24, 2014
Is it correct that my Pascal card is calling Maxwell_gemm kernels through cublas? And if so, why is cublas unusably slow for me? CUDA Programming and Performance	6	923	August 23, 2018
GTX 660 and Nano performance drop-off after sustained matrix multiplies CUDA Programming and Performance	16	779	July 15, 2022
Time Measurement for CUBLAS why time (clock()) for CUBLAS is always 0 ms for any array size? CUDA Programming and Performance	2	2614	March 21, 2009
Slow CUDA SGEMM CUDA Programming and Performance	5	576	September 15, 2022
CUBLAS + Kernels + double precision + "for loop" = strange behavior CUDA Programming and Performance	17	4310	September 25, 2009
CUBLAS problem CUDA Programming and Performance	16	3501	July 1, 2010
CUBLAS Level 1 and Level 2 BLAS has 0 computaional time. Is it correct? Assesment of the CUBLAS leve CUDA Programming and Performance	3	3684	April 24, 2009
CUBLAS matrix-vector multiplication CUDA Programming and Performance	14	9925	January 20, 2010
CUBLAS calling sgemm() multiple times - Change in CPU time CUDA Programming and Performance	4	1828	August 9, 2018

cuBlas performance dramatically drops after some iterations

Related topics