cuBLAS performance drops dramatically after some iterations

Dear All,

I run cublasSgemv(…) on the same inputs n times in a loop and print the timing of each individual call to cublasSgemv(…).

The performance is relatively stable per run for the first 1036 runs.
Then the performance suddenly becomes around 100x worse and stays stable around the new, worse value from there on.

Nothing else is done in the loop!
The iteration # at which the performance drops is always the same, but it changes if I change other parameters, e.g. the dimensions of the problem.

So the main code is:

clock_t start,end;
 for(i = 0; i < steps; ++i)
 {
  start = clock();
  cublasSgemv(handle,trans,N,M,&alpha,dW,N,dx,1,&beta,dx,1);
  end = clock();
  double dur = 1000.0*(end-start)/CLOCKS_PER_SEC;
  printf("%d: %f\n",i,dur);
 }

Before that I create the cuBLAS handle and allocate the memory; after that I free the memory and destroy the handle… i.e. standard stuff.

This was tested with a 5000x5000 matrix on Ubuntu 14.04.1 LTS with an NVIDIA GK104GL [GRID K520]; the code was built using nvcc without any options.
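(For reference, building this requires linking against the cuBLAS library, so the build command would have been something along the lines of nvcc test.cu -lcublas; the source file name is assumed here.)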

Full code:

#include <stdio.h>
#include <stdlib.h>
#include "cublas_v2.h"
#include <time.h>

// Run cublasSgemv in loop
// Performance drops 100x after approx. 1000 iterations
// Using NVIDIA Corporation GK104GL [GRID K520]
// Ubuntu 14.04.1 LTS
int main(int _argc,char **_argv)
{
 // how often to repeat the cuBLAS operation (from the command line)
 if(_argc < 2)
 {
  printf("usage: %s <steps>\n",_argv[0]);
  return 1;
 }
 int steps = atoi(_argv[1]);
 
 // looping variables
 int i,j;
 
 // dimensions of problem: matrix is NxM
 int N = 5000;
 int M = 5000;
 
 // Alloc host memory, declare all pointers
 float *W = 0;
 float *x = 0;
 float *dW;
 float *dx;
 W = (float*)malloc(N*M*sizeof(*W));
 x = (float*)malloc(M*sizeof(*x));

 // Set W and x to zero
 for(i = 0; i < M; ++i)
 {
  for(j = 0; j < N; ++j)
  {
   W[i*N + j] = 0.0f;
  }
  x[i] = 0.0f;
 }
 
 // Create CUDA context
 cublasHandle_t handle;
 cublasCreate(&handle);

 // Alloc device memory
 cudaMalloc((void**)&dW,N*M*sizeof(*W));
 cudaMalloc((void**)&dx,M*sizeof(*x));
 
 // Set W and x on device
 cublasSetMatrix(N,M,sizeof(*W),W,N,dW,N);
 cublasSetVector(M,sizeof(*x),x,1,dx,1);

 // We will calculate x=alpha*W*x + beta*x
 float alpha = 1.0f;
 float beta = 1.0f;
 cublasOperation_t trans = CUBLAS_OP_N;
 
 clock_t start,end;
 for(i = 0; i < steps; ++i)
 {
  start = clock();
  cublasSgemv(handle,trans,N,M,&alpha,dW,N,dx,1,&beta,dx,1);
  end = clock();
  double dur = 1000.0*(end-start)/CLOCKS_PER_SEC;
  printf("%d: %f\n",i,dur);
 }
 
 // Clean up device and host resources
 cudaFree(dW);
 cudaFree(dx);
 cublasDestroy(handle);
 
 free(W);
 free(x);
 
 return 0;
 
}

Result on my machine (actually an AWS GPU instance):
until iteration 1035: ca. 0.006 [ms]
iteration 1036: ca. 0.52 [ms]
from iteration 1037 on: ca. 0.98 [ms]

Please help, what's going on? Is this expected? Suddenly losing 100x performance is not desirable.

Thanks,
kimran


Most of the work of cublasSgemv() is done by a kernel that runs on the GPU, asynchronously. What the code in its present form is measuring is the time it takes the host code to issue the kernel launch to the GPU. Kernel launches are placed in a queue of finite depth, for example 1024 entries. Once the queue is full, the host code stalls until the next kernel launch can be inserted into the queue. This is why you observe a slowdown after approximately 1000 iterations.

If you place a call to cudaDeviceSynchronize() directly after the call to cublasSgemv() the host code will wait until cublasSgemv() completes, and you will get realistic measurements. Since the call to cudaDeviceSynchronize() itself also takes some time, you may want to calibrate your measurements accordingly.
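For illustration, here is what the measurement loop from the code above might look like with that fix applied (a minimal sketch; variable names are taken from the posted code):

clock_t start,end;
for(i = 0; i < steps; ++i)
{
 start = clock();
 cublasSgemv(handle,trans,N,M,&alpha,dW,N,dx,1,&beta,dx,1);
 // Block until the kernel launched by cublasSgemv() has actually
 // finished, so that (end - start) covers the GPU execution time
 // rather than just the launch overhead.
 cudaDeviceSynchronize();
 end = clock();
 double dur = 1000.0*(end-start)/CLOCKS_PER_SEC;
 printf("%d: %f\n",i,dur);
}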

In general, use of the CUDA profiler is the easiest way to get accurate measurements for execution times of kernels. For your purposes, the minimal profiling capability built into the CUDA driver may be sufficient. You can turn it on by exporting the environment variable CUDA_PROFILE=1. This will cause a simple log file to be written. Don’t forget to unset CUDA_PROFILE when you are done, as profiling can cause some slowdown.
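For example, assuming a bash shell: run export CUDA_PROFILE=1, execute the program, and then unset CUDA_PROFILE when finished; the driver typically writes the log to a file named cuda_profile_0.log in the working directory (that is the usual default name, but it may differ on your setup).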

Thanks!

Your answer sounds plausible, and I just verified it using cudaDeviceSynchronize().

kimran