Dear All,

I run cublasSgemv(…) on the same inputs n times in a loop and print the performance of each run of cublasSgemv(…).

The performance is relatively stable per run for the first 1036 runs.

Then the performance becomes around 100x worse, suddenly and stays stabil around the new, worse value from there on.

Nothing else is done in the loop!

The iteration # at which the performance drops is always the same, but different if I change other parameters, e.g. the dimensions of the problem.

So the main code is:

```
clock_t start,end;
for(i = 0; i < steps; ++i)
{
start = clock();
cublasSgemv(handle,trans,N,M,&alpha,dW,N,dx,1,&beta,dx,1);
end = clock();
double dur = 1000.0*(end-start)/CLOCKS_PER_SEC;
printf("%d: %f\n",i,dur);
}
```

Before that I create the cublasHandle, allocate the memmory, after that I free it and destroy…i.e. standard stuff.

This was tested for a 5000x5000 Matrix, Ubuntu 14.04.1 LTS, NVIDIA Corporation GK104GL [GRID K520], code built using nvcc without any options.

Full code:

```
#include <stdio.h>
#include <stdlib.h>
#include "cublas_v2.h"
#include <time.h>
// Run cublasSgemv in loop
// Performance drops 100x after approx. 1000 iterations
// Using NVIDIA Corporation GK104GL [GRID K520]
// Ubuntu 14.04.1 LTS
int main(int _argc,char **_argv)
{
// how often to repeat cublas operation
int steps = atoi(_argv[1]);
// looping variables
int i,j;
// dimensions of problem: matrix is NxM
int N = 5000;
int M = 5000;
// Alloc host memory, declare all pointers
float *W = 0;
float *x = 0;
float *dW;
float *dx;
W = (float*)malloc(N*M*sizeof(*W));
x = (float*)malloc(M*sizeof(*x));
// Set W and x to zero
for(i = 0; i < M; ++i)
{
for(j = 0; j < N; ++j)
{
W[i*N + j] = 0.0f;
}
x[i] = 0.0f;
}
// Create CUDA context
cublasHandle_t handle;
cublasCreate(&handle);
// Alloc device memory
cudaMalloc((void**)&dW,N*M*sizeof(*W));
cudaMalloc((void**)&dx,M*sizeof(*x));
// Set W and x on device
cublasSetMatrix(N,M,sizeof(*W),W,N,dW,N);
cublasSetVector(M,sizeof(*x),x,1,dx,1);
// We will calculate x=alpha*W*x + beta*x
float alpha = 1.0f;
float beta = 1.0f;
cublasOperation_t trans = CUBLAS_OP_N;
clock_t start,end;
for(i = 0; i < steps; ++i)
{
start = clock();
cublasSgemv(handle,trans,N,M,&alpha,dW,N,dx,1,&beta,dx,1);
end = clock();
double dur = 1000.0*(end-start)/CLOCKS_PER_SEC;
printf("%d: %f\n",i,dur);
}
// Outnit
cudaFree(dW);
cudaFree(dx);
cublasDestroy(handle);
free(W);
free(x);
return 0;
}
```

Result on my machine (actually aws gpu instance):

until iteration 1035: ca. 0.006 [ms]

iteration 1036: ca. 0.52 [ms]

from iteration 1037 on: ca. 0.98 [ms]

Please help, whats going on? Is this expected? Losing 100x performance suddenly is not desirable.

Thanks,

kimran