CUBLAS iteration processing time increases with iteration count

I have the following code, which calculates d_s such that d_x = d_Q*d_s by iterating the following two equations:

d_e = d_x - d_Q*d_s

d_s = d_s + alpha * d_W * d_e

where d_W is the transpose of d_Q
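In matrix form (with subscripts added here only to index the iterations, and W = Q^T as stated above), each pass computes

$$e_k = x - Q\,s_k, \qquad s_{k+1} = s_k + \alpha\, Q^{\top} e_k$$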

for (i = 0; i < 6; ++i) {
    cutStartTimer(timer);

    cublasScopy(DIMX, d_x, 1, d_e, 1);                                   /* d_e = d_x               */
    cublasSgemv('N', DIMX, DIMS, -1.0, d_Q, DIMX, d_s, 1, 1.0, d_e, 1);  /* d_e = d_e - d_Q * d_s   */
    cublasSgemv('N', DIMS, DIMX, alpha, d_W, DIMS, d_e, 1, 1.0, d_s, 1); /* d_s = d_s + alpha*W*d_e */

    cutStopTimer(timer);
}

If the iteration count is below six, the average processing time for the three calls is around 0.05 ms per iteration. When I increase the iteration count to 100, the average becomes 1.5 ms per iteration (DIMX = 16*1024, DIMS = 16*16).

If I omit the third CUBLAS call, the slowdown is much smaller.

I’m using a GeForce 8800 GTX with the clock and memory speeds decreased to the minimum; the same slowdown occurs if I reset them to the default values. I also use the card for the display, but if I’m right that only imposes the 5-second watchdog limit.

Do you have any idea why this slowdown happens when I increase the iteration count?

Enable the profiler (export CUDA_PROFILE=1) and check the time for each call. They should be the same.

It may be an artifact of the CUDA app being swapped out for a time slice. As the program runs longer, the probability of this swap increases if you have a single GPU for display/compute.

Thanks for the help. I checked the times with the profiler, and they are the same in each iteration.

The CUDA timer measured small times in the first few iterations because CUBLAS calls return asynchronously.

I do not really understand what it means that “CUBLAS calls return asynchronously”. Could you explain?

I would also like to know what exactly gputime and cputime mean.

“method=[ sgemvn_main ] gputime=[ 573.120 ] cputime=[ 584.323 ] occupancy=[ 0.667 ]”

Does sgemv need 573.120 + 584.323 ms to execute a matrix-vector multiplication, or does it take an overall time of 584.323 ms? And why is the CPU needed for sgemv at all?

It means that after the kernel is launched, control returns to the CPU without waiting for the kernel to complete.
To get accurate timing, call cudaThreadSynchronize() before reading your timer, or enable the profiler, which forces blocking calls.
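For example, here is a minimal sketch of the timing loop from the first post with synchronization added (assuming the same cutil timer and CUBLAS v1 setup as above), so the timer measures the kernels rather than just the asynchronous launches:

for (i = 0; i < 6; ++i) {
    cudaThreadSynchronize();   /* drain any pending work before starting the timer  */
    cutStartTimer(timer);

    cublasScopy(DIMX, d_x, 1, d_e, 1);
    cublasSgemv('N', DIMX, DIMS, -1.0, d_Q, DIMX, d_s, 1, 1.0, d_e, 1);
    cublasSgemv('N', DIMS, DIMX, alpha, d_W, DIMS, d_e, 1, 1.0, d_s, 1);

    cudaThreadSynchronize();   /* wait for the kernels to finish before stopping it */
    cutStopTimer(timer);
}

Without the second synchronize, the timer only measures how long it takes to queue the three calls; the queued work then backs up until an implicit synchronization point, which is why the later iterations appeared slower.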

The overall time for that call was 584.323 us, not ms. The difference between cputime and gputime (here 584.323 - 573.120 ≈ 11.2 us) is the overhead of launching the kernel from the CPU.

Very nice :) Thank you, this helps a lot.