CUBLAS iteration processing time increases with iteration count

I have the following code, which calculates d_s such that d_x = d_Q*d_s by iterating the following two equations:

d_e = d_x - d_Q*d_s

d_s = d_s + alpha * d_W * d_e

where d_W is the transpose of d_Q
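In matrix form (with subscripts added here only to index the iterations, and W = Q^T as stated above), each pass computes

$$e_k = x - Q\,s_k, \qquad s_{k+1} = s_k + \alpha\, Q^{\top} e_k$$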

for (i = 0; i < 6; ++i) {
    cutStartTimer(timer);

    cublasScopy(DIMX, d_x, 1, d_e, 1);                                   /* d_e = d_x               */
    cublasSgemv('N', DIMX, DIMS, -1.0, d_Q, DIMX, d_s, 1, 1.0, d_e, 1);  /* d_e = d_e - d_Q * d_s   */
    cublasSgemv('N', DIMS, DIMX, alpha, d_W, DIMS, d_e, 1, 1.0, d_s, 1); /* d_s = d_s + alpha*W*d_e */

    cutStopTimer(timer);
}

If the iteration count is below six, the average processing time for the three calls is around 0.05 ms per iteration. When I increase the iteration count to 100, the average becomes 1.5 ms per iteration (DIMX = 16*1024, DIMS = 16*16).

If I omit the third CUBLAS call, the slowdown is much smaller.

I’m using a GeForce 8800 GTX with the clock and memory speeds decreased to the minimum; the same slowdown occurs if I reset them to the default values. I also use the card for the display, but if I’m right that only imposes the 5-second watchdog limit.

Do you have any idea why this slowdown happens when I increase the iteration count?

Enable the profiler (export CUDA_PROFILE=1) and check the time for each call. They should be the same.

It may be an artifact of the CUDA app being swapped out for a time slice. As the program runs longer, the probability of this swap increases if you have a single GPU for display/compute.

Thanks for the help. I checked the times with the profiler, and they are the same in each iteration.

The CUDA timer measured small times in the first few iterations because CUBLAS calls return asynchronously.

I do not really understand what it means that “CUBLAS calls return asynchronously”. Could you explain?

I would also like to know what exactly gputime and cputime mean.

“method=[ sgemvn_main ] gputime=[ 573.120 ] cputime=[ 584.323 ] occupancy=[ 0.667 ]”

Does sgemv need 573.120 + 584.323 ms to execute a matrix-vector multiplication, or does it take an overall time of 584.323 ms? And why is the CPU needed for sgemv at all?

It means that after the kernel is launched, control returns to the CPU without waiting for the kernel to complete.
To get accurate timing, call cudaThreadSynchronize() before reading your timer, or enable the profiler, which forces blocking calls.
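For example, here is a minimal sketch of the timing loop from the first post with synchronization added (assuming the same cutil timer and CUBLAS v1 setup as above), so the timer measures the kernels rather than just the asynchronous launches:

for (i = 0; i < 6; ++i) {
    cudaThreadSynchronize();   /* drain any pending work before starting the timer  */
    cutStartTimer(timer);

    cublasScopy(DIMX, d_x, 1, d_e, 1);
    cublasSgemv('N', DIMX, DIMS, -1.0, d_Q, DIMX, d_s, 1, 1.0, d_e, 1);
    cublasSgemv('N', DIMS, DIMX, alpha, d_W, DIMS, d_e, 1, 1.0, d_s, 1);

    cudaThreadSynchronize();   /* wait for the kernels to finish before stopping it */
    cutStopTimer(timer);
}

Without the second synchronize, the timer only measures how long it takes to queue the three calls; the queued work then backs up until an implicit synchronization point, which is why the later iterations appeared slower.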

The overall time for that call was 584.323 us, not ms. The difference between cputime and gputime (here 584.323 - 573.120 ≈ 11.2 us) is the overhead of launching the kernel from the CPU.

Very nice :) Thank you, this helps a lot.