Speed not stable and performance lost: maybe a HUGE bug

I am working on applying CUDA to image processing, and recently I have found something rather confusing.

Environment: CUDA 1.0, 8800 GTS, 4 GB memory, Core 2 Duo 6500E, 162.01_int driver, Windows XP, DX9 (I think it has nothing to do with DX, though)
shared memory usage: ~15 KB/block

In my program, I have to run a kernel 74 times. I wrote the call out 74 times rather than using a for loop.

__global__ Myfunc();

…74 times…
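For reference, the 74 separate calls could be replaced by a plain host-side loop; each <<<...>>> launch is queued independently either way. A minimal sketch, with placeholder grid/block sizes (not the poster's actual configuration):

```cuda
__global__ void Myfunc()
{
    // kernel body elided
}

int main()
{
    dim3 grid(64), block(256);       // placeholder launch configuration

    for (int i = 0; i < 74; ++i)
        Myfunc<<<grid, block>>>();   // launches are queued asynchronously

    cudaThreadSynchronize();         // wait for all 74 kernels to finish
    return 0;
}
```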

I timed a single run and it took only 0.041 ms, which is pretty good. The first 20 runs cost only 1-2 ms. GOOD.
40 runs: 14 ms.
But I have to run it 74 times, and that cost 53 ms! This is by no means fast compared to the CPU.

In the SDK “transpose” example a similar problem occurred:
iterations : naive : optimized
1 : 0.580 : 0.011
2 : 0.292 : 0.011
8 : 0.081 : 0.012 // still the 10x speedup the intro promised
9 : 0.072 : 0.435 // !
16 : 0.046 : 3.553 // not only unstable, but also much slower
As the iteration count increases further, the speed is still not stable and the performance loss is huge:
40 : 2.207 : 1.895
100 : 3.156 : 1.051
200 : 3.473 : 0.767 // optimized 75x slower, naive 6x slower

I don't know what caused this. Is the compiler hiding the memory transfer time? Is CPU-GPU communication causing the slowdown?

Has anybody run the transpose example and got a better result? Which speed is the real one? Can the lost performance be recovered by changing settings, or by using CUDA 1.1 and a later driver? Thanks for any replies.

At least, it may be that the timing function is not reliable enough to estimate the speed of your program.

Kernels are launched asynchronously, and the queue depth for kernel launches is ~16 in my tests. You clearly are not calling cudaThreadSynchronize() to wait for the kernel to complete before marking the end time.
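The failure mode can be sketched with the SDK's cutil timer (cutStartTimer/cutStopTimer are host-side wall-clock calls; the kernel name and launch configuration are illustrative):

```cuda
unsigned int timer = 0;
cutCreateTimer(&timer);

// WRONG: the launch returns immediately, so this times only launch overhead
cutStartTimer(timer);
Myfunc<<<grid, block>>>();
cutStopTimer(timer);

// RIGHT: block until the kernel has actually finished before stopping the timer
cutResetTimer(timer);
cutStartTimer(timer);
Myfunc<<<grid, block>>>();
cudaThreadSynchronize();
cutStopTimer(timer);
```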

Thanks for the reply.
So is the slower number the real speed, or can performance be improved by synchronization?

It is impossible to even know what you are measuring now, so I have no idea what the “real” execution time is. Some of the values you quote could be measuring the time of the last 4 kernel launches, or the last 2. Until you put a cudaThreadSynchronize() before every call to your timer, you are measuring garbage data and can draw absolutely no conclusions from it.

Synchronizing cannot enhance performance; it can only slow things down by preventing the driver from overlapping CPU and GPU computation. Synchronizing will, however, cause you to measure the proper times of kernel launches.

Note that you can also use the CUDA PROFILER for this task. See the docs in the toolkit download.

Thanks, really.

Another question:
You mentioned that the times I quoted could be the time of the last 2 or 4 kernel launches, but in the transpose example it is an average time over the iterations.
Is there something wrong with the example? The initial value of the iteration count is 1, so is the 0.580 / 0.011 result reliable?

There is nothing wrong with the way the transpose sample measures average time. cudaThreadSynchronize() is called before starting and stopping each timer. So the elapsed time is exactly the time for as many kernels as were launched between the cudaThreadSynchronize() calls.
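That pattern can be sketched as follows (numIterations, the timer handle, and the kernel arguments are illustrative names, not the sample's exact code):

```cuda
cudaThreadSynchronize();                 // drain any previously queued work
cutStartTimer(timer);

for (int i = 0; i < numIterations; ++i)
    transpose<<<grid, threads>>>(d_odata, d_idata, size_x, size_y);

cudaThreadSynchronize();                 // ensure every iteration has completed
cutStopTimer(timer);

float avgMs = cutGetTimerValue(timer) / numIterations;  // average per launch
```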