Oscillating performance, code total times vary

nosoul · June 14, 2009, 3:34am

I wrote code that takes 2.000000 ms most of the time, but occasionally it takes around 40.00000 ms. The code is something like:

func:
    8 different memsets
    kernel 1
    kernel 2
    for {
        kernel 3
        kernel 4
        kernel 5
        kernel 6
    }

This happens after a few runs of the function. Each kernel or memset call on its own (with everything else commented out), whether inside or outside the for loop, shows this behavior. I'm running this function 5 times, each for 7 cycles, and around 200 times.

I ruled out floating point issues, since even the memsets show this behavior. Any suggestions on what the problem may be?

_Big_Mac · June 14, 2009, 2:06pm

How do you measure the run times? Do you remember that kernel launches are asynchronous and get implicitly synced @ memcpys/memsets?

nosoul · June 14, 2009, 8:57pm

To get the time I do:

	// create and start timer
	unsigned int timer = 0;
	CUT_SAFE_CALL(cutCreateTimer(&timer));
	CUT_SAFE_CALL(cutStartTimer(timer));

	func::

	// stop and destroy timer
	CUT_SAFE_CALL(cutStopTimer(timer));
	printf("Processing time: %f (ms) \n", cutGetTimerValue(timer));
	CUT_SAFE_CALL(cutDeleteTimer(timer));

And I do this for n steps, for example 7 steps, all inside the for loop, 5 times (everything). This doesn't cause any issue for five runs, but if I call func 5 times, 20 times over (20 batches of 5 func calls = 20 * 5 calls, each running func for 7 steps), I occasionally get a variance in the run time of func (2 to 40 ms). It hits one of the 5 calls, apparently with no specific pattern: one call takes 40 ms and all the others take 2 ms, and it is never the same call as the previous problematic 40 ms one.

Smokey · June 15, 2009, 1:14am

CUDA is well known to have erratic performance spikes when saturating the GPU with kernel calls (I've brought this issue up multiple times with multiple theories; no one from NVIDIA has yet explained what the cause might be, or offered a way to better profile my code so that I can understand the cause).
I can hear tmurray coming now to harass you about using cutil.

tmurray · June 15, 2009, 3:47am

Time for my favorite thing: blaming cutil! CPU timers are not very good indicators of what the GPU is doing, especially when you have GPU timers. The correct way to time kernels is with events, which also removes the need for cudaThreadSynchronize.

I’m willing to guess you probably forgot a cudaThreadSynchronize OR the cutil timers are not high precision for whatever reason.
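A minimal sketch of the event-based timing suggested above, for illustration only. The kernel `dummyKernel`, the buffer size, and the launch configuration are hypothetical stand-ins for the original poster's code, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernels being timed.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;  // assumed problem size
    float *d_data = NULL;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Events are recorded into the stream, so they are timestamped on the GPU.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                        // enqueue "start" in the default stream
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n); // work to be timed
    cudaEventRecord(stop, 0);                         // enqueue "stop" after the kernel

    cudaEventSynchronize(stop);                       // wait only until "stop" has been reached
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);           // elapsed GPU time in milliseconds
    printf("Processing time: %f (ms)\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

The host still has to wait for the result, but `cudaEventSynchronize` waits on a single event rather than on all outstanding device work.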

Sarnath · June 15, 2009, 6:12am

Something I learned from an old thread:

Sometimes, when the screen saver kicks in, the GPU clocks are brought down, which can increase your turnaround time.

Also, some power management features (which screen savers use, I guess) can bring down GPU clocks.

nosoul · June 15, 2009, 6:03pm

:) I liked the cutil suggestion (I'm already using it). As for using the CUDA libraries, I didn't use CUBLAS because many of its ready-made functions forced me to make more calls than programming the kernels myself.

The suggestion to replace cudaThreadSynchronize is actually very interesting! Gonna try that and see what happens.

cudaThreadSynchronize solves the problem but kills much of the performance; that's my issue now. It's a kernel call bottleneck, I'm sure of that by now. It is related to the fact that kernels run asynchronously when possible, if I'm correct. The issue is that there is no clear scheduling of kernel priority, which causes the peaks. This would not be an issue if I needed this code for an offline application rather than a real-time one.

Has anyone compared cudaThreadSynchronize with CUDA events to confirm that it's better to use CUDA events?

nosoul · June 15, 2009, 6:08pm

Careful, there is a weird bug with CUDA on some Linux distributions (openSUSE gave a friend of mine this issue): when the screen saver activated there was overheating, and although the system just shut down, the graphics card has been glitchy ever since (possibly the graphics card or the motherboard got partially fried).

But to be fair, the room had no proper ventilation, so the laptop (another factor to consider) was even hotter, and the ventilation plate was off too (so it was partially user stupidity).

nosoul · June 17, 2009, 11:23am

The problem is that cudaThreadSynchronize also degrades performance A LOT!

Sarnath · June 17, 2009, 11:33am

nosoul - Kindly note that kernel launches are asynchronous and will return immediately whether or not the kernel has completed.

By doing cudaThreadSynchronize() {CTS from now on}, you are basically waiting for things to complete before proceeding.

CTS does not degrade anything.

Note that when you keep executing CUDA calls in a thread, it is like enqueueing them to a queue (the default stream). They get finished one after another. So, if you don't do CTS and follow the kernel launch with cudaMemcpy, then the cudaMemcpy will execute only after the kernel finishes (i.e. it has the effect of CTS). FYI.
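The queueing behavior described above can be seen with a host timer. This is a rough sketch under stated assumptions: `busyKernel`, its sizes, and the helper `msSince` are hypothetical, and `cudaDeviceSynchronize` is used because it is the current replacement for the now-deprecated `cudaThreadSynchronize`.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that just burns some time on the GPU.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k) x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// Hypothetical helper: host milliseconds elapsed since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    const int n = 1 << 20, iters = 2000;  // assumed sizes
    const int blocks = (n + 255) / 256;
    float *d_data = NULL;
    float h_first = 0.0f;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch so one-time initialization cost is not measured.
    busyKernel<<<blocks, 256>>>(d_data, n, iters);
    cudaDeviceSynchronize();

    // Case A: explicit synchronization.
    auto t0 = std::chrono::steady_clock::now();
    busyKernel<<<blocks, 256>>>(d_data, n, iters);
    double launchOnly = msSince(t0);  // the launch has returned; the kernel may still be running
    cudaDeviceSynchronize();          // block until all queued GPU work is done
    double withSync = msSince(t0);

    // Case B: no explicit sync; the blocking cudaMemcpy waits for the kernel.
    t0 = std::chrono::steady_clock::now();
    busyKernel<<<blocks, 256>>>(d_data, n, iters);
    cudaMemcpy(&h_first, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    double withMemcpy = msSince(t0);

    printf("launch only:  %f ms\n", launchOnly);
    printf("with sync:    %f ms\n", withSync);
    printf("with memcpy:  %f ms\n", withMemcpy);

    cudaFree(d_data);
    return 0;
}
```

If a host timer is stopped right after the launch, it measures only the cost of enqueueing the kernel, and the remaining GPU time ends up charged to whichever later call happens to block.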

nosoul · June 21, 2009, 3:01am

I know that it has the same effect, and the same performance degradation, if you're running something that only needs to be controlled by the CPU but not stored in its memory. In fact the papers, some by NVIDIA, state that to optimize code performance, memory copies that are not necessary should be cut to the minimum (one H2D and, if necessary, one D2H).

Apparently I've run into a 300-cycle memory latency, which I am trying to hide following some papers I found.
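As an illustration of the "one H2D and, if necessary, one D2H" advice mentioned in this post, here is a small sketch that keeps the data on the device across every kernel call in the loop. The kernels `scale` and `offset`, the sizes, and the step count of 7 are assumptions chosen for the example, not the poster's actual code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the per-step work.
__global__ void scale(float *d, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

__global__ void offset(float *d, int n, float c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += c;
}

int main()
{
    const int n = 1 << 16;  // assumed problem size
    const int steps = 7;    // assumed number of loop iterations
    const size_t bytes = n * sizeof(float);
    const int blocks = (n + 255) / 256;

    std::vector<float> h(n, 1.0f);
    float *d = NULL;
    cudaMalloc(&d, bytes);

    // Single host-to-device copy before the loop.
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    // All intermediate results stay in device memory; the host only launches kernels.
    for (int s = 0; s < steps; ++s) {
        scale<<<blocks, 256>>>(d, n, 0.5f);
        offset<<<blocks, 256>>>(d, n, 1.0f);
    }

    // Single device-to-host copy at the end; it also waits for the queued kernels.
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    return 0;
}
```

Because the launches in the loop are asynchronous, the host can queue all of them without blocking, and the only forced wait is the final copy.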
