Oscillating performance: total run time of the code varies

I wrote some code which most of the time takes 2 ms, but occasionally it takes around 40 ms. The code is something like:


8 different memsets

kernel 1

kernel 2

for {

kernel 3

kernel 4

kernel 5

kernel 6


This happens after a few runs of the function. Each kernel or memset call on its own (with everything else commented out), whether inside or outside the for loop, shows this behavior. I'm running this function 5 times, each for 7 cycles, and around 200 times overall.

I ruled out floating-point issues, since even the memsets show this behavior. Any suggestion on what the problem may be?

How do you measure the run times? Do you remember that kernel launches are asynchronous and get implicitly synced at memcpys/memsets?

To get the time I do:

// create and start timer

	unsigned int timer = 0;




	// stop and destroy timer


	printf("Processing time: %f (ms) \n", cutGetTimerValue(timer));


And I do this for n steps. For example:

7 steps, all inside the for loop, repeated 5 times (everything). This doesn't create any issue for five times, but if I call func 5 times, 20 times over (20 calls to 5 func calls = 20 * 5 calls, each running the 7 steps), I occasionally get a variance in the run time of func (2 to 40 ms). It hits one of the 5 calls, apparently with no specific pattern: one call takes 40 ms and all the others take 2 ms, and it is never the same call as the previous problematic 40 ms one.

  1. CUDA is well known to have erratic performance spikes when saturating the GPU with kernel calls (I've brought this issue up multiple times with multiple theories; no one from NVIDIA has yet explained what the cause might be, or suggested a way to better profile my code so that I can understand the cause).

  2. I can hear tmurray coming now to harass you about using cutil.

Time for my favorite thing–blaming cutil! CPU timers are not very good indicators of what the GPU is doing, especially when you have GPU timers. The correct way to time kernels is with events, plus this also removes the need for cudaThreadSynchronize.

I’m willing to guess you probably forgot a cudaThreadSynchronize OR the cutil timers are not high precision for whatever reason.
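
For reference, event-based timing can look roughly like the sketch below. The kernel `scale` and the helper `time_scale_ms` are made-up names for illustration, not anything from the original code; only the `cudaEvent*` calls are the real runtime API. The events are recorded in the same stream as the kernel, so the elapsed time covers the GPU work itself, and `cudaEventSynchronize` on the stop event is the only wait needed.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the kernels discussed in this thread.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the GPU time in milliseconds for one launch of scale(),
// or -1.0f if a CUDA call fails. Assumes n > 0.
float time_scale_ms(int n)
{
    float *d_data = NULL;
    cudaEvent_t start, stop;
    float ms = -1.0f;

    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess)
        return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);   // start marker, queued in the default stream
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaEventRecord(stop, 0);    // stop marker, queued after the kernel
    cudaEventSynchronize(stop);  // host waits until the stop marker is reached

    // Elapsed time between the two markers, as measured on the GPU.
    if (cudaGetLastError() != cudaSuccess ||
        cudaEventElapsedTime(&ms, start, stop) != cudaSuccess)
        ms = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}
```

Calling `time_scale_ms(1 << 20)` in a loop and printing each result would show whether the spikes are in the GPU work or only in the host-side measurement.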

A lesson from an old thread:

  1. Sometimes, when the screen saver kicks in, the GPU clocks are brought down, and that can increase your turnaround time.

Also, some power management feature (which screen savers use, I guess) can bring down GPU clocks.

:) I liked the cutil suggestion (I'm already using it). About using the CUDA libraries: I didn't use CUBLAS because many of its functions forced me to make more calls than programming the kernels myself.

The suggestion to replace cudaThreadSynchronize is actually very interesting! Going to try that and see what happens.

cudaThreadSynchronize solves the problem but kills much of the performance; that's my issue now. It's a kernel-call bottleneck, I'm sure of that by now. It is related to the fact that the kernels run asynchronously where possible, if I'm correct; the issue is that there is no clear ordering of kernel priority, which causes the peaks. This would be no issue if I needed this code for an offline application, but I need it for a real-time one.

Has anyone compared cudaThreadSynchronize with CUDA events, to say whether it's better to use CUDA events?

Careful, there is a weird bug in CUDA on some Linux distributions (openSUSE gave a friend of mine this issue): when the screen saver activated there was overheating, and although the system only shut down, the graphics card has been glitchy ever since (possibly the graphics card or the motherboard got partially fried).

But to be fair, the room had no proper ventilation, so the laptop (another thing to consider) was even hotter, and the ventilation plate was off too (so it was partially user stupidity).

The problem is that cudaThreadSynchronize also degrades performance A LOT!

nosoul - Kindly note that kernel launches are asynchronous and will return immediately whether or not the kernel has completed.

By doing cudaThreadSynchronize() {CTS from now on}, you are basically waiting for things to complete before proceeding.

CTS does not degrade anything.

Note that when you keep executing CUDA calls in a thread, it is like enqueuing them to a queue (the default stream). They get finished one after another. So, if you don't do CTS and follow the kernel launch with cudaMemcpy, then the cudaMemcpy will execute only after the kernel finishes (i.e. it has the effect of CTS). FYI.
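
A small sketch of that queue behavior, using made-up names (`fill`, `kernel_then_memcpy_ok`): the kernel launch returns to the host right away, and the blocking `cudaMemcpy` that follows in the default stream does not start until the kernel is done, so the copied data is complete even without an explicit sync.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: writes the same value into every element.
__global__ void fill(int *out, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Launches fill() and copies the result back with no explicit
// synchronization in between. Returns 1 if every element holds
// the expected value, 0 otherwise. Assumes n > 0.
int kernel_then_memcpy_ok(int n, int value)
{
    int *d_out = NULL;
    int *h_out = (int *)malloc(n * sizeof(int));
    int ok = 1;

    if (h_out == NULL) return 0;
    if (cudaMalloc((void **)&d_out, n * sizeof(int)) != cudaSuccess) {
        free(h_out);
        return 0;
    }

    fill<<<(n + 255) / 256, 256>>>(d_out, n, value);  // returns immediately

    // Blocking copy in the default stream: it runs only after fill() has
    // finished, so it acts as the synchronization point.
    if (cudaMemcpy(h_out, d_out, n * sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        ok = 0;

    for (int i = 0; ok && i < n; ++i)
        if (h_out[i] != value) ok = 0;

    cudaFree(d_out);
    free(h_out);
    return ok;
}
```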

I know that it has the same effect, and the same performance degradation, if you're running something that only needs to be controlled by the CPU but not stored in its memory. In fact, the papers, some of them by NVIDIA, state that to optimize code performance, unnecessary memory copies should be reduced as far as possible (one H2D and, if necessary, one D2H).

Apparently I got myself into a 300-cycle memory latency, which I am trying to hide following some papers I found.
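
To make the "one H2D, one D2H" pattern concrete, here is a minimal sketch assuming the data can stay on the device between steps. The kernel `add_one` and the helper `run_steps` are invented for illustration and are not the actual kernels from this thread. The launches are queued back to back with no host synchronization in the loop, and the single download at the end is what waits for all of them.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-step kernel: adds 1 to every element.
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Uploads once, runs `steps` launches on device-resident data, downloads
// once. Assumes n > 0. Returns an empty vector if a CUDA call fails.
std::vector<float> run_steps(int n, int steps)
{
    std::vector<float> host(n, 0.0f);
    float *d = NULL;
    size_t bytes = n * sizeof(float);

    if (cudaMalloc((void **)&d, bytes) != cudaSuccess)
        return std::vector<float>();

    cudaMemcpy(d, &host[0], bytes, cudaMemcpyHostToDevice);  // the only H2D copy

    for (int s = 0; s < steps; ++s)
        add_one<<<(n + 255) / 256, 256>>>(d, n);  // queued, no host sync here

    // The only D2H copy; it also waits for all queued kernels to finish.
    cudaError_t err = cudaMemcpy(&host[0], d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);

    if (err != cudaSuccess) return std::vector<float>();
    return host;
}
```

With `run_steps(1024, 7)` every element comes back as 7.0f, one increment per step, with only two transfers in total.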