Odd Slowdown Problem: same function slows down in a loop

Hi,

I have an odd problem and it's really hurting my application's runtime.

My problem is this: I can run the exact same kernels 7 times and they are very fast, but after the seventh iteration the time taken increases tremendously.

My program loads some data into CUDA device memory. The memory is all padded so that it is 2^N in size and aligned. The program has two kernel functions which are called one after the other: the first prepares a data array (in device memory) which is then used by the second kernel. There is no thread synchronisation at all, as each thread is independent of everything else; the only thing that might overlap is memory reads. For my tests I loop the same two kernel calls for a while and record the timing (a sketch of the loop is below). There is no difference if I remove the timer calls. No memory is allocated or uploaded within the iteration, and the problem is not affected by the grid or block size (tested with {128, 128}, {256, 256}, and {512, 512}).
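For reference, here is a minimal sketch of the shape of that loop; the kernel names, sizes, and the host-side timer are placeholders, not my actual code:

    #include <cstdio>
    #include <ctime>
    #include <cuda_runtime.h>

    // Placeholder kernels: the first prepares a work array, the second consumes it.
    __global__ void prepareData(float *work, const float *in)    { /* first pass  */ }
    __global__ void computeResult(float *out, const float *work) { /* second pass */ }

    int main() {
        const int N = 1 << 20;                       // padded to a power of two
        float *d_in, *d_work, *d_out;
        cudaMalloc(&d_in,   N * sizeof(float));
        cudaMalloc(&d_work, N * sizeof(float));
        cudaMalloc(&d_out,  N * sizeof(float));
        // ... input data uploaded once here, before the loop ...

        dim3 grid(256), block(256);
        for (int i = 0; i < 20; ++i) {
            clock_t t0 = clock();
            prepareData<<<grid, block>>>(d_work, d_in);
            computeResult<<<grid, block>>>(d_out, d_work);
            clock_t t1 = clock();
            printf("iteration %d: %.3f ms\n", i,
                   1000.0 * (t1 - t0) / CLOCKS_PER_SEC);
        }
        return 0;
    }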

Here are screenshots of the timings WITHOUT any syncing:

[screenshot: per-iteration kernel timings, no synchronization]

You can see that the first kernel call is slow (as expected) and the next few are much faster; then, for no obvious reason, it becomes very slow. Each iteration gives the correct result at the end. There is no difference between the iterations and no memory is copied or allocated; each iteration does exactly the same thing. Yet it slows down. It's almost as if CUDA were forcing a thread sync between kernel calls.

Here are the timings with a sync at the end of each iteration:

[screenshot: per-iteration kernel timings with cudaThreadSynchronize() each iteration]

This is with a cudaThreadSynchronize() at the end of each iteration.

As far as I can see there isn't a reason why it should slow down. I get the correct results without a thread sync, and I really need it to be fast. Hope you can help.

In addition, if I comment out one of the kernel functions (just as a test), the slowdown does not begin until around the 14th iteration, whereas it's around the 7th with both kernels.

Best Regards,

Meirion.

Your kernel takes about 130-160 ms to complete.
Without cudaThreadSynchronize() you are not timing the kernels' running time, just their invocations: kernel calls are asynchronous. After the 7th or so call the internal launch queue is full, and the driver performs an implicit synchronization.
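To illustrate (reusing the placeholder names from the sketch above, so this is only an assumption about your loop), synchronizing before the timer is read makes the measurement cover kernel execution rather than just the launches:

    clock_t t0 = clock();
    prepareData<<<grid, block>>>(d_work, d_in);     // returns immediately, kernel is queued
    computeResult<<<grid, block>>>(d_out, d_work);  // also just queued
    cudaThreadSynchronize();                        // block until both kernels have finished
                                                    // (cudaDeviceSynchronize() in newer toolkits)
    clock_t t1 = clock();                           // now t1 - t0 includes execution time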

Hope this helps.

Ah, I see. I thought as much, but I still had it in my head that control only came back to the C code when the kernel finished. Oh well… I guess I'll have to optimise it elsewhere.

Thanks for clearing that up.

This is expected behavior. If you take a time measurement without syncing, you are taking it at the BEGINNING of the kernel execution, not the end. Calling a kernel just queues it up for execution on the device, and the CPU continues executing. If you perform a memcpy, an OpenGL interop call, or anything else that actually tries to use the data the kernel computes, it will wait for the kernel to finish first.
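If you want per-iteration timings without a full host-side sync, CUDA events record timestamps on the device itself. A rough sketch, again using placeholder kernel names:

    cudaEvent_t evStart, evStop;
    cudaEventCreate(&evStart);
    cudaEventCreate(&evStop);

    cudaEventRecord(evStart, 0);                    // queued on the same stream as the kernels
    prepareData<<<grid, block>>>(d_work, d_in);
    computeResult<<<grid, block>>>(d_out, d_work);
    cudaEventRecord(evStop, 0);
    cudaEventSynchronize(evStop);                   // wait until the stop event has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, evStart, evStop);     // elapsed GPU time between the two events
    printf("GPU time: %.3f ms\n", ms);

    cudaEventDestroy(evStart);
    cudaEventDestroy(evStop);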

The fact that the "slowdown" occurs for you after about 7 iterations (two kernel launches each, so roughly 14-16 queued launches) concurs with my own tests, which show the async launch queue depth is 16 kernels.
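The kind of test I mean is roughly this: time each launch call of a long-running dummy kernel on the host. The first launches return almost immediately, and once the queue is full a launch blocks until a slot frees up. (This is only a sketch; the kernel, cycle count, and loop length are made up for illustration, and clock64() assumes a reasonably recent GPU.)

    // Dummy kernel that busy-waits for roughly the given number of GPU clock cycles.
    __global__ void spin(long long cycles) {
        long long start = clock64();
        while (clock64() - start < cycles) { }
    }

    for (int i = 0; i < 32; ++i) {
        clock_t t0 = clock();
        spin<<<1, 1>>>(200000000LL);                // each launch runs for a while on the device
        clock_t t1 = clock();
        printf("launch %2d took %.3f ms\n", i,
               1000.0 * (t1 - t0) / CLOCKS_PER_SEC);
    }
    cudaThreadSynchronize();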

Edit: too slow :(