Odd Slowdown Problem: same function slows down in a loop

Hi,

I have an odd problem and it's really hurting my application's runtime.

My problem is this: I can run the exact same kernels 7 times and they are very fast, but after the seventh iteration the time taken increases tremendously.

My program loads some data into CUDA device memory. The memory is all padded so that it is 2^N in size and aligned. The program has two kernel functions which are called one after the other: the first prepares a data array (in device memory) which is then used by the second kernel. There is no thread synchronisation at all, as each thread is independent of everything else; the only thing that might overlap is memory reads. For my tests I loop the same two kernel calls for a while and record the timing (a sketch of the loop is below). There is no difference if I remove the timer calls. No memory is allocated or uploaded within the iteration, and the problem is not affected by the grid or block size (tested with {128, 128}, {256, 256}, and {512, 512}).
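For reference, here is a minimal sketch of the shape of that loop; the kernel names, sizes, and the host-side timer are placeholders, not my actual code:

    #include <cstdio>
    #include <ctime>
    #include <cuda_runtime.h>

    // Placeholder kernels: the first prepares a work array, the second consumes it.
    __global__ void prepareData(float *work, const float *in)    { /* first pass  */ }
    __global__ void computeResult(float *out, const float *work) { /* second pass */ }

    int main() {
        const int N = 1 << 20;                       // padded to a power of two
        float *d_in, *d_work, *d_out;
        cudaMalloc(&d_in,   N * sizeof(float));
        cudaMalloc(&d_work, N * sizeof(float));
        cudaMalloc(&d_out,  N * sizeof(float));
        // ... input data uploaded once here, before the loop ...

        dim3 grid(256), block(256);
        for (int i = 0; i < 20; ++i) {
            clock_t t0 = clock();
            prepareData<<<grid, block>>>(d_work, d_in);
            computeResult<<<grid, block>>>(d_out, d_work);
            clock_t t1 = clock();
            printf("iteration %d: %.3f ms\n", i,
                   1000.0 * (t1 - t0) / CLOCKS_PER_SEC);
        }
        return 0;
    }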

Here are screenshots of the timings WITHOUT any syncing:

[screenshot: per-iteration kernel timings, no synchronization]

You can see that the first kernel call is slow (as expected) and the next few are much faster; then, for no obvious reason, it becomes very slow. Each iteration gives the correct result at the end. There is no difference between the iterations and no memory is copied or allocated; each iteration does exactly the same thing. Yet it slows down. It's almost as if CUDA were forcing a thread sync between kernel calls.

Here are the timings with a sync at the end of each iteration:

[screenshot: per-iteration kernel timings with cudaThreadSynchronize() each iteration]

This is with a cudaThreadSynchronize() at the end of each iteration.

As far as I can see there isn't a reason why it should slow down. I get the correct results without a thread sync, and I really need it to be fast. Hope you can help.

In addition, if I comment out one of the kernel functions (just as a test), the slowdown does not begin until around the 14th iteration, whereas it's around the 7th with both kernels.

Best Regards,

Meirion.

Your kernel takes about 130-160 ms to complete.
Without cudaThreadSynchronize() you are not timing the kernels' running time, just their invocations: kernel calls are asynchronous. After the 7th or so call the internal launch queue is full, and the driver performs an implicit synchronization.
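To illustrate (reusing the placeholder names from the sketch above, so this is only an assumption about your loop), synchronizing before the timer is read makes the measurement cover kernel execution rather than just the launches:

    clock_t t0 = clock();
    prepareData<<<grid, block>>>(d_work, d_in);     // returns immediately, kernel is queued
    computeResult<<<grid, block>>>(d_out, d_work);  // also just queued
    cudaThreadSynchronize();                        // block until both kernels have finished
                                                    // (cudaDeviceSynchronize() in newer toolkits)
    clock_t t1 = clock();                           // now t1 - t0 includes execution time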

Hope this helps.

Ah, I see. I thought as much, but I still had it in my head that control only came back to the C code when the kernel finished. Oh well… I guess I'll have to optimise it elsewhere.

Thanks for clearing that up.

This is expected behavior. If you take a time measurement without syncing, you are taking it at the BEGINNING of the kernel execution, not the end. Calling a kernel just queues it up for execution on the device, and the CPU continues executing. If you perform a memcpy, an OpenGL interop call, or anything else that actually tries to use the data the kernel computes, it will wait for the kernel to finish first.
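If you want per-iteration timings without a full host-side sync, CUDA events record timestamps on the device itself. A rough sketch, again using placeholder kernel names:

    cudaEvent_t evStart, evStop;
    cudaEventCreate(&evStart);
    cudaEventCreate(&evStop);

    cudaEventRecord(evStart, 0);                    // queued on the same stream as the kernels
    prepareData<<<grid, block>>>(d_work, d_in);
    computeResult<<<grid, block>>>(d_out, d_work);
    cudaEventRecord(evStop, 0);
    cudaEventSynchronize(evStop);                   // wait until the stop event has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, evStart, evStop);     // elapsed GPU time between the two events
    printf("GPU time: %.3f ms\n", ms);

    cudaEventDestroy(evStart);
    cudaEventDestroy(evStop);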

The fact that the "slowdown" occurs for you after about 7 iterations (two kernel launches each, so roughly 14-16 queued launches) concurs with my own tests, which show the async launch queue depth is 16 kernels.
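The kind of test I mean is roughly this: time each launch call of a long-running dummy kernel on the host. The first launches return almost immediately, and once the queue is full a launch blocks until a slot frees up. (This is only a sketch; the kernel, cycle count, and loop length are made up for illustration, and clock64() assumes a reasonably recent GPU.)

    // Dummy kernel that busy-waits for roughly the given number of GPU clock cycles.
    __global__ void spin(long long cycles) {
        long long start = clock64();
        while (clock64() - start < cycles) { }
    }

    for (int i = 0; i < 32; ++i) {
        clock_t t0 = clock();
        spin<<<1, 1>>>(200000000LL);                // each launch runs for a while on the device
        clock_t t1 = clock();
        printf("launch %2d took %.3f ms\n", i,
               1000.0 * (t1 - t0) / CLOCKS_PER_SEC);
    }
    cudaThreadSynchronize();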

Edit: too slow :(