Hi,
I have an odd problem and it's really hurting my application's run time.
So my problem is this: I can run the exact same kernels 7 times and they are very fast. But after the seventh iteration, the time taken increases tremendously.
My program loads some data into CUDA. The memory is all padded so that it is 2^N in size and aligned. My program has two kernel functions which are called one after the other: the first prepares a data array (in device memory), which is then used by the second kernel. There is no thread sync at all, as each thread is independent of everything else; the only thing that might overlap is memory reads. For my tests I loop the same two kernel calls for a while and record the timing. There is no difference if I remove the timer calls. No memory is allocated or uploaded within the iterations. The problem is not affected by the thread or grid dimensions (tested with {128, 128}, {256, 256}, and {512, 512}).
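Roughly, the test loop looks like this (a simplified sketch; the kernel names, sizes, and timer are placeholders, not my actual code):

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Placeholder kernels standing in for the real ones.
__global__ void prepareData(float *scratch) { /* fills scratch */ }
__global__ void computeResult(const float *scratch, float *out) { /* reads scratch */ }

int main() {
    // All device memory is allocated and filled BEFORE the loop starts.
    float *scratch = nullptr, *out = nullptr;
    cudaMalloc(&scratch, 512 * 512 * sizeof(float));
    cudaMalloc(&out, 512 * 512 * sizeof(float));

    dim3 grid(512), block(512);
    for (int i = 0; i < 20; ++i) {
        clock_t t0 = clock();
        prepareData<<<grid, block>>>(scratch);          // kernel 1
        computeResult<<<grid, block>>>(scratch, out);   // kernel 2
        // cudaThreadSynchronize();  // only present in the "with sync" runs below
        printf("iteration %d: %ld ticks\n", i, (long)(clock() - t0));
    }
    cudaFree(scratch);
    cudaFree(out);
    return 0;
}
```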
Here are screenshots of the timings WITHOUT any syncing:
You can see that the first kernel call is slow (as expected) and the ones immediately after are faster. Then, for some unknown reason, it becomes very slow. Each of these runs gives the correct result at the end. There is no difference between the iterations, and no memory is copied or allocated; each iteration does exactly the same thing. Yet it slows down. It's almost as if CUDA is forcing thread sync between kernel calls.
Here are the timings with a sync at the end of each iteration, i.e. a cudaThreadSynchronize() after the second kernel call:
As far as I can see there isn't a reason why it should slow down. I get the correct results without thread sync, and I really need it to be fast. Hope you can help.
In addition, if I comment out one of the kernel functions (just as a test), the slowdown does not begin until around the 14th iteration, whereas it begins around the 7th with both kernels.
Best Regards,
Meirion.