faster at small loop counts, slower at large loop counts

(CUDA C newbie here…)

So, my code is actually a kernel, which uses two other device functions, launched in a for loop that contains only the kernel call and a memory transfer (roughly like the sketch below). When I run this for 500 loops, it is faster than the conventional CPU code, but when I run it for 50000 loops, it is slower.
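To make the structure concrete, here is a simplified sketch of the host side; names like myKernel, d_data and N are placeholders, not my real code:

#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)   // stand-in for my real kernel (which calls two __device__ helpers)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                       // placeholder work
}

int main()
{
    const int N = 1024;
    const int nLoops = 500;                    // 500 or 50000
    float h_data[N] = { 0 };
    float *d_data = 0;

    cudaMalloc((void**)&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    for (int loop = 0; loop < nLoops; ++loop)
    {
        myKernel<<<(N + 255) / 256, 256>>>(d_data, N);
        cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);   // the memory transfer inside the loop
    }

    cudaFree(d_data);
    return 0;
}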

I am using this CPU timer:

#include <time.h>
#include <iostream>
using std::cout;

clock_t start = clock();

...   // the loop being timed goes here

cout << "elapsed time: " << ( (double)clock() - start ) / CLOCKS_PER_SEC << " s\n";

I don’t know how accurate it is (and I don’t really care, since I only want it for comparisons), but I have checked it against a stopwatch and it is acceptable over a two-minute interval. I can’t verify the under-a-second timings, though.
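If the clock() numbers turn out to be too coarse, I assume I could switch to CUDA event timing, something like this sketch (startEvt and stopEvt are just placeholder names, and the fragment assumes cuda_runtime.h and cstdio are included):

cudaEvent_t startEvt, stopEvt;
cudaEventCreate(&startEvt);
cudaEventCreate(&stopEvt);

cudaEventRecord(startEvt, 0);
...   // the kernel/memcpy loop being timed
cudaEventRecord(stopEvt, 0);
cudaEventSynchronize(stopEvt);        // wait until the stop event has actually completed

float ms = 0.0f;
cudaEventElapsedTime(&ms, startEvt, stopEvt);
printf("elapsed time: %f ms\n", ms);

cudaEventDestroy(startEvt);
cudaEventDestroy(stopEvt);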

For 500 loops the timing results (in seconds) are:

CPU: 0.33    GPU: 0.01

For 50000 loops I get:

CPU: 27.43    GPU: 96.82

I thought it might be a memory leak slowing things down, but I can’t find anywhere in the programming guide any code that frees the memory of the variables allocated inside the global or the device functions (see the sketch below for what I mean).
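To be clear about the kind of allocation I’m talking about, here is an illustrative device function; helper is a placeholder name, not my real code:

__device__ float helper(float x)        // one of the two device functions the kernel calls
{
    float tmp[4];                       // local array declared inside the device function
    for (int i = 0; i < 4; ++i)
        tmp[i] = x * (float)i;
    return tmp[0] + tmp[3];             // tmp is never explicitly freed anywhere
}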

Any advice? What should I be looking for?

Thanks in advance

I know that slow execution times are a common issue that can usually be dealt with through better memory management, using the occupancy calculator to improve the kernel’s launch configuration, etc. The odd thing in my case is that when the application is run for 500 loops it is much faster, as it is supposed to be…

(this is a bump to attract some attention… thanks for reading, I hope you answer too :-) )