(CUDA C newbie here...)
So, my code is a single kernel that uses two other device functions. It runs inside a for loop that contains only this kernel and a memory transfer. When I execute it for 500 loops, it runs faster than the conventional CPU code, but when I execute it for 50000 loops, it is slower.
I am using this CPU timer:
#include <time.h>
#include <iostream>
using namespace std;

clock_t start = clock();
...
...
cout << "elapsed time: " << (double)(clock() - start) / CLOCKS_PER_SEC << " s" << endl;
I don’t know how accurate this timer is, and I don’t really care since I only want it for comparisons, but I’ve checked it against a stopwatch and it is acceptable for a two-minute interval. I can’t measure sub-second timings with it, though.
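For the sub-second case, a wall-clock timer with finer resolution might do better than clock(). What follows is only a minimal sketch, assuming a C++11 compiler; elapsed_seconds, dummy_work and report are hypothetical names, and the dummy loop merely stands in for the real CPU or GPU loop. For GPU code, keep in mind that kernel launches are asynchronous, so the device would have to be synchronized before the timer is stopped (not shown here).

```cpp
#include <chrono>
#include <iostream>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in seconds, measured with a monotonic clock.
template <typename Work>
double elapsed_seconds(Work work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Placeholder workload (an assumption) standing in for the real loop.
double dummy_work(int loops) {
    volatile double sum = 0.0;  // volatile so the loop is not optimized away
    for (int i = 0; i < loops; ++i) sum = sum + i * 0.5;
    return sum;
}

// Example use: times the placeholder workload and prints milliseconds.
void report(int loops) {
    double s = elapsed_seconds([loops] { dummy_work(loops); });
    std::cout << "elapsed time: " << s * 1e3 << " ms\n";
}
```

Calling report(500) and report(50000) would then print one timing per loop count, which should make the short runs comparable too.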
For 500 loops the timing results (in seconds) are:
CPU: 0.33 GPU: 0.01
For 50000 loops I get:
CPU: 27.43 GPU: 96.82
I thought it might be a memory leak slowing things down, but I can’t find anywhere in the programming guide any code that frees the memory of variables allocated in the global or device functions.
Any advice? What should I look for?
Thanks in advance.