Hi there,
I'm running performance tests on CUDA kernels.
My kernel computes a 3x3 filter over an image. To check for performance issues, I take the time before the kernel is executed and after, and use the difference as the actual computation time.
This works fine.
To get a mean computation time, I run the kernel 500 times in a row, so I compute my filtered picture 500 times.
Now if I take the time again and divide the result by 500 to get the average per kernel run, the average computation time decreases as the number of kernel launches increases:
if one kernel launch takes 600 ms on its own, the average over 500 kernel launches is only 19 ms!
I'm just wondering, how can this be?
I'm confused! Any hints?
For one execution of doCudaStuff I initialize a new array on the GPU, transfer the data from the PC to the GPU, and later release the memory again: the complete setup.
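Roughly, doCudaStuff looks like this (a sketch assuming float pixels and a placeholder kernel name filter3x3; error checking left out):

#include <cuda_runtime.h>

// placeholder declaration for my 3x3 filter kernel
__global__ void filter3x3(const float* in, float* out, int width, int height);

void doCudaStuff(const float* data, float* data_ref, int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(float);
    float* d_in;
    float* d_out;

    // allocate device memory and copy the input image CPU -> GPU
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, data, bytes, cudaMemcpyHostToDevice);

    // launch the 3x3 filter kernel (the launch itself is asynchronous)
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    filter3x3<<<grid, block>>>(d_in, d_out, width, height);

    // copy the result GPU -> CPU; this cudaMemcpy waits for the kernel to finish
    cudaMemcpy(data_ref, d_out, bytes, cudaMemcpyDeviceToHost);

    // free the device memory again
    cudaFree(d_in);
    cudaFree(d_out);
}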
This is roughly my timing code:
#include <ctime>
#include <iostream>

clock_t start = clock();
for (int i = 0; i < NUMBER_OF_RUNS; i++) {
    // external C++ function: allocates device memory, cudaMemcpys
    // CPU -> GPU, runs the kernel, cudaMemcpys GPU -> CPU, frees the memory
    doCudaStuff(data, data_ref, width, height);
}
clock_t end = clock();

std::cout << "Start: " << start << "\n";
std::cout << "End: " << end << "\n";
std::cout << "Time taken in millisecs: "
          << 1000.0 * (end - start) / CLOCKS_PER_SEC << "\n";

// this gets smaller and smaller with an increasing number of runs
double delay = 1000.0 * (end - start) / CLOCKS_PER_SEC / NUMBER_OF_RUNS;
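Should I be timing with CUDA events instead? For reference, here is a minimal sketch of the same loop with event-based timing (the cudaEvent* functions are standard CUDA runtime calls; the variable names reuse the snippet above):

cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

cudaEventRecord(evStart, 0);
for (int i = 0; i < NUMBER_OF_RUNS; i++) {
    doCudaStuff(data, data_ref, width, height);
}
cudaEventRecord(evStop, 0);
cudaEventSynchronize(evStop);  // wait until all GPU work up to evStop is done

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, evStart, evStop);  // elapsed time between the two events, in ms
std::cout << "Average per run: " << elapsedMs / NUMBER_OF_RUNS << " ms\n";

cudaEventDestroy(evStart);
cudaEventDestroy(evStop);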