Strange Runtime behavior

Hi There

I'm running performance tests for CUDA kernels.

My kernel computes a 3x3 filter over an image, and to check for performance issues I take the time before the kernel is executed and after, and take the difference as the actual computation time.

This works fine

In order to get a mean computation time I run the kernel 500 times in a row,
so I compute my filtered picture 500 times.

Now if I take the time again and divide the result by 500 to get the average per kernel run, the average computation time decreases with an increasing number of kernel launches.

So if one kernel launch took 600 ms, the average run with 500 kernel launches only takes 19 ms!

I'm just wondering, how can this be?

I'm confused! Any hints?

For one execution of doCudaStuff I initialize a new array on the GPU, transfer the data from the host to the GPU, and later release the memory again: the complete setup.
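Roughly, each doCudaStuff call looks like this (a simplified sketch, not the exact code; the element type and device pointer names are placeholders):

void doCudaStuff(float *data, float *data_ref, int width, int height)
{
    size_t size = width * height * sizeof(float);                // placeholder element type
    float *d_in, *d_out;

    cudaMalloc((void **) &d_in,  size);                          // allocate on the GPU
    cudaMalloc((void **) &d_out, size);
    cudaMemcpy(d_in, data, size, cudaMemcpyHostToDevice);        // host -> device

    doMedianFilter3x3<<<height - 2, width - 2>>>(d_in, d_out, height, width);

    cudaMemcpy(data_ref, d_out, size, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_in);                                              // release everything again
    cudaFree(d_out);
}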

This is roughly my code:

unsigned int start = clock();
for (int i = 0; i < NUMBER_OF_RUNS; i++) {
    doCudaStuff(data, data_ref, width, height); // external C++ function: allocates memory, cudaMemcpys CPU -> GPU, runs the kernel, copies GPU -> CPU, frees memory
}
unsigned int end = clock();
std::cout << "Start: " << start << "\n";
std::cout << "End: " << end << "\n";
std::cout << "Time taken in millisecs: " << end - start;

double delay = (double)(end - start) / NUMBER_OF_RUNS; // this gets smaller and smaller with an increasing number of runs

Sorry, I posted it twice :(

Try removing the warm-up time and adding a cudaThreadSynchronize():

// remove warm-up time
doCudaStuff(data, data_ref, width, height);

unsigned int start = clock();
for (int i = 0; i < NUMBER_OF_RUNS; i++) {
    doCudaStuff(data, data_ref, width, height); // external C++ function, executes the kernel
}
cudaThreadSynchronize();
unsigned int end = clock();

std::cout << "Start: " << start << "\n";
std::cout << "End: " << end << "\n";
std::cout << "Time taken in millisecs: " << end - start;

What do you mean by warm-up time?

Why should I use __syncthreads() here?

I thought control would be handed back to the host once all 500 kernel runs have completed, so why use syncthreads?

Kernel execution is asynchronous. The kernel calls return as soon as they have been queued for execution, not when they have finished executing. If you want accurate timing, you need to have the host spinlock until the kernels are finished. That is what the cudaThreadSynchronize() call is for.
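A tiny sketch of what that means in practice (the kernel and variable names here are made up):

unsigned int t0 = clock();
myKernel<<<grid, block>>>(d_in, d_out);   // returns as soon as the launch has been queued
unsigned int t1 = clock();                // t1 - t0 measures only the launch overhead

cudaThreadSynchronize();                  // host blocks here until the kernel has finished
unsigned int t2 = clock();                // t2 - t0 includes the actual kernel execution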

Warm-up time is the time associated with establishing a context and initializing everything so that the host can start running kernels. You should take that time out of your timing statistics, because it only affects the initial CUDA operation, not subsequent ones.

Thank you very much, the results seem better now…
I placed the cudaThreadSynchronize() on the line beneath the kernel execution, wrapped in that for loop.

One unclear thing still remains though… In my results the time per kernel execution decreases with an increasing number of kernel launches. Wouldn't one expect an increase per execution? The more kernel launches, the more time they need to finish?

Regards, Maz

I suspect some kind of overlapping there… not sure though, as I haven't looked at your code…

Plus, when you time it next, time just the kernel only and check whether you see a change in that time, because multiple kernel launches with only one host-to-device and one device-to-host transfer will show the timing effect which you are seeing…
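In other words, put the timer around just the kernel launch, not around the alloc/copy/free (a sketch only, names made up):

// time only the kernel, excluding cudaMalloc / cudaMemcpy / cudaFree
unsigned int t0 = clock();
filterKernel<<<blocks, threads>>>(d_in, d_out);
cudaThreadSynchronize();                  // make sure the kernel has really finished
unsigned int t1 = clock();                // t1 - t0 is the kernel time only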

Thank you for your response!
This is very interesting: when trying to measure the kernel time only, I get the same result for the start and end time when using this code in my .cu file.
The calculated span = end - start is therefore 0.

I'm convolving a 1024x768 image, and the kernel does what it's intended to do, so that part works fine.

cudaMalloc((void **) &data_d, size);     // allocate array on device
cudaMalloc((void **) &data_orig, size);

// Initialize host array and copy it to the CUDA device
cudaMemcpy(data_orig, data_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(data_d, data_h, size, cudaMemcpyHostToDevice);

// Do calculation on device:
int block_size = width - 2;
int number_blocks = height - 2;

// TIMECOUNT
unsigned int start = clock();
doMedianFilter3x3<<<number_blocks, block_size>>>(data_orig, data_d, height, width);
cudaThreadSynchronize();
unsigned int end = clock();
int span = end - start;
printf("%d ", span);

EDIT: The code above is from my .cu file; let's call it launch_the_medthod_in_cu_File(…). It is declared extern "C" and launched from a C++ main method.

The code fragment in that main is:

unsigned int start = clock();
launch_the_medthod_in_cu_File(...);
unsigned int end = clock();
int span = end - start;
printf("%d ", span);

Now start and end are different, but end is the same time as both start and end inside launch_the_medthod_in_cu_File(…)! That's why int span = end - start; gives 0 as a result.

I don't quite understand why. Do you, maybe?
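One thing I might try (just a guess on my part): clock() here seems to give only millisecond resolution (the prints treat the ticks as milliseconds), so a single fast kernel could simply round down to 0. A sketch of timing with CUDA events instead, which the runtime provides for fine-grained GPU timing:

// Sketch: CUDA events give sub-millisecond resolution, unlike clock()
cudaEvent_t ev_start, ev_stop;
cudaEventCreate(&ev_start);
cudaEventCreate(&ev_stop);

cudaEventRecord(ev_start, 0);
doMedianFilter3x3<<<number_blocks, block_size>>>(data_orig, data_d, height, width);
cudaEventRecord(ev_stop, 0);
cudaEventSynchronize(ev_stop);                 // wait until the stop event has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, ev_start, ev_stop);  // elapsed time in milliseconds (float)
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(ev_start);
cudaEventDestroy(ev_stop);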

Thx


Regards,
Maz