Strange Runtime behavior

Hi There

I'm running performance tests for CUDA kernels.

My kernel computes a 3x3 filter over an image, and to check for performance issues I take the time before the kernel is executed and after, and take the difference as the actual computation time.

This works fine

In order to get a mean computation time I run the kernel 500 times in a row,
so I compute my filtered picture 500 times.

Now if I take the time again and divide the result by 500 to get the average per kernel run, the average computation time decreases with an increasing number of kernel launches.

So if one kernel launch took 600 ms, the average run with 500 kernel launches only takes 19 ms!

I'm just wondering, how can this be?

I'm confused! Any hints?

For one execution of doCudaStuff I initialize a new array on the GPU, transfer the data from the host to the GPU, and later release the memory again: the complete setup.
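Roughly, each doCudaStuff call looks like this (a simplified sketch, not the exact code; the element type and device pointer names are placeholders):

void doCudaStuff(float *data, float *data_ref, int width, int height)
{
    size_t size = width * height * sizeof(float);                // placeholder element type
    float *d_in, *d_out;

    cudaMalloc((void **) &d_in,  size);                          // allocate on the GPU
    cudaMalloc((void **) &d_out, size);
    cudaMemcpy(d_in, data, size, cudaMemcpyHostToDevice);        // host -> device

    doMedianFilter3x3<<<height - 2, width - 2>>>(d_in, d_out, height, width);

    cudaMemcpy(data_ref, d_out, size, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_in);                                              // release everything again
    cudaFree(d_out);
}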

This is roughly my code:

unsigned int start = clock();
for (int i = 0; i < NUMBER_OF_RUNS; i++) {
    doCudaStuff(data, data_ref, width, height); // external C++ function: allocates memory, cudaMemcpys CPU -> GPU, runs the kernel, copies GPU -> CPU, frees memory
}
unsigned int end = clock();
std::cout << "Start: " << start << "\n";
std::cout << "End: " << end << "\n";
std::cout << "Time taken in millisecs: " << end - start;

double delay = (double)(end - start) / NUMBER_OF_RUNS; // this gets smaller and smaller with an increasing number of runs

Sorry, I posted it twice :(

Try removing the warm-up time and adding a cudaThreadSynchronize():

// remove warm-up time
doCudaStuff(data, data_ref, width, height);

unsigned int start = clock();
for (int i = 0; i < NUMBER_OF_RUNS; i++) {
    doCudaStuff(data, data_ref, width, height); // external C++ function, executes the kernel
}
cudaThreadSynchronize();
unsigned int end = clock();

std::cout << "Start: " << start << "\n";
std::cout << "End: " << end << "\n";
std::cout << "Time taken in millisecs: " << end - start;

What do you mean by warm-up time?

Why should I use __syncthreads() here?

I thought control would be handed back to the host once all 500 kernel runs have completed, so why use syncthreads?

Kernel execution is asynchronous. The kernel calls return as soon as they have been queued for execution, not when they have finished executing. If you want accurate timing, you need to have the host spinlock until the kernels are finished. That is what the cudaThreadSynchronize() call is for.
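A tiny sketch of what that means in practice (the kernel and variable names here are made up):

unsigned int t0 = clock();
myKernel<<<grid, block>>>(d_in, d_out);   // returns as soon as the launch has been queued
unsigned int t1 = clock();                // t1 - t0 measures only the launch overhead

cudaThreadSynchronize();                  // host blocks here until the kernel has finished
unsigned int t2 = clock();                // t2 - t0 includes the actual kernel execution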

Warm-up time is the time associated with establishing a context and initializing everything so that the host can start running kernels. You should take that time out of your timing statistics, because it only affects the initial CUDA operation, not subsequent ones.

Thank you very much, the results seem better now…
I placed the cudaThreadSynchronize() on the line beneath the kernel execution, wrapped in that for loop.

One unclear thing still remains though… In my results the time per kernel execution decreases with an increasing number of kernel launches. Wouldn't one expect an increase per execution? The more kernel launches, the more time they need to finish?

Regards, Maz

I suspect some kind of overlapping there… not sure though, as I haven't looked at your code…

Plus, when you time it next, time just the kernel only and check whether you see a change in that time, because multiple kernel launches with only one host-to-device and one device-to-host transfer will show the timing effect which you are seeing…
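In other words, put the timer around just the kernel launch, not around the alloc/copy/free (a sketch only, names made up):

// time only the kernel, excluding cudaMalloc / cudaMemcpy / cudaFree
unsigned int t0 = clock();
filterKernel<<<blocks, threads>>>(d_in, d_out);
cudaThreadSynchronize();                  // make sure the kernel has really finished
unsigned int t1 = clock();                // t1 - t0 is the kernel time only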

Thank you for your response!
This is very interesting: when trying to measure the kernel time only, I get the same result for the start and end time when using this code in my .cu file.
The calculated span = end - start is therefore 0.

I'm convolving a 1024x768 image, and the kernel does what it's intended to do, so that part works fine.

cudaMalloc((void **) &data_d, size);     // allocate array on device
cudaMalloc((void **) &data_orig, size);

// Initialize host array and copy it to the CUDA device
cudaMemcpy(data_orig, data_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(data_d, data_h, size, cudaMemcpyHostToDevice);

// Do calculation on device:
int block_size = width - 2;
int number_blocks = height - 2;

// TIMECOUNT
unsigned int start = clock();
doMedianFilter3x3<<<number_blocks, block_size>>>(data_orig, data_d, height, width);
cudaThreadSynchronize();
unsigned int end = clock();
int span = end - start;
printf("%d ", span);

EDIT: The code above is from my .cu file; let's call it launch_the_medthod_in_cu_File(…). It is declared extern "C" and launched from a C++ main method.

The code fragment in that main is:

unsigned int start = clock();
launch_the_medthod_in_cu_File(...);
unsigned int end = clock();
int span = end - start;
printf("%d ", span);

Now start and end are different, but end is the same time as both start and end inside launch_the_medthod_in_cu_File(…)! That's why int span = end - start; gives 0 as a result.

I don't quite understand why. Do you, maybe?
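One thing I might try (just a guess on my part): clock() here seems to give only millisecond resolution (the prints treat the ticks as milliseconds), so a single fast kernel could simply round down to 0. A sketch of timing with CUDA events instead, which the runtime provides for fine-grained GPU timing:

// Sketch: CUDA events give sub-millisecond resolution, unlike clock()
cudaEvent_t ev_start, ev_stop;
cudaEventCreate(&ev_start);
cudaEventCreate(&ev_stop);

cudaEventRecord(ev_start, 0);
doMedianFilter3x3<<<number_blocks, block_size>>>(data_orig, data_d, height, width);
cudaEventRecord(ev_stop, 0);
cudaEventSynchronize(ev_stop);                 // wait until the stop event has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, ev_start, ev_stop);  // elapsed time in milliseconds (float)
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(ev_start);
cudaEventDestroy(ev_stop);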

Thx


Regards,
Maz