Kernel execution overhead

Hi all!

I’m developing a CUDA-based project, and I’m having some problems with timings and overheads. Let me explain.

I have a loop in host code that executes several kernels. I wanted to check how big the overhead of launching some of these kernels is, so I wrote something like this:

(this is only a part of the loop)

[codebox]// execute kernels
cutilCheckError( cutResetTimer( timer1 ) );
cutilCheckError( cutResetTimer( timer2 ) );
cutilCheckError( cutStartTimer( timer1 ) );
cutilCheckError( cutStartTimer( timer2 ) );

// 1. GAMMA -> result in d_image
GammaNormalize<<<gammaDimGrid, gammaDimBlock, 0, stream[0]>>>(d_rawImage[currBuff], d_image, currentImgSize);
cutilCheckMsg("GammaNormalize() execution failed\n");

// 2. CONVOLUTION (row and column)
ConvolutionRow<<<convolBlockGridRows, convolThreadBlockRows, 0, stream[0]>>>(d_dxImage, d_image, currentImgWidth, currentImgHeight);
cutilCheckMsg("convolutionRow() execution failed\n");

ConvolutionColumn<<<convolBlockGridColumns, convolThreadBlockColumns, 0, stream[0]>>>(d_dyImage, d_image,
                                    currentImgWidth,
                                    currentImgHeight,
                                    CONVOL_COLUMN_TILE_W * convolThreadBlockColumns.y,
                                    currentImgWidth * convolThreadBlockColumns.y);
cutilCheckMsg("convolutionColumn() execution failed\n");

// 3. GRADIENTS
CreateGradients<<<gradDimGrid, gradDimBlock, 0, stream[0]>>>(d_dxImage, d_dyImage, d_gradImage, currentImgSize);
cutilCheckMsg("createGradients() execution failed\n");

cutilCheckError( cutStopTimer( timer1 ) );
cutilSafeCall( cudaStreamSynchronize(stream[0]) );
cutilCheckError( cutStopTimer( timer2 ) );

printf("timer1: %f\ttimer2: %f\n", cutGetTimerValue(timer1), cutGetTimerValue(timer2));[/codebox]

In each iteration of the loop, the data the kernels have to process gets smaller and smaller, so the kernels should finish faster.

The results I’m getting are very strange though - with each iteration both timer values decrease, and they stay almost identical! Here’s a sample:

timer1: 0.824346 timer2: 0.826464
timer1: 0.795566 timer2: 0.796336
timer1: 0.722341 timer2: 0.723090
timer1: 0.671594 timer2: 0.672464
timer1: 0.635654 timer2: 0.636434
timer1: 0.579669 timer2: 0.580519
timer1: 0.507759 timer2: 0.508539
timer1: 0.457830 timer2: 0.458486
timer1: 0.436223 timer2: 0.436842
timer1: 0.414291 timer2: 0.415033
timer1: 0.413574 timer2: 0.414233
timer1: 0.375591 timer2: 0.376296
timer1: 0.408135 timer2: 0.408880
timer1: 0.343531 timer2: 0.344303
timer1: 0.325699 timer2: 0.326291
timer1: 0.301697 timer2: 0.302424
timer1: 0.322406 timer2: 0.323133
timer1: 0.289581 timer2: 0.290446
timer1: 0.260193 timer2: 0.260937
timer1: 0.270434 timer2: 0.271263
timer1: 0.231614 timer2: 0.232366
timer1: 0.213887 timer2: 0.214629
timer1: 0.208784 timer2: 0.209609
timer1: 0.190950 timer2: 0.191697
timer1: 0.209802 timer2: 0.210589
timer1: 0.199977 timer2: 0.200807

If I understand this correctly, kernel launches are asynchronous, so timer1 should measure only the launch overhead (which should be about the same each time, right?), while timer2, which is stopped after cudaStreamSynchronize, should measure the actual execution time.

Could anyone explain that to me, please? :)

edit: I tried it in both debug and release modes… same effect.

Can you average over a number of calls? The library on which cutil ultimately depends is not really accurate to the microsecond level. In my tests, I’ve found that the latency for a PCIe copy is on the order of 10 microseconds (on my motherboard), and other posters have found a similar latency for kernel launches.
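A minimal sketch of how you could average the launch overhead over many calls, using CUDA events instead of the cutil host timers (assumptions: an empty placeholder kernel named DummyKernel and the default stream; substitute your own kernel and stream):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel with no work, so only launch overhead is measured.
__global__ void DummyKernel() {}

int main()
{
    const int N = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up so first-launch initialization cost isn't counted.
    DummyKernel<<<1, 1>>>();
    cudaThreadSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < N; ++i)
        DummyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average time per launch: %f us\n", ms * 1000.0f / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Events are timestamped on the GPU, so the measurement doesn't depend on the resolution of the host timer; dividing by N averages away the per-call jitter.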

Thanks for a fast reply!

I’ve put the calls I showed into a for loop that executes 1000 times (debug mode), and the results are:
timer1: 813.954163 timer2: 813.959656
timer1: 759.677673 timer2: 759.679321
timer1: 676.974365 timer2: 676.975769
timer1: 631.808044 timer2: 631.809509
timer1: 575.362976 timer2: 575.364380
timer1: 527.082947 timer2: 527.084534
timer1: 468.326721 timer2: 468.328033
timer1: 426.717712 timer2: 426.718903
timer1: 412.975494 timer2: 412.976624
timer1: 382.578186 timer2: 382.579285
timer1: 409.975159 timer2: 409.976929
timer1: 359.730316 timer2: 359.731873
timer1: 327.295349 timer2: 327.296753
timer1: 302.982880 timer2: 302.984070
timer1: 285.976837 timer2: 285.978119
timer1: 269.095581 timer2: 269.096649
timer1: 249.329086 timer2: 249.329895
timer1: 270.317993 timer2: 270.319092
timer1: 218.663742 timer2: 218.664825
timer1: 220.617828 timer2: 220.619232
timer1: 201.671463 timer2: 201.672638
timer1: 181.915573 timer2: 181.916794
timer1: 172.801559 timer2: 172.803024
timer1: 199.936600 timer2: 199.938278
timer1: 157.918091 timer2: 157.919174
timer1: 155.149612 timer2: 155.150772

Seems to me that the results are around 1000x larger than with a single execution, so the per-call averages come out the same…