Hi all!
I’m developing a CUDA-based project and I’m having some problems with timings and overheads. Let me explain.
I have a loop in host code that launches several kernels. I wanted to check how big the launch overhead of some of these kernels is, so I did something like this:
(this is only a part of the loop)
[codebox]// execute kernels
cutilCheckError( cutResetTimer( timer1 ) );
cutilCheckError( cutResetTimer( timer2 ) );
cutilCheckError( cutStartTimer( timer1 ) );
cutilCheckError( cutStartTimer( timer2 ) );
// 1. GAMMA → result in d_image
GammaNormalize<<<gammaDimGrid, gammaDimBlock, 0, stream[0]>>>(d_rawImage[currBuff], d_image, currentImgSize);
cutilCheckMsg("GammaNormalize() execution failed\n");
// 2. CONVOLUTION (row and column)
ConvolutionRow<<<convolBlockGridRows, convolThreadBlockRows, 0, stream[0]>>>(d_dxImage, d_image, currentImgWidth, currentImgHeight);
cutilCheckMsg("convolutionRow() execution failed\n");
ConvolutionColumn<<<convolBlockGridColumns, convolThreadBlockColumns, 0, stream[0]>>>(d_dyImage, d_image,
    currentImgWidth, currentImgHeight,
    CONVOL_COLUMN_TILE_W * convolThreadBlockColumns.y,
    currentImgWidth * convolThreadBlockColumns.y);
cutilCheckMsg("convolutionColumn() execution failed\n");
// 3. GRADIENTS
CreateGradients<<<gradDimGrid, gradDimBlock, 0, stream[0]>>>(d_dxImage, d_dyImage, d_gradImage, currentImgSize);
cutilCheckMsg("createGradients() execution failed\n");
cutilCheckError( cutStopTimer( timer1 ) );
cutilSafeCall( cudaStreamSynchronize(stream[0]) );
cutilCheckError( cutStopTimer( timer2 ) );
printf("timer1: %f\ttimer2: %f\n", cutGetTimerValue(timer1), cutGetTimerValue(timer2)); [/codebox]
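(For reference, this is roughly how the same measurement could be done with CUDA events instead of the cutil timers. This is an untested sketch, and the event names are just placeholders.)
[codebox]// untested sketch: the same measurement with CUDA events instead of the cutil timers
cudaEvent_t evStart, evStop;                        // placeholder names
cutilSafeCall( cudaEventCreate(&evStart) );
cutilSafeCall( cudaEventCreate(&evStop) );

cutilSafeCall( cudaEventRecord(evStart, stream[0]) );
GammaNormalize<<<gammaDimGrid, gammaDimBlock, 0, stream[0]>>>(d_rawImage[currBuff], d_image, currentImgSize);
// ... ConvolutionRow, ConvolutionColumn, CreateGradients exactly as above ...
cutilSafeCall( cudaEventRecord(evStop, stream[0]) );
cutilSafeCall( cudaEventSynchronize(evStop) );      // wait until evStop has actually been reached

float gpuMs = 0.0f;
cutilSafeCall( cudaEventElapsedTime(&gpuMs, evStart, evStop) );
printf("GPU time (events): %f ms\n", gpuMs);

cutilSafeCall( cudaEventDestroy(evStart) );
cutilSafeCall( cudaEventDestroy(evStop) );[/codebox]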
In each iteration the data the kernels have to process gets smaller and smaller, so they should finish faster.
The results I’m getting are very strange though: in each iteration both timer values decrease, and they stay almost identical! Here’s a sample:
timer1: 0.824346 timer2: 0.826464
timer1: 0.795566 timer2: 0.796336
timer1: 0.722341 timer2: 0.723090
timer1: 0.671594 timer2: 0.672464
timer1: 0.635654 timer2: 0.636434
timer1: 0.579669 timer2: 0.580519
timer1: 0.507759 timer2: 0.508539
timer1: 0.457830 timer2: 0.458486
timer1: 0.436223 timer2: 0.436842
timer1: 0.414291 timer2: 0.415033
timer1: 0.413574 timer2: 0.414233
timer1: 0.375591 timer2: 0.376296
timer1: 0.408135 timer2: 0.408880
timer1: 0.343531 timer2: 0.344303
timer1: 0.325699 timer2: 0.326291
timer1: 0.301697 timer2: 0.302424
timer1: 0.322406 timer2: 0.323133
timer1: 0.289581 timer2: 0.290446
timer1: 0.260193 timer2: 0.260937
timer1: 0.270434 timer2: 0.271263
timer1: 0.231614 timer2: 0.232366
timer1: 0.213887 timer2: 0.214629
timer1: 0.208784 timer2: 0.209609
timer1: 0.190950 timer2: 0.191697
timer1: 0.209802 timer2: 0.210589
timer1: 0.199977 timer2: 0.200807
If I understand things correctly, kernel launches are asynchronous, so timer1 should only measure the launch overhead (which should be about the same each time, right?), while timer2 should include the actual execution time.
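To make what I expect more concrete, here is a stripped-down sketch of the same pattern (dummyKernel and the grid/block/d_data/n names are just placeholders, not my real code):
[codebox]// untested sketch of what I expect to happen
__global__ void dummyKernel(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

// inside the host loop:
cutilCheckError( cutResetTimer( timer1 ) );
cutilCheckError( cutResetTimer( timer2 ) );
cutilCheckError( cutStartTimer( timer1 ) );
cutilCheckError( cutStartTimer( timer2 ) );
dummyKernel<<<grid, block, 0, stream[0]>>>(d_data, n);   // asynchronous: should return almost immediately
cutilCheckError( cutStopTimer( timer1 ) );               // expected: small, roughly constant launch overhead
cutilSafeCall( cudaStreamSynchronize(stream[0]) );       // blocks until the kernel has finished
cutilCheckError( cutStopTimer( timer2 ) );               // expected: launch overhead + actual execution time[/codebox]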
Could anyone explain this to me, please? :)
edit: I tried it in both debug and release modes… same effect.