Weird thing about timing a function in CUDA code

Hi everyone:

I ran the "separable convolution" code (included in the SDK examples) today, and I wanted to measure how much time it costs.

The following is my code:

CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutResetTimer(hTimer) );
CUT_SAFE_CALL( cutStartTimer(hTimer) );
for (hoho = 0; hoho < numInteration; hoho++)
{
    CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_Kernel_X, h_Kernel_x, KERNEL_SIZE_X) );
    CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_Kernel_Y, h_Kernel_y, KERNEL_SIZE_Y) );

    convolutionRowGPU<<<blockGridRows, threadBlockRows>>>(
        d_DataB,
        d_DataA,
        DATA_W,
        DATA_H
    );
    CUT_CHECK_ERROR("convolutionRowGPU() execution failed\n");

    convolutionColumnGPU<<<blockGridColumns, threadBlockColumns>>>(
        d_Temp,
        d_DataB,
        DATA_W,
        DATA_H,
        COLUMN_TILE_W * threadBlockColumns.y,
        DATA_W * threadBlockColumns.y
    );
    CUT_CHECK_ERROR("convolutionColumnGPU() execution failed\n");
}
CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL(cutStopTimer(hTimer));

gpuTime = cutGetTimerValue(hTimer) / (float)numInteration;
printf("GPU anigaussian time : %f msec // %f Mpixels/sec, i = %d\n", gpuTime, 1e-6 * DATA_W * DATA_H / (gpuTime * 0.001), hoho);

The result: the whole process (memory copy, row filter, and column filter) costs 0.335 ms. Then I timed the parts separately, and something strange happened!
The memory copy costs 0.102 ms, while the row filter and column filter together cost 0.117 ms, which means 0.102 + 0.117 != 0.335. :yucky:

I have no idea what is happening. Does anyone have an explanation?

I put cudaThreadSynchronize() before and after the kernel executions.

:mellow: Timing the memory copy:

CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutResetTimer(hTimer) );
CUT_SAFE_CALL( cutStartTimer(hTimer) );
for (hoho = 0; hoho < numInteration; hoho++)
{
    CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_Kernel_X, h_Kernel_x, KERNEL_SIZE_X) );
    CUDA_SAFE_CALL( cudaMemcpyToSymbol(d_Kernel_Y, h_Kernel_y, KERNEL_SIZE_Y) );
}
CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL(cutStopTimer(hTimer));

gpuTime = cutGetTimerValue(hTimer) / (float)numInteration;
printf("GPU anigaussian time : %f msec // %f Mpixels/sec, i = %d\n", gpuTime, 1e-6 * DATA_W * DATA_H / (gpuTime * 0.001), hoho);

:mellow: Timing the row filter and column filter:

CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutResetTimer(hTimer) );
CUT_SAFE_CALL( cutStartTimer(hTimer) );
for (hoho = 0; hoho < numInteration; hoho++)
{
    convolutionRowGPU<<<blockGridRows, threadBlockRows>>>(
        d_DataB,
        d_DataA,
        DATA_W,
        DATA_H
    );
    CUT_CHECK_ERROR("convolutionRowGPU() execution failed\n");

    convolutionColumnGPU<<<blockGridColumns, threadBlockColumns>>>(
        d_Temp,
        d_DataB,
        DATA_W,
        DATA_H,
        COLUMN_TILE_W * threadBlockColumns.y,
        DATA_W * threadBlockColumns.y
    );
    CUT_CHECK_ERROR("convolutionColumnGPU() execution failed\n");
}
CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL(cutStopTimer(hTimer));

gpuTime = cutGetTimerValue(hTimer) / (float)numInteration;
printf("GPU anigaussian time : %f msec // %f Mpixels/sec, i = %d\n", gpuTime, 1e-6 * DATA_W * DATA_H / (gpuTime * 0.001), hoho);

I don't understand. What problem are you referring to?

I mean that the whole process includes two parts: the memory copy and the filters. If I time the whole process, it costs 0.335 ms. If I time the two parts separately, the memory copy costs 0.102 ms, while the filters cost 0.117 ms. I expected the time cost of the memory copy plus the time cost of the filters to equal the time cost of the whole process. But the result contradicts my expectation, because (0.102 + 0.117) != 0.335.

Try using CUDA events to time your code… that's the standard and much more reliable way to time GPU calls. What results do you get then?
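For reference, here is a minimal sketch of event-based timing, reusing the kernels, device symbols, and `numInteration` count from the code above. The `cudaEvent*` calls are standard CUDA runtime API functions; everything else is carried over from the original snippet.

```cuda
// Sketch: timing the same loop with CUDA events instead of the CUT timer.
cudaEvent_t start, stop;
float elapsedMs = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);  // enqueue start marker on the default stream
for (int i = 0; i < numInteration; i++)
{
    cudaMemcpyToSymbol(d_Kernel_X, h_Kernel_x, KERNEL_SIZE_X);
    cudaMemcpyToSymbol(d_Kernel_Y, h_Kernel_y, KERNEL_SIZE_Y);

    convolutionRowGPU<<<blockGridRows, threadBlockRows>>>(
        d_DataB, d_DataA, DATA_W, DATA_H);

    convolutionColumnGPU<<<blockGridColumns, threadBlockColumns>>>(
        d_Temp, d_DataB, DATA_W, DATA_H,
        COLUMN_TILE_W * threadBlockColumns.y,
        DATA_W * threadBlockColumns.y);
}
cudaEventRecord(stop, 0);   // enqueue stop marker after all work
cudaEventSynchronize(stop); // block until the stop event has completed

// Elapsed time between the two events, measured on the GPU side.
cudaEventElapsedTime(&elapsedMs, start, stop);
printf("GPU time: %f msec per iteration\n", elapsedMs / numInteration);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded in the same stream as the kernels, the elapsed time covers exactly the enqueued GPU work, independent of when the host thread happens to regain control.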