Kernel execution time / kernel performance

Hi all,

I’m not clear about the execution time of a CUDA kernel. Here is the program:

__global__ void convolutionRowGPU(
	uchar4 *d_Result,   // uchar4* so the store of 'sum' below type-checks (was uint4*)
	int dataW
)
{
	const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
	const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
	const float ex = (float)IMUL(ix, 16) + 0.5f;
	const float ey = (float)iy + 0.5f;

	uchar4 sum = make_uchar4(0, 0, 0, 0);
	sum.x = convolutionRow<KERNEL_DIAMETER>(ex,     ey);
	sum.y = convolutionRow<KERNEL_DIAMETER>(ex + 1, ey);
	sum.z = convolutionRow<KERNEL_DIAMETER>(ex + 2, ey);
	sum.w = convolutionRow<KERNEL_DIAMETER>(ex + 3, ey);

	d_Result[IMUL(iy, dataW) + ix] = sum;
}

This kernel needs 400 us, but when I measure the kernel below, which differs only in omitting the write to d_Result, the time is just 60 us:

__global__ void convolutionRowGPU(
	uchar4 *d_Result,
	int dataW
)
{
	const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
	const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
	const float ex = (float)IMUL(ix, 16) + 0.5f;
	const float ey = (float)iy + 0.5f;

	uchar4 sum = make_uchar4(0, 0, 0, 0);
	sum.x = convolutionRow<KERNEL_DIAMETER>(ex,     ey);
	sum.y = convolutionRow<KERNEL_DIAMETER>(ex + 1, ey);
	sum.z = convolutionRow<KERNEL_DIAMETER>(ex + 2, ey);
	sum.w = convolutionRow<KERNEL_DIAMETER>(ex + 3, ey);

	// no write to d_Result
}

And when I measure the kernel below, which doesn’t compute the convolution and just writes out the result, the time spent is only 65 us:

__global__ void convolutionRowGPU(
	uchar4 *d_Result,
	int dataW
)
{
	const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
	const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
	const float ex = (float)IMUL(ix, 16) + 0.5f;
	const float ey = (float)iy + 0.5f;

	uchar4 sum = make_uchar4(0, 0, 0, 0);

	d_Result[IMUL(iy, dataW) + ix] = sum;
}

Now I am confused: why isn’t the total time 60 + 65 = 125 us, but instead about 400 us? Can you give me some help with this? Is there some fixed overhead in this kernel?

Thanks very much

If you do not output anything from your kernel, the optimizer will throw away all the computation. Only calculations that are needed to produce values written to global memory are actually executed. So your 60 us version is not measuring the convolution at all, only the launch and index arithmetic, and your 65 us version is measuring only the store. In the full kernel the store makes the convolution live, so you pay its real cost.
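To illustrate this dead-code-elimination effect, here is a minimal sketch, separate from the thread's convolution code; the kernel names and the busy-loop arithmetic are invented for the example, not taken from the original program:

```cuda
#include <cuda_runtime.h>

// Hypothetical expensive per-thread computation.
__device__ float heavyCompute(float x)
{
    for (int i = 0; i < 1000; ++i)
        x = x * 1.0001f + 0.5f;
    return x;
}

// The result is written to global memory, so the compiler must
// keep heavyCompute(): timing this kernel measures the real work.
__global__ void withStore(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = heavyCompute((float)i);
}

// No global-memory write: 'v' is dead, so the compiler removes the
// call to heavyCompute() entirely. Timing this kernel measures only
// launch overhead and index arithmetic, not the computation.
__global__ void withoutStore(float * /*out*/)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = heavyCompute((float)i);
    (void)v;  // discarded; eliminated by the optimizer
}
```

Timing `withoutStore` therefore tells you almost nothing about the cost of the computation, and the store-only kernel in turn hides that the convolution itself dominates. The three timings are not additive pieces of the full kernel; only the full version keeps every instruction live.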