Hi all,
I'm not clear about the execution time of a CUDA kernel. Here is the program:
__global__ void convolutionRowGPU(
    uchar4 *d_Result,   // uchar4 (not uint4), so the type matches the uchar4 sum stored below
    int dataW
)
{
    const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
    const float ex = (float)IMUL(ix, 16) + 0.5f;
    const float ey = (float)iy + 0.5f;
    uchar4 sum = make_uchar4(0, 0, 0, 0);
    sum.x = convolutionRow<KERNEL_DIAMETER>(ex, ey);
    sum.y = convolutionRow<KERNEL_DIAMETER>(ex + 1, ey);
    sum.z = convolutionRow<KERNEL_DIAMETER>(ex + 2, ey);
    sum.w = convolutionRow<KERNEL_DIAMETER>(ex + 3, ey);
    d_Result[IMUL(iy, dataW) + ix] = sum;
}
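For reference, IMUL is the integer-multiply helper macro from the SDK samples; assuming it has the usual definition there, it is just:

#define IMUL(a, b) __mul24(a, b)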
This kernel needs about 400 µs. But when I measure the kernel below, which only omits the write to d_Result, the time is just 60 µs:
__global__ void convolutionRowGPU(
    uchar4 *d_Result,
    int dataW
)
{
    const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
    const float ex = (float)IMUL(ix, 16) + 0.5f;
    const float ey = (float)iy + 0.5f;
    uchar4 sum = make_uchar4(0, 0, 0, 0);
    sum.x = convolutionRow<KERNEL_DIAMETER>(ex, ey);
    sum.y = convolutionRow<KERNEL_DIAMETER>(ex + 1, ey);
    sum.z = convolutionRow<KERNEL_DIAMETER>(ex + 2, ey);
    sum.w = convolutionRow<KERNEL_DIAMETER>(ex + 3, ey);
    // same as above, but the write to d_Result is omitted
}
And when I measure the kernel below, which skips the convolution and only writes the result, the time is just 65 µs:
__global__ void convolutionRowGPU(
    uchar4 *d_Result,
    int dataW
)
{
    const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
    const float ex = (float)IMUL(ix, 16) + 0.5f;  // ex/ey computed as before, but unused here
    const float ey = (float)iy + 0.5f;
    uchar4 sum = make_uchar4(0, 0, 0, 0);
    d_Result[IMUL(iy, dataW) + ix] = sum;         // no convolution, just the store
}
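In case the measurement method matters, this is roughly how I take all three timings, using CUDA events (a minimal sketch; grid, block, and d_Result are set up elsewhere in my real code):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
convolutionRowGPU<<<grid, block>>>(d_Result, dataW);   // grid/block as in my app
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                            // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);                // elapsed time in milliseconds
printf("kernel time: %.1f us\n", ms * 1000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);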
Now I am confused: why is the total time not about 60 + 65 = 125 µs, but roughly 400 µs? Can you give me some help with this? Is there any fixed overhead in this kernel?
Thanks very much