Hello to all,
I am profiling my GPU kernels and I have noticed some strange behavior.
The elapsed time for just reading values from global memory into shared memory and writing them back from shared to global memory
is larger (roughly double!) than when I also compute the values.
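For reference, this is roughly how I take the timings (a simplified sketch; grid, block, d_vIn and d_vOut stand in for my actual launch configuration and device pointers):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    compute<<<grid, block>>>(d_vIn, d_vOut);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, start, stop);   //-- elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);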
The kernel loads values from 2D pitched linear global memory into tiles of shared memory:
__global__ void compute( REAL4* vIn,
                         REAL4* vOut)
{
    //-- thread index
    int tid_x = blockIdx.x*blockDim.x + threadIdx.x;
    int tid_y = blockIdx.y*blockDim.y + threadIdx.y;

    REAL4 F = make_float4(0.0f, 0.0f, 0.0f, 1.0f);

    //-- gather
    gatherToSharedMemory2D(vIn, tid_y, tid_x);
    __syncthreads();

    //-- compute
    if ((tid_x < d_cst_width) && (tid_y < d_cst_height))
    {
        evaluate(&F, tid_y, tid_x);
    }

    //-- scatter
    if ((tid_x < d_cst_width) && (tid_y < d_cst_height))
    {
        *(vOut + tid_y*d_cst_width + tid_x) = F;
    }
}
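For completeness, gatherToSharedMemory2D does essentially this (a simplified sketch: the 16x16 tile size, the s_tile name and the d_cst_pitch constant are assumptions, and in my real code the tile is also visible to evaluate()):

    #define BLOCK_W 16                    //-- assumed tile width
    #define BLOCK_H 16                    //-- assumed tile height
    __constant__ int d_cst_pitch;         //-- pitch of vIn in bytes (assumed name)

    __device__ void gatherToSharedMemory2D(REAL4* vIn, int tid_y, int tid_x)
    {
        //-- one tile element per thread (no halo cells shown in this sketch)
        __shared__ REAL4 s_tile[BLOCK_H][BLOCK_W];

        if ((tid_x < d_cst_width) && (tid_y < d_cst_height))
        {
            //-- step down tid_y rows of d_cst_pitch bytes, then index column tid_x
            REAL4* row = (REAL4*)((char*)vIn + tid_y * d_cst_pitch);
            s_tile[threadIdx.y][threadIdx.x] = row[tid_x];
        }
    }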
If I comment out the compute section (the call to evaluate), the measured kernel time increases.
Could you suggest an explanation?
Thanks in advance
Francesco