Longer computation time for fewer operations

Hello all,

I am profiling my GPU kernels and have noticed some strange behavior.

The elapsed time for just reading values from global memory into shared memory and writing from shared memory back to global memory is larger (about double!) than when I also compute values.

The kernel loads values from 2D pitched linear global memory into tiles of shared memory.

__global__ void compute(REAL4* vIn,
                        REAL4* vOut)
{
   //-- thread index
   int tid_x = blockIdx.x*blockDim.x + threadIdx.x;
   int tid_y = blockIdx.y*blockDim.y + threadIdx.y;

   REAL4 F = make_float4(0.0f,0.0f,0.0f,1.0f);

   //-- gather
   gatherToSharedMemory2D(vIn, tid_y, tid_x);

   //-- compute
   if ((tid_x < d_cst_width) && (tid_y < d_cst_height))
   {
      evaluate(&F, tid_y, tid_x);
   }

   //-- scatter
   if ((tid_x < d_cst_width) && (tid_y < d_cst_height))
   {
      *(vOut + tid_y*d_cst_width+tid_x)=F;
   }
}

If I comment out the compute section, the kernel's elapsed time increases.
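
For context, a comparison like this depends on how the elapsed time is taken. The following is only a sketch of one common approach, timing with CUDA events after a warm-up launch. It is not the harness used for the numbers above: the kernel `copyOnly`, the helper `averageLaunchMs`, the plain (non-pitched) indexing and the 16x16 block size are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for a copy-only kernel: each thread copies one
// float4 from vIn to vOut (plain row-major indexing, no pitch, no shared memory).
__global__ void copyOnly(const float4* vIn, float4* vOut, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        vOut[y * width + x] = vIn[y * width + x];
}

// Returns the average time of one launch in milliseconds, or -1.0f on error.
float averageLaunchMs(int width, int height, int reps)
{
    if (width <= 0 || height <= 0 || reps <= 0) return -1.0f;

    size_t bytes = size_t(width) * size_t(height) * sizeof(float4);
    float4 *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return -1.0f; }
    cudaMemset(dIn, 0, bytes);

    dim3 block(16, 16);  // assumed tile size
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    // Warm-up launch so one-time start-up costs stay out of the measurement.
    copyOnly<<<grid, block>>>(dIn, dOut, width, height);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        copyOnly<<<grid, block>>>(dIn, dOut, width, height);
    cudaEventRecord(stop);
    // Launches are asynchronous: wait for the stop event before reading the time.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dIn);
    cudaFree(dOut);
    return ok ? ms / reps : -1.0f;
}
```

Calling, for example, `averageLaunchMs(1024, 1024, 100)` once per kernel variant gives per-launch averages that can be compared directly. Averaging over many launches and synchronizing on the stop event are the two points most likely to matter when two variants appear to rank in the wrong order.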

Could you suggest an explanation?

Thanks in advance