Longer computation time for fewer operations (already posted in the General Discussion topic)

Hello to all,

I apologize for cross-posting from the General Discussion forum, but my topic fits better in this section.

I am profiling my GPU kernels, and I noticed a strange behavior.

The elapsed time for just reading values from global memory into shared memory and writing them back from shared to global memory

is larger (double!) than when I also compute the values.

The kernel loads values from 2D pitched linear global memory into tiles of shared memory.

__global__ void compute(REAL4* vIn,
                        REAL4* vOut)
{
   //-- thread index
   int tid_x = blockIdx.x*blockDim.x + threadIdx.x;
   int tid_y = blockIdx.y*blockDim.y + threadIdx.y;

   REAL4 F = make_float4(0.0f, 0.0f, 0.0f, 1.0f);

   //-- gather
   gatherToSharedMemory2D(vIn, tid_y, tid_x);

   //-- compute
   if ((tid_x < d_cst_width) && (tid_y < d_cst_height))
      evaluate(&F, tid_y, tid_x);

   //-- scatter
   if ((tid_x < d_cst_width) && (tid_y < d_cst_height))
      *(vOut + tid_y*d_cst_width + tid_x) = F;
}



If I comment out the compute section, the measured time increases.

I get timings using cudaEvents:

   cudaEventRecord(startEvent, 0);
   for (unsigned i = 0; i != uIteration; i++)
      compute<<<grid, block>>>(vIn, vOut);
   cudaEventRecord(stopEvent, 0);
   cudaEventSynchronize(stopEvent);

   float time;
   cudaEventElapsedTime(&time, startEvent, stopEvent);

Could you suggest an explanation?

Thanks in advance


Your evaluate() function probably performs an out of bounds shared memory access, so when you compile it in, the kernel aborts early with an error. Because you have no error checking, you just don’t see the error.
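To make the "no error checking" point concrete, here is a minimal sketch of launch-error checking (the `CUDA_CHECK` macro name and the surrounding usage are my own illustration, not code from the poster's simulator):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report any CUDA runtime error with file/line context.
#define CUDA_CHECK(call)                                              \
   do {                                                               \
      cudaError_t err = (call);                                       \
      if (err != cudaSuccess) {                                       \
         fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
                 cudaGetErrorString(err), __FILE__, __LINE__);        \
      }                                                               \
   } while (0)

// After a kernel launch:
//    compute<<<grid, block>>>(vIn, vOut);
//    CUDA_CHECK(cudaGetLastError());        // catches launch-time errors
//    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs
```

Note that kernel launches are asynchronous, so an execution error (such as an aborted grid) only surfaces at the next synchronizing API call; checking both `cudaGetLastError()` and the result of `cudaDeviceSynchronize()` covers both cases.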


Thank you for the response, first.

I am not sure your answer is right, because this code is a crucial part of a cloth simulator, and the visual results are as expected.

Further, I don’t get the point…if one thread performs an out-of-bounds access, the whole kernel launch crashes?

Could you suggest a technique to prevent out-of-bounds accesses?

Thanks a lot


Yes. If any thread performs an out of bounds shared memory access, the entire grid is aborted and the API will return an error. It is the only thing I can think of which would cause the run time with additional code to be smaller than the bare memory accesses.

Preventing it means fixing your code and I can’t tell you how to do that. But detection can be done by running the code with cuda-memcheck.
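For reference, the invocation is straightforward (the binary name here is made up):

```shell
# Report out-of-bounds and misaligned global/shared memory accesses
cuda-memcheck ./cloth_sim

# Separately, check for shared-memory data races
cuda-memcheck --tool racecheck ./cloth_sim
```

The default memcheck tool flags bad addresses; the racecheck tool is the one that would catch the kind of shared-memory write race discussed below.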

Maybe I found the problem:

it is the variable F…maybe a concurrent write.

Do you think it is bad practice to pass local variables by reference or by pointer to device functions?

Passing local variables by pointer or reference to device functions should be fine. While a concurrent write or race might cause unpredictable results, it won’t necessarily cause execution problems. I say necessarily because there are still lots of ways a race on a pointer or index variable might lead to out of bounds memory access. It is really impossible to say more without seeing code.
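One cheap way to catch a bad index before it corrupts shared memory is a device-side `assert`, available on compute capability 2.x and later in debug builds. A minimal sketch (the `TILE_DIM` constant and `storeToTile` helper are hypothetical, not from the poster's code):

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define TILE_DIM 16  // hypothetical shared-memory tile size

// Guarded store into a shared-memory tile: in debug builds, an
// out-of-range index aborts the kernel with a file/line report
// instead of silently writing past the tile.
__device__ void storeToTile(float4* tile, int row, int col, float4 v)
{
   assert(row >= 0 && row < TILE_DIM);
   assert(col >= 0 && col < TILE_DIM);
   tile[row * TILE_DIM + col] = v;
}
```

Device-side asserts are compiled out when `NDEBUG` is defined, so the guards cost nothing in release builds.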