Going crazy with shared memory

Hi

There is probably something I am still missing with the shared memory approach.

I have the following kernel, which works fine:

__global__ void DiffX_GPU(float* d_U, float* d_Ux, int Nx, int alpha, float* d_stencils, int rank)
{
    // indices
    const int b_i = blockIdx.x*blockDim.x + threadIdx.x;
    int row = b_i;
    float value = 0.0f;

    // Compute dot-product between FDM stencil weights and input vector U
    int diff = 0;               // diff is used for automatically taking one-sided differences near the boundaries
    if (row < alpha)
        diff = alpha - row;
    else if (row > Nx-1-alpha)  // row  >  Nx-3 Nx-2 Nx-1
        diff = Nx-1-alpha-row;

    int tmp  = (alpha-diff)*rank + alpha;
    int tmp2 = row + diff;

    int i;
    for (i = -alpha; i < alpha+1; ++i)
        value += d_U[tmp2+i]*d_stencils[tmp+i];

    // Store computed approximation
    d_Ux[row] = value;
}

I need to make a shared-memory version. But first of all, I only manage to go down from 0.23 ms to 0.19 ms per call. Moreover, the results are wrong.
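(For reference, per-call timings like these are usually taken with CUDA events around the launch. The sketch below is just one way to measure it; N_BLOCKS and N_THREADS are placeholders, not values from my code:)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
DiffX_GPU<<<N_BLOCKS, N_THREADS>>>(d_U, d_Ux, Nx, alpha, d_stencils, rank);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds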

Any idea what I am messing up here?

__global__ void DiffX_GPU(float* d_U, float* d_Ux, int Nx, int alpha, float* d_stencils, int rank)
{
    // indices
    const int b_i = blockIdx.x*blockDim.x + threadIdx.x;
    int row = b_i;
    float value = 0.0f;

    // Compute dot-product between FDM stencil weights and input vector U
    int diff = 0;               // diff is used for automatically taking one-sided differences near the boundaries
    if (row < alpha)
        diff = alpha - row;
    else if (row > Nx-1-alpha)  // row  >  Nx-3 Nx-2 Nx-1
        diff = Nx-1-alpha-row;

    int tmp  = (alpha-diff)*rank + alpha;
    int tmp2 = row + diff;

    __shared__ float s_U[Shared_block];
    for (int b = 0; b < Nx; b++)
        s_U[threadIdx.x] = d_U[tmp2+threadIdx.x];

    int i;
    for (i = -alpha; i < alpha+1; ++i)
        value += s_U[tmp2+i]*d_stencils[tmp+i];

    // Store computed approximation
    d_Ux[row] = value;

    // synchronize
    __syncthreads();
}

Thanks in advance.

You are missing a __syncthreads() before the int i;.

Why is the __syncthreads() supposed to be before the int i and not afterwards? I have never heard of this. Btw, it does not change anything; it just makes the program slower.

You can just as well put the __syncthreads() directly after the int i, if you wish. The important point is that there always needs to be one between one thread writing shared memory and a different thread reading the same location.
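To make the ordering concrete, here is a minimal sketch of that pattern (Shared_block is assumed to equal the block size, and the indices are placeholders rather than the ones from your kernel):

__shared__ float s_U[Shared_block];
s_U[threadIdx.x] = d_U[blockIdx.x*blockDim.x + threadIdx.x];   // each thread writes one element
__syncthreads();                                               // barrier: all writes must be visible before any cross-thread read
float left = (threadIdx.x > 0) ? s_U[threadIdx.x - 1] : 0.0f;  // only now is reading a neighbour's element safe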

A __syncthreads() as the very last instruction of a kernel is always superfluous, as the scope of shared memory always extends until the last thread of a block has exited.

There is another, unrelated bug in your kernel: you are loading the wrong range of elements of d_U into shared memory, which can be seen from the fact that i can become negative but threadIdx.x is always non-negative.
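For illustration, a minimal sketch of one common way to load the halo so that the negative offsets stay inside the shared array. This assumes Shared_block >= blockDim.x + 2*alpha and interior rows only; the one-sided boundary handling (the diff logic) of the original kernel is left out:

__shared__ float s_U[Shared_block];              // assumed size: blockDim.x + 2*alpha
int local = threadIdx.x + alpha;                 // local index, shifted by the halo width
s_U[local] = d_U[row];                           // each thread loads its own element
if (threadIdx.x < alpha)
{
    s_U[local - alpha]      = d_U[row - alpha];       // left halo
    s_U[local + blockDim.x] = d_U[row + blockDim.x];  // right halo
}
__syncthreads();                                 // all loads finished before any thread reads

for (int i = -alpha; i < alpha+1; ++i)
    value += s_U[local + i]*d_stencils[tmp + i]; // indices now stay within [0, blockDim.x + 2*alpha)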