Going crazy with shared memory

Hi

There is probably something I am still missing with the shared memory approach.

I have the following kernel, which works fine:

__global__ void DiffX_GPU(float* d_U, float* d_Ux, int Nx, int alpha, float* d_stencils, int rank)
{
    // indices
    const int b_i = blockIdx.x*blockDim.x + threadIdx.x;
    int row = b_i;
    float value = 0.0f;

    // Compute dot-product between FDM stencil weights and input vector U
    int diff = 0;               // diff is used for automatically taking one-sided differences near the boundaries
    if (row < alpha)
        diff = alpha - row;
    else if (row > Nx-1-alpha)  // row  >  Nx-3 Nx-2 Nx-1
        diff = Nx-1-alpha-row;

    int tmp  = (alpha-diff)*rank + alpha;
    int tmp2 = row + diff;

    int i;
    for (i = -alpha; i < alpha+1; ++i)
        value += d_U[tmp2+i]*d_stencils[tmp+i];

    // Store computed approximation
    d_Ux[row] = value;
}

I need to make a shared-memory version. But first of all, I only manage to go down from 0.23 ms to 0.19 ms per call. Moreover, the results are wrong.
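(For reference, per-call timings like these are usually taken with CUDA events around the launch. The sketch below is just one way to measure it; N_BLOCKS and N_THREADS are placeholders, not values from my code:)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
DiffX_GPU<<<N_BLOCKS, N_THREADS>>>(d_U, d_Ux, Nx, alpha, d_stencils, rank);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds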

Any idea what I am messing up here?

__global__ void DiffX_GPU(float* d_U, float* d_Ux, int Nx, int alpha, float* d_stencils, int rank)
{
    // indices
    const int b_i = blockIdx.x*blockDim.x + threadIdx.x;
    int row = b_i;
    float value = 0.0f;

    // Compute dot-product between FDM stencil weights and input vector U
    int diff = 0;               // diff is used for automatically taking one-sided differences near the boundaries
    if (row < alpha)
        diff = alpha - row;
    else if (row > Nx-1-alpha)  // row  >  Nx-3 Nx-2 Nx-1
        diff = Nx-1-alpha-row;

    int tmp  = (alpha-diff)*rank + alpha;
    int tmp2 = row + diff;

    __shared__ float s_U[Shared_block];
    for (int b = 0; b < Nx; b++)
        s_U[threadIdx.x] = d_U[tmp2+threadIdx.x];

    int i;
    for (i = -alpha; i < alpha+1; ++i)
        value += s_U[tmp2+i]*d_stencils[tmp+i];

    // Store computed approximation
    d_Ux[row] = value;

    // synchronize
    __syncthreads();
}

Thanks in advance.

You are missing a __syncthreads() before the int i;.

Why is the __syncthreads() supposed to be before the int i and not afterwards? I have never heard of this. Btw, it does not change anything; it just makes the program slower.

You can just as well put the __syncthreads() directly after the int i, if you wish. The important point is that there always needs to be one between one thread writing shared memory and a different thread reading the same location.
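To make the ordering concrete, here is a minimal sketch of that pattern (Shared_block is assumed to equal the block size, and the indices are placeholders rather than the ones from your kernel):

__shared__ float s_U[Shared_block];
s_U[threadIdx.x] = d_U[blockIdx.x*blockDim.x + threadIdx.x];   // each thread writes one element
__syncthreads();                                               // barrier: all writes must be visible before any cross-thread read
float left = (threadIdx.x > 0) ? s_U[threadIdx.x - 1] : 0.0f;  // only now is reading a neighbour's element safe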

A __syncthreads() as the very last instruction of a kernel is always superfluous, as the scope of shared memory always extends until the last thread of a block has exited.

There is another, unrelated bug in your kernel: you are loading the wrong range of elements of d_U into shared memory, which can be seen from the fact that i can become negative but threadIdx.x is always non-negative.
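For illustration, a minimal sketch of one common way to load the halo so that the negative offsets stay inside the shared array. This assumes Shared_block >= blockDim.x + 2*alpha and interior rows only; the one-sided boundary handling (the diff logic) of the original kernel is left out:

__shared__ float s_U[Shared_block];              // assumed size: blockDim.x + 2*alpha
int local = threadIdx.x + alpha;                 // local index, shifted by the halo width
s_U[local] = d_U[row];                           // each thread loads its own element
if (threadIdx.x < alpha)
{
    s_U[local - alpha]      = d_U[row - alpha];       // left halo
    s_U[local + blockDim.x] = d_U[row + blockDim.x];  // right halo
}
__syncthreads();                                 // all loads finished before any thread reads

for (int i = -alpha; i < alpha+1; ++i)
    value += s_U[local + i]*d_stencils[tmp + i]; // indices now stay within [0, blockDim.x + 2*alpha)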