Why is shared memory slower than global memory for gradient computation?

I have two functions that compute the gradient of a 3D image. One reads global memory directly; the other uses shared memory.

When I benchmark the two functions, the global memory version processes 1.2 Mpixels/second while the shared memory version only achieves 170 Kpixels/second.

The results are measured on a Tesla S1070 with an input size of 256^3.

So my question is: why is the shared memory version, which should need much less memory bandwidth, so much slower (about 7 times slower) than the global version?

What is the best practice for stencil problems like gradients or FDM? When should we use shared memory, and when should we not?

Any ideas are greatly appreciated!

Here is the code:

[codebox]
__global__ void cudaComputeGradient_kernel(float * d_gdx, float * d_gdy, float * d_gdz,
                                           float * d_i,
                                           int w, int h, int l,
                                           float sx, float sy, float sz)
{
    int i  = threadIdx.x + blockIdx.x * blockDim.x;
    int j  = threadIdx.y + blockIdx.y * blockDim.y;
    int wh = w * h;
    int id = j * w + i;

    if (i < w && j < h)
        for (int k = 0; k < l; ++k, id += wh){
            // x gradient: one-sided at the borders, centred inside
            if (i == 0)
                d_gdx[id] = (d_i[id+1] - d_i[id]) / sx;
            else if (i == w - 1)
                d_gdx[id] = (d_i[id] - d_i[id-1]) / sx;
            else
                d_gdx[id] = (d_i[id+1] - d_i[id-1]) / (2.f * sx);

            // y gradient
            if (j == 0)
                d_gdy[id] = (d_i[id+w] - d_i[id]) / sy;
            else if (j == h - 1)
                d_gdy[id] = (d_i[id] - d_i[id-w]) / sy;
            else
                d_gdy[id] = (d_i[id+w] - d_i[id-w]) / (2.f * sy);

            // z gradient
            if (k == 0)
                d_gdz[id] = (d_i[id+wh] - d_i[id]) / sz;
            else if (k == l - 1)
                d_gdz[id] = (d_i[id] - d_i[id-wh]) / sz;
            else
                d_gdz[id] = (d_i[id+wh] - d_i[id-wh]) / (2.f * sz);
        }
}

void cudaComputeGradient(float * d_gdx, float * d_gdy, float * d_gdz, // output gradient
                         float * d_i,                                 // input image
                         uint w, uint h, uint l,                      // size of image
                         float sx, float sy, float sz)                // spacing
{
    dim3 threads(16,16);
    dim3 grids(iDivUp16(w), iDivUp16(h));
    cudaComputeGradient_kernel<<<grids, threads>>>(d_gdx, d_gdy, d_gdz, d_i,
                                                   w, h, l,
                                                   sx, sy, sz);
}
[/codebox]

[codebox]
__global__ void cudaComputeGradient_shared_kernel(float * d_gdx, float * d_gdy, float * d_gdz,
                                                  float * d_i,
                                                  int w, int h, int l,
                                                  float sx, float sy, float sz)
{
    const int radius = 1;
    float s_data[16 + radius * 2][16 + radius * 2];   // 16x16 tile plus a one-pixel halo

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;

    if (i < w && j < h){
        int planeSize = w * h;
        int id        = j * w + i;

        // rolling window of three z planes kept in registers
        float back    = 0;
        float current = 0;
        float front   = d_i[id];

        for (int k = 0; k < l; ++k){
            back    = current;
            current = front;
            front   = d_i[id + planeSize];   // note: reads one plane past the end on the last iteration

            __syncthreads();

            int tx = threadIdx.x + radius;
            int ty = threadIdx.y + radius;

            // load the halo columns and rows of the tile
            if (threadIdx.x < radius){
                s_data[ty][threadIdx.x]               = d_i[id - radius];
                s_data[ty][threadIdx.x + 16 + radius] = d_i[id + 16];
            }
            if (threadIdx.y < radius){
                s_data[threadIdx.y][tx]               = d_i[id - radius * w];
                s_data[threadIdx.y + 16 + radius][tx] = d_i[id + 16 * w];
            }
            s_data[ty][tx] = current;

            __syncthreads();

            if (i == 0)
                d_gdx[id] = (s_data[ty][tx+1] - s_data[ty][tx]) / sx;
            else if (i == w - 1)
                d_gdx[id] = (s_data[ty][tx] - s_data[ty][tx-1]) / sx;
            else
                d_gdx[id] = (s_data[ty][tx+1] - s_data[ty][tx-1]) / (2 * sx);

            if (j == 0)
                d_gdy[id] = (s_data[ty+1][tx] - s_data[ty][tx]) / sy;
            else if (j == h - 1)
                d_gdy[id] = (s_data[ty][tx] - s_data[ty-1][tx]) / sy;
            else
                d_gdy[id] = (s_data[ty+1][tx] - s_data[ty-1][tx]) / (2 * sy);

            if (k == 0)
                d_gdz[id] = (front - current) / sz;
            else if (k == l - 1)
                d_gdz[id] = (current - back) / sz;
            else
                d_gdz[id] = (front - back) / (2 * sz);

            id += planeSize;
        }
    }
}

void cudaComputeGradient_shared(float * d_gdx, float * d_gdy, float * d_gdz, // output gradient
                                float * d_i,                                 // input image
                                uint w, uint h, uint l,                      // size of image
                                float sx, float sy, float sz)                // spacing
{
    dim3 threads(16,16);
    dim3 grids(iDivUp16(w), iDivUp16(h));
    cudaComputeGradient_shared_kernel<<<grids, threads>>>(d_gdx, d_gdy, d_gdz, d_i,
                                                          w, h, l,
                                                          sx, sy, sz);
}
[/codebox]

In your second example, did you really combine shared memory with texture reads?

Doing that does not seem to make much sense. The texture reads already solve the bandwidth issue caused by reading the same pixel several times, eliminating the need for shared memory use.
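A minimal sketch of the texture-read approach being described (names are illustrative, using the texture reference API current on that hardware generation):

[codebox]
// Sketch only: read the input image through a 1D texture so repeated
// neighbour accesses hit the texture cache instead of going to DRAM.
texture<float, 1, cudaReadModeElementType> texImage;   // made-up name

__device__ inline float fetchVoxel(int idx)
{
    return tex1Dfetch(texImage, idx);   // cached read, replaces d_i[idx]
}

// Host side, once before the kernel launch:
//   cudaBindTexture(0, texImage, d_i, (size_t)w * h * l * sizeof(float));
[/codebox]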

I suggest you analyze your shared memory access patterns for bank conflicts, for example with the bank checker macro of cutil. It's not perfect, but it may give some clues about what is going on here.

Christian

The second example also contains a number of full 32-bit integer multiplies (and integer-to-float casts) which are going to be a lot slower than the code in the first kernel. I have used the same basic shared memory idea (sans the texture reads) in 3D finite difference kernels and my (probably crappy) version hits about 90 GB/s on a GTX 275 for a second-order compact stencil, so there is nothing wrong with the idea.
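As a sketch of what cheaper index arithmetic could look like on that compute-1.x hardware, the 24-bit multiply intrinsic can replace the full 32-bit multiplies as long as the operands stay below 2^24 (assumed here, not taken from the original code):

[codebox]
// Hypothetical sketch: 24-bit multiplies for index math (valid while w, h < 2^24)
int i  = threadIdx.x + __mul24(blockIdx.x, blockDim.x);
int j  = threadIdx.y + __mul24(blockIdx.y, blockDim.y);
int id = __mul24(j, w) + i;
[/codebox]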

I feel compelled to add that those scrolling code boxes are just evil.

@cbuchner1: There is no getTexVal function in the manual. I think it is just some device function.

@Linh Ha: You are not using shared memory in the second example!

If you want s_data to reside in shared memory, you have to write __shared__ before the type specifier:

__shared__ float s_data[dimX][dimY]

The way you have written it right now, there is a separate array for each thread, and it resides in local memory. Local memory is approximately as slow as global memory.
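Applied to the kernel in the second code box, the tile declaration would then read (same dimensions as in the original):

[codebox]
const int radius = 1;
// one tile per thread block, shared by all 16x16 threads
__shared__ float s_data[16 + radius * 2][16 + radius * 2];
[/codebox]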

I would also suggest precomputing the reciprocals of sx, sy, and sz, and multiplying by those precomputed values where you compute the d_gdx and d_gdy values. Division is much slower than multiplication.
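A sketch of that idea, assuming the factors are hoisted to the top of the kernel (the rsx/rsy/rsz names are made up):

[codebox]
// Sketch: compute the reciprocals once, outside the k loop,
// so each pixel does a multiply instead of a divide.
float rsx = 1.f / sx,  rsx2 = 0.5f / sx;   // one-sided / centred x factors
float rsy = 1.f / sy,  rsy2 = 0.5f / sy;
float rsz = 1.f / sz,  rsz2 = 0.5f / sz;

// e.g. the interior x-difference then becomes:
//   d_gdx[id] = (s_data[ty][tx+1] - s_data[ty][tx-1]) * rsx2;
[/codebox]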

I did try textures. I agree that texture memory access partially solves the problem, but it is not significantly faster than global memory here (since the global memory version already satisfies the coalescing conditions).

You are right, I forgot that. It explains the problem; I will check the performance. Thanks.

With only the small change suggested, __shared__ float s_data, it is twice as fast. Thanks everyone.