Is "GM2 = GM1" faster than "SM = GM1; GM2 = SM;"? (memory access time)

GM1, GM2: global memory arrays
SM: shared memory array.
If I only access GM1 and GM2 once each, I think
GM2[a] = GM1[b] is faster than "SM = GM1[b]; GM2[a] = SM;" (a and b are both random w.r.t. the thread id).
Right?
Thanks!

If you are only moving data between global memory arrays, then correct: shared memory doesn't provide much benefit.
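For instance, a minimal sketch of the two variants being compared (kernel and buffer names are mine, purely illustrative; bounds checks omitted):

__global__ void copyDirect(float *gm2, const float *gm1,
                           const int *a, const int *b)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    gm2[a[tid]] = gm1[b[tid]];                 // GM2[a] = GM1[b]
}

__global__ void copyStaged(float *gm2, const float *gm1,
                           const int *a, const int *b)
{
    __shared__ float sm[256];                  // assumes blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    sm[threadIdx.x] = gm1[b[tid]];             // SM = GM1[b]
    gm2[a[tid]] = sm[threadIdx.x];             // GM2[a] = SM
}

Each element is read and written exactly once, so the staged version just adds a shared-memory round trip with no data reuse to pay for it.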

However, as I’ve learned myself, some things that you might initially implement this way can benefit from staging in shared memory. An example is matrix transpose. The obvious way to transpose a matrix is

out[x][y] = in[y][x];
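Spelled out as a full kernel with flattened indexing, that might look like this (a sketch of mine, names illustrative):

__global__ void transposeNaive(float *out, const float *in,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;           // column of in
    int y = blockIdx.y * blockDim.y + threadIdx.y;           // row of in
    out[x * height + y] = in[y * width + x];                 // out[x][y] = in[y][x]
}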

The problem is that either the read or the write will be uncoalesced, resulting in very low bandwidth for the entire transpose. To fix this, load a 2D block of data into shared memory (with fully coalesced reads), and then write the block (transposed) out into the transposed block location in the destination array (with fully coalesced writes).

If anyone is interested in code for this I can provide it.

Mark

Hi Mark, I am interested in the code you mentioned. Thanks!

Here’s the kernel code. It’s pretty fast.

template <typename T>
__global__ void transpose(T *out, T *in, int width, int height)
{
    // One tile of the input, staged through shared memory so that both
    // the global read and the global write can be fully coalesced.
    __shared__ T block[BLOCK_NUM_THREADS];   // BLOCK_NUM_THREADS = threads per block

    unsigned int xBlock = __mul24(blockDim.x, blockIdx.x);
    unsigned int yBlock = __mul24(blockDim.y, blockIdx.y);

    // load block into smem (coalesced read along a row of in)
    unsigned int index_in =
            __mul24(__mul24(blockDim.x, gridDim.x), yBlock + threadIdx.y) +
            xBlock + threadIdx.x;
    unsigned int index_block = __mul24(threadIdx.y, blockDim.x) + threadIdx.x;
    block[index_block] = in[index_in];

    // transposed position of this thread's element within the tile
    unsigned int index_transpose = __mul24(threadIdx.x, blockDim.x) + threadIdx.y;

    unsigned int index_out =
            __mul24(__mul24(blockDim.x, gridDim.x), xBlock + threadIdx.y) +
            yBlock + threadIdx.x;

    // wait until the whole tile has been loaded
    __syncthreads();

    // write it out (transposed) into the new location (coalesced write)
    out[index_out] = block[index_transpose];
}

Note that

  1. It is templatized so you can easily apply it to arrays of different data types (yes, templates work in CUDA; more on that to come).

  2. It assumes that the array dimensions are a multiple of the block size. If you want to handle arrays that are not, you can just add some if statements to avoid computing on threads whose x or y index is outside of the array. (A launch sketch for the multiple-of-block-size case follows this list.)

  3. It uses 24-bit multiplies on indices to save addressing cycles.
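For example, a launch for the multiple-of-block-size case might look like this (d_out, d_in and the sizes are illustrative names of mine):

    dim3 threads(16, 16);                      // BLOCK_NUM_THREADS = 256
    dim3 grid(width / threads.x, height / threads.y);
    transpose<float><<<grid, threads>>>(d_out, d_in, width, height);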

Mark

One question: "if (thread idx > array size) break;" can't be used before __syncthreads(), right?

So how do you do the if condition?

template <typename T>
__global__ void transpose(T *out, T *in, int width, int height)
{
    __shared__ T block[BLOCK_NUM_THREADS];

    unsigned int xBlock = __mul24(blockDim.x, blockIdx.x);
    unsigned int yBlock = __mul24(blockDim.y, blockIdx.y);
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height)
    {
        // load block into smem
        unsigned int index_in =
                __mul24(__mul24(blockDim.x, gridDim.x), yIndex) + xIndex;
        unsigned int index_block = __mul24(threadIdx.y, blockDim.x) + threadIdx.x;
        block[index_block] = in[index_in];

        index_transpose = __mul24(threadIdx.x, blockDim.x) + threadIdx.y;

        index_out = __mul24(__mul24(blockDim.x, gridDim.x), xBlock + threadIdx.y) +
                yBlock + threadIdx.x;
    }

    // The barrier sits outside the conditionals, so every thread in the
    // block reaches it, even the out-of-bounds threads that did no work.
    __syncthreads();

    if (xIndex < width && yIndex < height)
    {
        // write it out (transposed) into the new location
        out[index_out] = block[index_transpose];
    }
}
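For the guards to pay off, the grid is typically rounded up with a ceiling division (a sketch of mine; note that this kernel still derives its row stride from blockDim.x * gridDim.x, i.e. the padded width, so as later posts in this thread note, more fixes are needed before arbitrary sizes work):

    dim3 threads(16, 16);
    dim3 grid((width  + threads.x - 1) / threads.x,    // ceiling division
              (height + threads.y - 1) / threads.y);
    transpose<float><<<grid, threads>>>(d_out, d_in, width, height);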

Thanks, that's what I did.

Another question: can I do the following?

if (blockId…)
{
    break;
}
…sync();

Since the whole block behaves the same (every thread takes the same branch), no thread will be left waiting at sync().

A similar question: can I do the following?

if (blockId)
{
    return;
}
sync();
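For what it's worth, here is a minimal sketch of the pattern these two questions seem to describe (names are mine): because the condition depends only on blockIdx, every thread of a given block takes the same path, so either all of them reach the barrier or none of them do.

__global__ void earlyExit(float *data, int activeBlocks)
{
    __shared__ float tile[256];               // assumes blockDim.x == 256
    // Uniform within the block: all threads return together,
    // so no thread is left waiting at __syncthreads().
    if (blockIdx.x >= activeBlocks)
        return;

    tile[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... work with tile ...
}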

Must block be defined in shared memory?

Assuming the data type is float: if block[BLOCK_NUM_THREADS] is defined as an ordinary (per-thread register) variable instead, am I going to see a drop in performance?

The efficient transpose relies on threads being able to read data loaded by other threads. Threads cannot read each other's registers, so without shared memory you can only do the obvious transpose algorithm (the one with non-coalesced loads and very poor performance).

As for the other question about syncs: your question is unclear.

Mark

Hi, I'm trying to use Mark's code to do the transpose, and I'm calling the function this way (short form):

size_x = 64; size_y = 256;
dim3 threads(16, 16);
dim3 grid(size_x / threads.x, size_y / threads.y);
transpose<<<grid, threads>>>(d_idata, d_odata, size_x, size_y);

If I use the same sizes (i.e. 64 and 64) it works; if the sizes are different, it doesn't.
In the kernel, I replaced every template parameter T with float, and I declared __shared__ float block[16 * 16];
Is this OK?

P.S. I'm not using the version with the ifs.
Thanks a lot!

Hi, the latest CUDA SDK (0.8.1) includes an updated "transpose" sample: a version of the code posted here with the bugs fixed, which should be efficient for arbitrary matrix dimensions. Please try that.

Thanks,
Mark