Is "GM2 = GM1" faster than "SM = GM1; GM2 = SM;"? (memory access time)

GM1, GM2: global memory arrays
SM: shared memory array.
If I only access GM1 and GM2 once each, I think
GM2[a] = GM1[b] is faster than "SM = GM1[b]; GM2[a] = SM;" (a and b are both random w.r.t. the thread id).
Right?
Thanks!

If you are only moving data between global memory arrays, then correct: shared memory doesn't provide much benefit.
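For instance, a minimal sketch of the two variants being compared (kernel and buffer names are mine, purely illustrative; bounds checks omitted):

__global__ void copyDirect(float *gm2, const float *gm1,
                           const int *a, const int *b)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    gm2[a[tid]] = gm1[b[tid]];                 // GM2[a] = GM1[b]
}

__global__ void copyStaged(float *gm2, const float *gm1,
                           const int *a, const int *b)
{
    __shared__ float sm[256];                  // assumes blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    sm[threadIdx.x] = gm1[b[tid]];             // SM = GM1[b]
    gm2[a[tid]] = sm[threadIdx.x];             // GM2[a] = SM
}

Each element is read and written exactly once, so the staged version just adds a shared-memory round trip with no data reuse to pay for it.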

However, as I’ve learned myself, some things that you might initially implement this way can benefit from staging in shared memory. An example is matrix transpose. The obvious way to transpose a matrix is

out[x][y] = in[y][x];
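Spelled out as a full kernel with flattened indexing, that might look like this (a sketch of mine, names illustrative):

__global__ void transposeNaive(float *out, const float *in,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;           // column of in
    int y = blockIdx.y * blockDim.y + threadIdx.y;           // row of in
    out[x * height + y] = in[y * width + x];                 // out[x][y] = in[y][x]
}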

The problem is that either the read or the write will be uncoalesced, resulting in very low bandwidth for the entire transpose. To fix this, load a 2D block of data into shared memory (with fully coalesced reads), and then write the block (transposed) out into the transposed block location in the destination array (with fully coalesced writes).

If anyone is interested in code for this I can provide it.

Mark

Hi Mark, I am interested in the code you mentioned. Thanks!

Here’s the kernel code. It’s pretty fast.

template <typename T>
__global__ void transpose(T *out, T *in, int width, int height)
{
    // One tile of the input, staged through shared memory so that both
    // the global read and the global write can be fully coalesced.
    __shared__ T block[BLOCK_NUM_THREADS];   // BLOCK_NUM_THREADS = threads per block

    unsigned int xBlock = __mul24(blockDim.x, blockIdx.x);
    unsigned int yBlock = __mul24(blockDim.y, blockIdx.y);

    // load block into smem (coalesced read along a row of in)
    unsigned int index_in =
            __mul24(__mul24(blockDim.x, gridDim.x), yBlock + threadIdx.y) +
            xBlock + threadIdx.x;
    unsigned int index_block = __mul24(threadIdx.y, blockDim.x) + threadIdx.x;
    block[index_block] = in[index_in];

    // transposed position of this thread's element within the tile
    unsigned int index_transpose = __mul24(threadIdx.x, blockDim.x) + threadIdx.y;

    unsigned int index_out =
            __mul24(__mul24(blockDim.x, gridDim.x), xBlock + threadIdx.y) +
            yBlock + threadIdx.x;

    // wait until the whole tile has been loaded
    __syncthreads();

    // write it out (transposed) into the new location (coalesced write)
    out[index_out] = block[index_transpose];
}

Note that

  1. It is templatized so you can easily apply it to arrays of different data types (yes, templates work in CUDA; more on that to come).

  2. It assumes that the array dimensions are a multiple of the block size. If you want to handle arrays that are not, you can just add some if statements to avoid computing on threads whose x or y index is outside of the array. (A launch sketch for the multiple-of-block-size case follows this list.)

  3. It uses 24-bit multiplies on indices to save addressing cycles.
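For example, a launch for the multiple-of-block-size case might look like this (d_out, d_in and the sizes are illustrative names of mine):

    dim3 threads(16, 16);                      // BLOCK_NUM_THREADS = 256
    dim3 grid(width / threads.x, height / threads.y);
    transpose<float><<<grid, threads>>>(d_out, d_in, width, height);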

Mark

One question: "if (thread idx > array size) break;" can't be used before __syncthreads(), right?

So how do you do the if condition?

template <typename T>
__global__ void transpose(T *out, T *in, int width, int height)
{
    __shared__ T block[BLOCK_NUM_THREADS];

    unsigned int xBlock = __mul24(blockDim.x, blockIdx.x);
    unsigned int yBlock = __mul24(blockDim.y, blockIdx.y);
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height)
    {
        // load block into smem
        unsigned int index_in =
                __mul24(__mul24(blockDim.x, gridDim.x), yIndex) + xIndex;
        unsigned int index_block = __mul24(threadIdx.y, blockDim.x) + threadIdx.x;
        block[index_block] = in[index_in];

        index_transpose = __mul24(threadIdx.x, blockDim.x) + threadIdx.y;

        index_out = __mul24(__mul24(blockDim.x, gridDim.x), xBlock + threadIdx.y) +
                yBlock + threadIdx.x;
    }

    // The barrier sits outside the conditionals, so every thread in the
    // block reaches it, even the out-of-bounds threads that did no work.
    __syncthreads();

    if (xIndex < width && yIndex < height)
    {
        // write it out (transposed) into the new location
        out[index_out] = block[index_transpose];
    }
}
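For the guards to pay off, the grid is typically rounded up with a ceiling division (a sketch of mine; note that this kernel still derives its row stride from blockDim.x * gridDim.x, i.e. the padded width, so as later posts in this thread note, more fixes are needed before arbitrary sizes work):

    dim3 threads(16, 16);
    dim3 grid((width  + threads.x - 1) / threads.x,    // ceiling division
              (height + threads.y - 1) / threads.y);
    transpose<float><<<grid, threads>>>(d_out, d_in, width, height);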

Thanks, that's what I did.

Another question: can I do the following?

if (blockId…)
{
    break;
}
…sync();

Since the whole block behaves the same (every thread takes the same branch), no thread will be left waiting at sync().

A similar question: can I do the following?

if (blockId)
{
    return;
}
sync();
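For what it's worth, here is a minimal sketch of the pattern these two questions seem to describe (names are mine): because the condition depends only on blockIdx, every thread of a given block takes the same path, so either all of them reach the barrier or none of them do.

__global__ void earlyExit(float *data, int activeBlocks)
{
    __shared__ float tile[256];               // assumes blockDim.x == 256
    // Uniform within the block: all threads return together,
    // so no thread is left waiting at __syncthreads().
    if (blockIdx.x >= activeBlocks)
        return;

    tile[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... work with tile ...
}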

Must block be defined in shared memory?

Assuming the data type is float: if block[BLOCK_NUM_THREADS] is defined as an ordinary (per-thread register) variable instead, am I going to see a drop in performance?

The efficient transpose relies on threads being able to read data loaded by other threads. Threads cannot read each other's registers, so without shared memory you can only do the obvious transpose algorithm (the one with non-coalesced loads and very poor performance).

As for the other question about syncs: your question is unclear.

Mark

Hi, I'm trying to use Mark's code to do the transpose, and I'm calling the function this way (short form):

size_x = 64; size_y = 256;
dim3 threads(16, 16);
dim3 grid(size_x / threads.x, size_y / threads.y);
transpose<<<grid, threads>>>(d_idata, d_odata, size_x, size_y);

If I use the same sizes (i.e. 64 and 64) it works; if the sizes are different, it doesn't.
In the kernel, I replaced every template parameter T with float, and I declared __shared__ float block[16 * 16];
Is this OK?

P.S. I'm not using the version with the ifs.
Thanks a lot!

Hi, the latest CUDA SDK (0.8.1) includes an updated "transpose" sample: a version of the code posted here with the bugs fixed, which should be efficient for arbitrary matrix dimensions. Please try that.

Thanks,
Mark