How to load shared memory efficiently

I have a CUDA application which does computation on a matrix. Therefore, I have split the matrix into tiles, each the size of a thread block, with several threads per block.

The tile is roughly the size of the shared memory.

Normally I would do something like tile[ty][tx] = global_array[ … ];
In this case every thread loads its own value from global memory, and the hardware coalesces the accesses.

However, I also need the neighbor values around the tile to compute my results. Hence I have TILE_WIDTH*TILE_WIDTH threads but (TILE_WIDTH+2)*(TILE_WIDTH+2) loads.

What is the most efficient solution to perform these loads?

  1. Perform the loads as above (so every thread issues one load instruction) and then load the boundary values separately? Loading the boundary values would be slow because they are not at consecutive addresses.
  2. Use a single (master) thread to perform all the loads while all other threads wait? In this case all loads within a row are consecutive, but only one thread can issue the load instructions.


Option 1 is a lot better. Have all the threads read a row or column in the usual way (so that the read coalesces), and have the first and last threads of the row/column issue an extra read to fetch the boundary values that fill out the tile. The extra reads do not make up a full warp-sized transaction, which carries a performance penalty, but keeping the bulk of the loads within the coalescing rules is the best option I have found.
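To make the load pattern concrete, here is a minimal sketch of option 1. Everything beyond tile, tx, ty and TILE_WIDTH is an assumption made for illustration: the 3x3 sum used as the computation, the names stencil_tiled, load_or_zero and run_stencil, TILE_WIDTH = 16, and treating cells outside the matrix as 0. Error checking of the CUDA runtime calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_WIDTH 16  // assumed tile edge: one thread per interior element

// Hypothetical helper: cells outside the matrix are read as 0.
__device__ float load_or_zero(const float* in, int x, int y, int w, int h)
{
    return (x >= 0 && x < w && y >= 0 && y < h) ? in[y * w + x] : 0.0f;
}

// Illustrative computation: each output is the sum of its 3x3 neighbourhood.
__global__ void stencil_tiled(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE_WIDTH + 2][TILE_WIDTH + 2];
    const int tx = threadIdx.x, ty = threadIdx.y;
    const int gx = blockIdx.x * TILE_WIDTH + tx;
    const int gy = blockIdx.y * TILE_WIDTH + ty;
    const int last = TILE_WIDTH - 1;

    // Bulk load: every thread fetches its own element (coalesced along x).
    tile[ty + 1][tx + 1] = load_or_zero(in, gx, gy, width, height);

    // Top and bottom boundary rows: still consecutive addresses along x.
    if (ty == 0)    tile[0][tx + 1]              = load_or_zero(in, gx, gy - 1, width, height);
    if (ty == last) tile[TILE_WIDTH + 1][tx + 1] = load_or_zero(in, gx, gy + 1, width, height);

    // Left and right boundary columns: strided, so these do not coalesce.
    if (tx == 0)    tile[ty + 1][0]              = load_or_zero(in, gx - 1, gy, width, height);
    if (tx == last) tile[ty + 1][TILE_WIDTH + 1] = load_or_zero(in, gx + 1, gy, width, height);

    // Four corners, one thread each.
    if (tx == 0 && ty == 0)       tile[0][0]                           = load_or_zero(in, gx - 1, gy - 1, width, height);
    if (tx == last && ty == 0)    tile[0][TILE_WIDTH + 1]              = load_or_zero(in, gx + 1, gy - 1, width, height);
    if (tx == 0 && ty == last)    tile[TILE_WIDTH + 1][0]              = load_or_zero(in, gx - 1, gy + 1, width, height);
    if (tx == last && ty == last) tile[TILE_WIDTH + 1][TILE_WIDTH + 1] = load_or_zero(in, gx + 1, gy + 1, width, height);

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    if (gx < width && gy < height) {
        float sum = 0.0f;
        for (int dy = 0; dy < 3; ++dy)
            for (int dx = 0; dx < 3; ++dx)
                sum += tile[ty + dy][tx + dx];
        out[gy * width + gx] = sum;
    }
}

// Host-side wrapper (hypothetical): copies the data over, runs the kernel, copies back.
std::vector<float> run_stencil(const std::vector<float>& h_in, int width, int height)
{
    const size_t bytes = h_in.size() * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid((width + TILE_WIDTH - 1) / TILE_WIDTH, (height + TILE_WIDTH - 1) / TILE_WIDTH);
    stencil_tiled<<<grid, block>>>(d_in, d_out, width, height);

    std::vector<float> h_out(h_in.size());
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Only the two column loads and the four corner loads fall outside the coalesced pattern; the centre load and the top and bottom rows all read consecutive addresses.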

cool, thanks, perfect answer to my question.

After finishing one tile, it might be worth having the same block work on a neighboring tile straight away, so that some of the boundary values can be reused. Proceed in the direction in which the accesses to the boundary values don’t coalesce, to maximise the savings.
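One way to read this suggestion, sketched under assumptions: with a row-major matrix the left and right boundary columns are the strided ones, so each block walks along x through a strip of tiles and carries the last two tile columns over in shared memory instead of reloading them. The names stencil_sliding, load_or_zero and run_sliding, the 3x3 sum, TILE_WIDTH = 16 and the zero padding outside the matrix are all hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_WIDTH 16  // assumed tile edge

// Hypothetical helper: cells outside the matrix are read as 0.
__device__ float load_or_zero(const float* in, int x, int y, int w, int h)
{
    return (x >= 0 && x < w && y >= 0 && y < h) ? in[y * w + x] : 0.0f;
}

// One block per strip of tile rows; the block slides along x, tile by tile.
// Tile column c always holds global column x0 - 1 + c.
__global__ void stencil_sliding(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE_WIDTH + 2][TILE_WIDTH + 2];
    const int tx = threadIdx.x, ty = threadIdx.y;
    const int y0 = blockIdx.y * TILE_WIDTH;

    // Prime the last two tile columns with global columns -1 and 0.
    // This is the only strided load the block ever issues.
    if (tx < 2)
        for (int r = ty; r < TILE_WIDTH + 2; r += TILE_WIDTH)
            tile[r][TILE_WIDTH + tx] = load_or_zero(in, tx - 1, y0 - 1 + r, width, height);
    __syncthreads();

    for (int x0 = 0; x0 < width; x0 += TILE_WIDTH) {
        // Reuse: the previous tile's last two columns become this tile's first two.
        if (tx < 2)
            for (int r = ty; r < TILE_WIDTH + 2; r += TILE_WIDTH)
                tile[r][tx] = tile[r][TILE_WIDTH + tx];
        __syncthreads();

        // Load the remaining columns row by row (consecutive addresses along x).
        for (int r = ty; r < TILE_WIDTH + 2; r += TILE_WIDTH)
            tile[r][tx + 2] = load_or_zero(in, x0 + 1 + tx, y0 - 1 + r, width, height);
        __syncthreads();

        // Illustrative computation: sum of the 3x3 neighbourhood.
        const int gx = x0 + tx, gy = y0 + ty;
        if (gx < width && gy < height) {
            float sum = 0.0f;
            for (int dy = 0; dy < 3; ++dy)
                for (int dx = 0; dx < 3; ++dx)
                    sum += tile[ty + dy][tx + dx];
            out[gy * width + gx] = sum;
        }
        __syncthreads();  // finish reading before the next shift overwrites the tile
    }
}

// Host-side wrapper (hypothetical): one block per strip of TILE_WIDTH rows.
std::vector<float> run_sliding(const std::vector<float>& h_in, int width, int height)
{
    const size_t bytes = h_in.size() * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid(1, (height + TILE_WIDTH - 1) / TILE_WIDTH);
    stencil_sliding<<<grid, block>>>(d_in, d_out, width, height);

    std::vector<float> h_out(h_in.size());
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The trade-off in this sketch is that the row loads start one element past the tile boundary, so they are contiguous but not aligned to it, and the grid has fewer blocks, which may reduce occupancy on small matrices. Whether the reuse pays off would need to be measured.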