I have a CUDA application which does computation on a matrix. Therefore, I have split the matrix into tiles, where each tile is processed by one block with several threads in it.
Each tile is roughly the size of the shared memory.
Normally I would do something like tile[ty][tx] = global_array[ … ];
In this case every thread would load its own value from global mem, coalesced by the hardware.
However, I also need the neighbor values around the tile (a halo of one element) to compute my results. Hence I have TILE_WIDTH × TILE_WIDTH threads, but I have (TILE_WIDTH+2) × (TILE_WIDTH+2) loads.
What is the most efficient solution to perform these loads?
- Perform the loads as above (so every thread issues a load instruction for its own element) and then load the boundary values separately? (Loading the boundary values would be slow because they are not at consecutive addresses.)
- Use a single (master) thread to perform all the loads while all other threads wait. In this case all loads within a row are at consecutive addresses, but only one thread can issue the load instructions, so most of the memory bandwidth and latency-hiding is wasted.
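A third option, which is the usual pattern for stencil-style kernels, is a cooperative load: let all TILE_WIDTH × TILE_WIDTH threads walk over the (TILE_WIDTH+2) × (TILE_WIDTH+2) shared-memory region with a flattened index, so each row of the halo region is still read at (mostly) consecutive addresses and the loads stay coalesced. Here is a minimal sketch; the kernel name, the 4-neighbor averaging stencil, and the clamp-to-edge boundary policy are all my assumptions, not something from your code:

```cuda
#define TILE_WIDTH 16
#define SH (TILE_WIDTH + 2)   // tile plus a one-element halo on each side

__global__ void stencil_kernel(const float *in, float *out,
                               int width, int height)
{
    __shared__ float tile[SH][SH];

    // Global coordinates of this thread's own output element.
    int gx = blockIdx.x * TILE_WIDTH + threadIdx.x;
    int gy = blockIdx.y * TILE_WIDTH + threadIdx.y;

    // Cooperative load: TILE_WIDTH*TILE_WIDTH threads cover the
    // SH*SH shared region in a grid-stride-style loop. Each thread
    // loads one or two elements; consecutive tids touch consecutive
    // addresses within a row, so accesses remain coalesced.
    int tid = threadIdx.y * TILE_WIDTH + threadIdx.x;
    for (int i = tid; i < SH * SH; i += TILE_WIDTH * TILE_WIDTH) {
        int ty = i / SH;
        int tx = i % SH;
        int y = blockIdx.y * TILE_WIDTH + ty - 1;  // -1 shifts for the halo
        int x = blockIdx.x * TILE_WIDTH + tx - 1;
        // Clamp to the matrix borders (one possible boundary policy;
        // assumption on my part).
        y = min(max(y, 0), height - 1);
        x = min(max(x, 0), width - 1);
        tile[ty][tx] = in[y * width + x];
    }
    __syncthreads();

    // Placeholder computation: average of the 4 neighbors.
    if (gx < width && gy < height) {
        int ty = threadIdx.y + 1;
        int tx = threadIdx.x + 1;
        out[gy * width + gx] = 0.25f * (tile[ty - 1][tx] + tile[ty + 1][tx] +
                                        tile[ty][tx - 1] + tile[ty][tx + 1]);
    }
}
```

With this scheme every thread still issues loads (unlike the master-thread variant), and the extra (TILE_WIDTH+2)² − TILE_WIDTH² halo elements only cost a second loop iteration for some threads rather than a serialized, uncoalesced cleanup pass.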