I have a CUDA application which does computation on a matrix. Therefore, I have split the matrix into tiles, where each tile is processed by one block with several threads in it.
Each tile is roughly the size of the shared memory.
Normally I would do something like tile[ty][tx] = global_array[ … ];
In this case every thread would load its own value from global mem, coalesced by the hardware.
However, I also need the neighbor values around the tile (a halo of one element) to compute my results. Hence I have TILE_WIDTH × TILE_WIDTH threads, but I have (TILE_WIDTH+2) × (TILE_WIDTH+2) loads.
What is the most efficient solution to perform these loads?
- Perform the loads as above (so every thread issues a load instruction for its own element) and then load the boundary values separately? (Loading the boundary values would be slow because they are not at consecutive addresses.)
- Use a single (master) thread to perform all the loads while all other threads wait. In this case all loads within a row are at consecutive addresses, but only one thread can issue the load instructions, so most of the memory bandwidth and latency-hiding is wasted.
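A third option, which is the usual pattern for stencil-style kernels, is a cooperative load: let all TILE_WIDTH × TILE_WIDTH threads walk over the (TILE_WIDTH+2) × (TILE_WIDTH+2) shared-memory region with a flattened index, so each row of the halo region is still read at (mostly) consecutive addresses and the loads stay coalesced. Here is a minimal sketch; the kernel name, the 4-neighbor averaging stencil, and the clamp-to-edge boundary policy are all my assumptions, not something from your code:

```cuda
#define TILE_WIDTH 16
#define SH (TILE_WIDTH + 2)   // tile plus a one-element halo on each side

__global__ void stencil_kernel(const float *in, float *out,
                               int width, int height)
{
    __shared__ float tile[SH][SH];

    // Global coordinates of this thread's own output element.
    int gx = blockIdx.x * TILE_WIDTH + threadIdx.x;
    int gy = blockIdx.y * TILE_WIDTH + threadIdx.y;

    // Cooperative load: TILE_WIDTH*TILE_WIDTH threads cover the
    // SH*SH shared region in a grid-stride-style loop. Each thread
    // loads one or two elements; consecutive tids touch consecutive
    // addresses within a row, so accesses remain coalesced.
    int tid = threadIdx.y * TILE_WIDTH + threadIdx.x;
    for (int i = tid; i < SH * SH; i += TILE_WIDTH * TILE_WIDTH) {
        int ty = i / SH;
        int tx = i % SH;
        int y = blockIdx.y * TILE_WIDTH + ty - 1;  // -1 shifts for the halo
        int x = blockIdx.x * TILE_WIDTH + tx - 1;
        // Clamp to the matrix borders (one possible boundary policy;
        // assumption on my part).
        y = min(max(y, 0), height - 1);
        x = min(max(x, 0), width - 1);
        tile[ty][tx] = in[y * width + x];
    }
    __syncthreads();

    // Placeholder computation: average of the 4 neighbors.
    if (gx < width && gy < height) {
        int ty = threadIdx.y + 1;
        int tx = threadIdx.x + 1;
        out[gy * width + gx] = 0.25f * (tile[ty - 1][tx] + tile[ty + 1][tx] +
                                        tile[ty][tx - 1] + tile[ty][tx + 1]);
    }
}
```

With this scheme every thread still issues loads (unlike the master-thread variant), and the extra (TILE_WIDTH+2)² − TILE_WIDTH² halo elements only cost a second loop iteration for some threads rather than a serialized, uncoalesced cleanup pass.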