NVIDIA Developer Forums

Align unaligned memory access

folkert October 21, 2010, 10:14am 1

Hi, I need a copy kernel that takes a matrix (stored linearly in row-major order) and copies it three times, like this:

suppose A =
a b c
d e f
g h i

becomes → A’ =

0 0 0 0 0 0 0 0
0 a b c 0 c b a
0 d e f 0 f e d
0 g h i 0 i h g
0 0 0 0 0 0 0 0
0 g h i 0 0 0 0
0 d e f 0 0 0 0
0 a b c 0 0 0 0

I hope this is clear. My idea is to load 16x16 blocks of A into shared memory and then copy (and reverse) the rows of the block, and then the columns, as shown in A’. This avoids strided global memory access for the columns and speeds up memory access.
I should note that A and A’ do not share memory addresses, so the copying is out-of-place.
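For reference, the layout above can be written down as plain index arithmetic. The following is a minimal host-side C++ sketch, not a CUDA kernel; the function name `build_mirrored` and the use of `float` elements are assumptions made for illustration. It fills an output of size (2*rows+2) x (2*cols+2) with the original copy, the left-right mirrored copy, and the vertically flipped copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper: builds A' from a row-major A (rows x cols).
// A' has (2*rows + 2) rows and (2*cols + 2) columns; untouched cells stay 0.
std::vector<float> build_mirrored(const std::vector<float>& a,
                                  std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    const std::size_t out_rows = 2 * rows + 2;
    const std::size_t out_cols = 2 * cols + 2;
    std::vector<float> out(out_rows * out_cols, 0.0f);

    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const float v = a[r * cols + c];
            // Original copy, shifted by one row and one column of zeros.
            out[(r + 1) * out_cols + (c + 1)] = v;
            // Copy mirrored left to right, placed in the right half.
            out[(r + 1) * out_cols + (out_cols - 1 - c)] = v;
            // Copy flipped top to bottom, placed in the lower left.
            out[(out_rows - 1 - r) * out_cols + (c + 1)] = v;
        }
    }
    return out;
}
```

The `c + 1` column offset in the first and third writes is the one-element shift that causes the misalignment described next.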

The problem is that A’ (also stored linearly) is allocated with at least 256-byte alignment, but when I write A into A’, the memory writes are misaligned because of the 0 added in front of (and after) each row. This happens for every warp, which effectively halves the bandwidth I can achieve. Furthermore, I later need to extract A from A’ again, and those reads are misaligned as well.

Does anyone have a suggestion or idea on how to solve this problem? Maybe I can use the align specifiers somehow?

tera October 21, 2010, 12:40pm 3

Why not shuffle the data in shared memory so that the accesses to A’ are aligned again? That would require reading the next block ahead of writeout. You would only lose the bandwidth taken up by the 0, and potentially by whatever is left over on the other side of the matrix, but you would not incur a factor of 2.
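One way to picture this suggestion is the host-side C++ model below. It is only a sketch under stated assumptions, not a kernel: the name `shifted_row_copy`, the tile width of 16 `float` elements, and the restriction to a single row of the left-hand (unmirrored) copy are all choices made for illustration. The `staging` array stands in for shared memory and holds two tiles, so every load from A and every store to A’ covers a tile-aligned range, while the one-element shift happens inside the staging buffer.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 16;  // assumed tile width (one half-warp of floats)

// Hypothetical model of one row: out[0] = 0, out[j] = a_row[j - 1], zeros after.
// Loads and stores both touch only tile-aligned ranges [seg*kTile, seg*kTile + kTile).
std::vector<float> shifted_row_copy(const std::vector<float>& a_row,
                                    std::size_t out_width) {
    std::vector<float> out(out_width, 0.0f);
    // Stand-in for shared memory: previous tile in the first half,
    // current tile in the second half. Starts zeroed.
    std::array<float, 2 * kTile> staging{};

    for (std::size_t seg = 0; seg * kTile < out_width; ++seg) {
        // Keep the tile loaded in the previous iteration.
        std::copy(staging.begin() + kTile, staging.end(), staging.begin());

        // Aligned load of tile `seg` of A (zero-filled past the end of the row).
        for (std::size_t lane = 0; lane < kTile; ++lane) {
            const std::size_t src = seg * kTile + lane;
            staging[kTile + lane] = src < a_row.size() ? a_row[src] : 0.0f;
        }

        // Aligned store: each lane reads the staged element one position to its left.
        for (std::size_t lane = 0; lane < kTile; ++lane) {
            const std::size_t dst = seg * kTile + lane;
            if (dst < out_width) {
                out[dst] = staging[kTile + lane - 1];
            }
        }
    }
    return out;
}
```

In an actual kernel the staging buffer would presumably be a `__shared__` array and each lane a thread, with a synchronization between the load and the store. Each aligned output tile needs data from two adjacent input tiles, which is why a second tile has to be resident before writeout.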
