# Align unaligned memory access

Hi, I need a copy kernel that takes a matrix (stored linearly in row-major order) and writes three copies of it into a larger, zero-padded matrix like this:

suppose A =
a b c
d e f
g h i

becomes -> A’ =

```
0 0 0 0 0 0 0 0
0 a b c 0 c b a
0 d e f 0 f e d
0 g h i 0 i h g
0 0 0 0 0 0 0 0
0 g h i 0 0 0 0
0 d e f 0 0 0 0
0 a b c 0 0 0 0
```
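To pin the layout down for testing, here is a minimal host-side C++ sketch that builds A’ from A on the CPU. It is only a reference to compare a kernel's output against, not the kernel itself. The function name `build_padded_reference` and the use of `float` are assumptions for illustration, and I assume the pattern generalises from the 3x3 example to an A’ of size (2·rows + 2) x (2·cols + 2).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: builds A' (row-major) from A (row-major).
// Assumes A' has 2*rows + 2 rows and 2*cols + 2 columns, as in the 3x3 example.
std::vector<float> build_padded_reference(const std::vector<float>& a,
                                          std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    const std::size_t out_rows = 2 * rows + 2;
    const std::size_t out_cols = 2 * cols + 2;
    std::vector<float> out(out_rows * out_cols, 0.0f);  // start with all zeros

    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const float v = a[r * cols + c];
            // Top-left copy: shifted down one row and right one column.
            out[(r + 1) * out_cols + (c + 1)] = v;
            // Top-right copy: same row, columns reversed.
            out[(r + 1) * out_cols + (out_cols - 1 - c)] = v;
            // Bottom-left copy: rows reversed, same column.
            out[(out_rows - 1 - r) * out_cols + (c + 1)] = v;
        }
    }
    return out;
}
```

Calling it with the 3x3 example (a..i mapped to 1..9) gives an 8x8 result whose second row is `0 1 2 3 0 3 2 1`.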

I hope this is clear. My idea is to load 16x16 blocks of A into shared memory and then copy (and reverse) the rows of the block and then the columns, as shown in A’. This avoids strided global memory access for the columns and speeds up memory access.
I should note that A and A’ do not share memory addresses, so the copying is out-of-place.

The problem is that when A’ is allocated (also linear) it is aligned to at least 256 bytes, but when I write A into A’, the memory writes are misaligned because of the 0 added in front of each row (and after). This is true for each warp, which effectively halves the bandwidth I can achieve. Furthermore, later I need to extract A from A’ again and reading the data is again misaligned.
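The arithmetic behind the misalignment can be sketched as below. The helper names are hypothetical, and the 128-byte segment size used in the usage note is an assumption that depends on the GPU architecture. The first function counts how many aligned segments a contiguous write overlaps; the second gives the byte offset of the first copied element in a row of A’ for `float` data.

```cpp
#include <cassert>
#include <cstddef>

// Number of aligned segments of `segment_bytes` overlapped by the byte range
// [offset, offset + length). Hypothetical helper for illustration only.
std::size_t segments_touched(std::size_t offset, std::size_t length,
                             std::size_t segment_bytes) {
    assert(length > 0 && segment_bytes > 0);
    const std::size_t first = offset / segment_bytes;
    const std::size_t last = (offset + length - 1) / segment_bytes;
    return last - first + 1;
}

// Byte offset (from the start of A') of the first copied element in payload
// row r, for a float matrix A with `cols` columns. A' has 2*cols + 2 columns,
// and the payload starts one row down and one column in.
std::size_t payload_row_offset(std::size_t r, std::size_t cols) {
    const std::size_t out_cols = 2 * cols + 2;
    return ((r + 1) * out_cols + 1) * sizeof(float);
}
```

For example, with `cols = 31` each row of A’ is 64 floats (256 bytes), so rows start on aligned boundaries, but the first copied element sits 4 bytes in. A 128-byte write starting there overlaps two 128-byte segments instead of one, which matches the factor of two described above.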

Does anyone have a suggestion or idea on how to solve this problem? Maybe I can use the align specifiers somehow?

Why not shuffle the data in shared memory so that the accesses to A’ are aligned again? That would require reading the next block ahead of the write-out. You would only lose the bandwidth taken up by the 0, and potentially by whatever is left over on the other side of the matrix, but you would not incur a factor of 2.
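One way to read this suggestion is to let the output position drive the indexing: each group of threads writes a chunk that begins on an aligned boundary of A’ and pulls its values from wherever they sit in the staged input. The host-side C++ sketch below shows only that index mapping for one payload row. The names `source_column` and `write_payload_row` are hypothetical. In a real kernel the inner loop would be the threads of a warp, and `a_row` would be data staged in shared memory, covering the current block plus the neighbouring element.

```cpp
#include <cassert>
#include <cstddef>

// Maps an output column of a payload row of A' to the input column of A that
// supplies it, or returns -1 where A' holds a zero. Hypothetical helper.
long source_column(std::size_t out_col, std::size_t cols) {
    if (out_col >= 1 && out_col <= cols) {
        return static_cast<long>(out_col - 1);             // left copy
    }
    if (out_col >= cols + 2 && out_col <= 2 * cols + 1) {
        return static_cast<long>(2 * cols + 1 - out_col);  // mirrored copy
    }
    return -1;                                             // zero padding
}

// Fills one payload row of A' chunk by chunk. Each chunk starts at a multiple
// of `chunk` elements, standing in for one aligned warp-wide store.
void write_payload_row(const float* a_row, std::size_t cols,
                       std::size_t chunk, float* out_row) {
    assert(chunk > 0);
    const std::size_t out_cols = 2 * cols + 2;
    for (std::size_t base = 0; base < out_cols; base += chunk) {
        // In a kernel, `lane` would be the thread index within the warp.
        for (std::size_t lane = 0; lane < chunk && base + lane < out_cols; ++lane) {
            const long src = source_column(base + lane, cols);
            out_row[base + lane] = (src < 0) ? 0.0f : a_row[src];
        }
    }
}
```

Because every store begins at a multiple of `chunk`, the zero columns are written as part of the same aligned store rather than pushing the payload off the boundary. The cost is that one output chunk can need input elements from two neighbouring input blocks, which is why the next block has to be read ahead of the write-out.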
