Hi, I need a copy kernel that takes a matrix (stored linearly in row-major order) and writes three copies of it into a larger matrix, like this:
Suppose A =
a b c
d e f
g h i
becomes -> A’ =
0 0 0 0 0 0 0 0
0 a b c 0 c b a
0 d e f 0 f e d
0 g h i 0 i h g
0 0 0 0 0 0 0 0
0 g h i 0 0 0 0
0 d e f 0 0 0 0
0 a b c 0 0 0 0
I hope this is clear. My idea is to load 16x16 blocks of A into shared memory and then write each block out three times: once unchanged, once with each row reversed left to right, and once with the row order reversed top to bottom, as shown in A’. This avoids strided global memory access for the columns and should speed up the copy.
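A minimal sketch of this tiling idea, under several assumptions that are not stated in the post: the elements are `float`, A is `rows` x `cols`, A’ is `(2*rows+2)` x `(2*cols+2)` (generalized from the 3x3 example), and A’ is zero-filled before the kernel runs. The names `copy_three_way` and `expand_on_device` are made up for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE = 16;  // 16x16 tile, as proposed in the post

// A:  rows x cols, row-major.
// Ap: (2*rows+2) x (2*cols+2), row-major, assumed to be zero-filled already.
__global__ void copy_three_way(const float* __restrict__ A,
                               float* __restrict__ Ap, int rows, int cols)
{
    __shared__ float tile[TILE][TILE];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int c0 = blockIdx.x * TILE;   // first source column of this tile
    const int r = blockIdx.y * TILE + ty;
    const int c = c0 + tx;
    const int ldp = 2 * cols + 2;       // row length of A'

    // Load one tile of A; consecutive threads read consecutive addresses.
    if (r < rows && c < cols) tile[ty][tx] = A[r * cols + c];
    __syncthreads();

    if (r < rows && c < cols) {
        // Copy 1: unchanged, shifted by one row and one column.
        Ap[(r + 1) * ldp + (c + 1)] = tile[ty][tx];
        // Copy 3: row order reversed, placed below the zero separator row.
        Ap[(2 * rows + 1 - r) * ldp + (c + 1)] = tile[ty][tx];
    }

    // Copy 2: each row reversed. Thread tx writes the element loaded by
    // thread TILE-1-tx, so consecutive threads still write ascending addresses.
    const int cm = c0 + (TILE - 1 - tx);  // source column of that element
    if (r < rows && cm < cols)
        Ap[(r + 1) * ldp + (2 * cols + 1 - cm)] = tile[ty][TILE - 1 - tx];
}

// Hypothetical host wrapper: uploads A, runs the kernel, downloads A'.
std::vector<float> expand_on_device(const std::vector<float>& A,
                                    int rows, int cols)
{
    assert(A.size() == static_cast<std::size_t>(rows) * cols);
    const std::size_t prow = 2 * rows + 2, pcol = 2 * cols + 2;
    std::vector<float> out(prow * pcol);

    float *dA = nullptr, *dAp = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dAp, out.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dAp, 0, out.size() * sizeof(float));  // borders and unused quadrant

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    copy_three_way<<<grid, block>>>(dA, dAp, rows, cols);

    // Blocking copy; also waits for the kernel to finish.
    cudaMemcpy(out.data(), dAp, out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dAp);
    return out;
}
```

The `+ 1` in the column index of every store is the leading zero of each row of A’, which is why the stores in this sketch do not start on an aligned boundary.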
I should note that A and A’ do not share memory addresses, so the copying is out-of-place.
The problem is that A’ (also stored linearly) is allocated with at least 256-byte alignment, but the rows of A that I write into A’ are misaligned because of the 0 added in front of (and after) each row. This is true for every warp, which effectively halves the bandwidth I can achieve. Furthermore, I later need to extract A from A’ again, and those reads are misaligned in the same way.
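The misalignment can be made concrete by computing where each copied row starts inside A’. The following is a rough sketch, assuming 4-byte `float` elements, an A’ row length of `2*cols+2` (generalized from the example), and 128-byte segments (32 threads x 4 bytes per warp); the helper names are hypothetical.

```cuda
#include <cstddef>

// Byte offset, relative to the start of A', of the first element of the
// unchanged copy of source row r (A has `cols` columns of float).
__host__ __device__ constexpr std::size_t
direct_copy_offset_bytes(int r, int cols)
{
    // Skip (r + 1) full rows of A' (the zero row plus r data rows),
    // then one more element for the leading zero of this row.
    return ((static_cast<std::size_t>(r) + 1) *
                (2 * static_cast<std::size_t>(cols) + 2) + 1) * sizeof(float);
}

// Distance in bytes from the previous `segment`-byte boundary
// (0 means the offset is aligned to `segment`).
__host__ __device__ constexpr std::size_t
misalignment(std::size_t byte_offset, std::size_t segment)
{
    return byte_offset % segment;
}
```

For `cols = 1024`, the first copied row starts at byte 8204, which is 12 bytes past a 128-byte boundary. More generally, the element index `(r + 1) * (2 * cols + 2) + 1` is always odd, so with 4-byte elements the start of a copied row is never even 8-byte aligned, whatever the base alignment of A’.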
Does anyone have a suggestion or idea on how to solve this problem? Maybe I can use the alignment specifiers somehow?