Coalesced and conflict-free memory access using cuda::memcpy_async/cp.async

That does indeed sound very similar, but sadly I have still not managed to find a solution, and have moved on to try using Cutlass instead.

I also found this forum post from last year which seems to describe a similar issue, but with no answers.

A minimal example would be interesting; it might even be possible to find optimal access patterns by brute-force trial and error.
Alternatively, it should be possible to work out which patterns Cutlass uses and reuse them, since its kernels do not seem to have this issue.
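
To make the brute-force idea concrete, here is a rough host-side sketch that searches a small family of XOR-based layouts using a simplified bank model. Everything in it is my own assumption rather than something measured: the model (32 banks of 4 bytes, 16-byte accesses served 8 threads at a time), the 8x8 tile of 16-byte chunks, and the helper names (`physical_chunk`, `conflict_degree`, `find_conflict_free`) are all hypothetical. As far as I know, XOR-style swizzles of this kind are also what the Cutlass layouts are built on, but I have not checked this against their source.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <utility>
#include <vector>

// Assumed model (not taken from the thread): shared memory has 32 banks of
// 4 bytes, so a 16-byte chunk spans 4 banks and there are 8 "bank groups".
// 16-byte accesses are assumed to be served 8 threads at a time.
constexpr int kGroup = 8;  // threads per phase, chunks per 128-byte row, rows per tile

// Hypothetical candidate layout: XOR the chunk column with some bits of the row.
int physical_chunk(int row, int col, int shift, int mask) {
    assert(row >= 0 && row < kGroup && col >= 0 && col < kGroup);
    assert(mask >= 0 && mask < kGroup);
    return row * kGroup + (col ^ ((row >> shift) & mask));
}

// Largest number of the 8 accesses that land in the same bank group
// (1 means conflict-free, 8 means fully serialized).
int conflict_degree(const std::array<int, kGroup>& chunks) {
    std::array<int, kGroup> hits{};
    for (int chunk : chunks) ++hits[chunk % kGroup];
    return *std::max_element(hits.begin(), hits.end());
}

// Worst case when 8 threads touch the 8 chunks of one row
// (the coalesced store pattern of a 16-byte async copy).
int row_access_degree(int shift, int mask) {
    int worst = 0;
    for (int row = 0; row < kGroup; ++row) {
        std::array<int, kGroup> chunks{};
        for (int t = 0; t < kGroup; ++t) chunks[t] = physical_chunk(row, t, shift, mask);
        worst = std::max(worst, conflict_degree(chunks));
    }
    return worst;
}

// Worst case when 8 threads touch the same logical column in 8 different rows
// (the pattern that fully conflicts in an unswizzled layout).
int column_access_degree(int shift, int mask) {
    int worst = 0;
    for (int col = 0; col < kGroup; ++col) {
        std::array<int, kGroup> chunks{};
        for (int t = 0; t < kGroup; ++t) chunks[t] = physical_chunk(t, col, shift, mask);
        worst = std::max(worst, conflict_degree(chunks));
    }
    return worst;
}

// Brute force: try every (shift, mask) pair in the candidate family and keep
// the ones that are conflict-free for both access patterns.
std::vector<std::pair<int, int>> find_conflict_free() {
    std::vector<std::pair<int, int>> found;
    for (int shift = 0; shift < 3; ++shift)
        for (int mask = 0; mask < kGroup; ++mask)
            if (row_access_degree(shift, mask) == 1 && column_access_degree(shift, mask) == 1)
                found.emplace_back(shift, mask);
    return found;
}
```

In this model the unswizzled layout (`mask = 0`) is conflict-free for row accesses but 8-way conflicted for column accesses, and the only candidate that survives the search is `shift = 0, mask = 7`, i.e. `col ^ row`. Whether the real hardware behaves like this model for cp.async would still need to be confirmed with a profiler on an actual kernel.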