Hello, I have a program using cuda::memcpy_async to move data from global memory to shared memory.
I am getting less performance than expected, and have therefore profiled the code using Nsight Compute.
The SASS instruction corresponding to the copy, LDGSTS.E.BYPASS.128, shows 50% excessive global accesses and 87.5% excessive shared accesses.
I am attempting to implement swizzling as described here, so the access pattern is a bit complicated, but supposedly it should still be fully coalesced and bank-conflict free.
If I replace the asynchronous copies with ordinary (synchronous) copies, and profile again using Nsight Compute, there are no mentions of excessive accesses to either global or shared memory, even though the access pattern remains the same.
I have also tried using the cp.async PTX instruction directly using inline assembly, but this gives me the same results as when using cuda::memcpy_async.
Are there any reasons why using asynchronous copies might introduce uncoalesced accesses not present when using synchronous copies and the same access pattern?
Are there different guidelines to adhere to in order to achieve coalesced access when using asynchronous copies, compared to when using synchronous copies?
I’m actually having a very similar issue. My synchronous code works correctly, profiling reports no issues, but when I substitute in memcpy_async, some of my LDGSTS.E.BYPASS.128 instructions have 50% excessive global accesses in some circumstances. I would expect each global memory request would only read 16 sectors, but in some cases I see 32 sectors per request. I also see shared memory conflicts of 50% to 75%. Nothing should be conflicted as far as I can tell. I am pipelining the memcpy_async instructions and the variability I see appears when I vary the number of stages in the pipeline.
Are you still having your issues? I will try to get a minimal reproducible example. If I’m still having the issue with a simple example, maybe we can figure it out together.
That does indeed sound very similar, but sadly I have still not managed to find a solution, and have moved on to try using Cutlass instead.
I also found this forum post from last year which seems to describe a similar issue, but with no answers.
A minimal example might be interesting, perhaps it might even be possible to find optimal patterns using trial and error and brute force.
Alternatively, it should be possible to figure out what patterns are used by Cutlass and use that, since their kernels do not seem to have this issue.
I think I figured out my issue. I created a minimal reproducible example, and it worked without any of the issues I saw in my other code. After inspecting everything to find what was different, the only thing I found was that the memory address for shared memory had weird alignment. In my test case that worked I had some large alignment. In my code with issues I had only 16B alignment. After forcing alignment to 128B on my int4 array my issues seem to have gone away!
Here is what I did. Let me know if you think this might be your problem.