Coalesced and conflict free memory access using cuda::memcpy_async/cp.async

Just took another stab at figuring this out and think I might have found the reason for my problems.

@merlintiger synchronous copies do generally give me worse performance than cp.async, even when the latter causes excessive wavefronts, but this could simply be because the benefits of asynchronous copies in my code outweigh the penalties caused by uncoalesced access.

@bcurl3ss alignment was not the issue, but it is indeed important to ensure, as stated here: CUDA C++ Programming Guide and here: CUDA C++ Programming Guide.

Rather, it seems that conditionals around the cp.async calls prevented them from being fully coalesced, even though only a single branch of these conditionals was taken at runtime. This appears to prevent even very simple access patterns over consecutive addresses from being coalesced. Removing the conditionals made Nsight Compute report no excessive accesses.
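For reference, a minimal sketch of the kind of pattern I mean (kernel and array names are hypothetical, not from my actual code): each thread guards its own element copy with a bounds check before issuing the async copy.

```cuda
#include <cuda_pipeline.h>

// Hypothetical reduction of the problematic pattern: the cp.async-backed
// copy sits inside a per-thread bounds conditional. Even when every thread
// in the warp takes the same branch at runtime, Nsight Compute reported
// excessive wavefronts for copies issued this way.
__global__ void staged_load(const float *g, float *out, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Branchy version: the conditional handles the array bounds.
    if (i < n)
        __pipeline_memcpy_async(&s[threadIdx.x], &g[i], sizeof(float));
    __pipeline_commit();
    __pipeline_wait_prior(0);
    __syncthreads();

    if (i < n)
        out[i] = s[threadIdx.x] * 2.0f;
}
```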

The conditionals were used for handling the bounds of the array, but it seems this should instead be done using the ability of cp.async to zero-fill. As far as I can tell, this is not exposed through the cuda::memcpy_async functions, but it can be done with __pipeline_memcpy_async (via its zfill argument) or inline PTX.
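Sketching the branchless alternative (again with hypothetical names): the zfill argument of __pipeline_memcpy_async specifies how many trailing bytes of the copy to fill with zeroes instead of reading from global memory, so out-of-bounds threads can issue the copy unconditionally with a source size of 0.

```cuda
#include <cuda_pipeline.h>

// Hypothetical branchless version: every thread issues the copy, and the
// array bounds are handled by zero-filling instead of a conditional.
// For a thread past the end of the array, zfill equals the full copy size,
// so nothing is read from the out-of-bounds source address.
__global__ void staged_load_zfill(const float *g, float *out, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Valid bytes for this thread's element: 4 in bounds, 0 past the end.
    size_t valid = (i < n) ? sizeof(float) : 0;
    // size_and_align = 4; the last (4 - valid) bytes are zero-filled.
    __pipeline_memcpy_async(&s[threadIdx.x], &g[i], sizeof(float),
                            sizeof(float) - valid);
    __pipeline_commit();
    __pipeline_wait_prior(0);
    __syncthreads();

    if (i < n)
        out[i] = s[threadIdx.x] * 2.0f;
}
```

The same effect is available in inline PTX, where cp.async takes an optional src-size operand and zero-fills the remainder of the cp-size bytes.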

If this is expected behavior, I feel like it should be better described in the documentation.