Coalesced and conflict free memory access using cuda::memcpy_async/cp.async

Just took another stab at figuring this out and think I might have found the reason for my problems.

@merlintiger synchronous copies do generally give me worse performance than cp.async, even when the latter causes excessive wavefronts, but this could simply be because the benefits of asynchronous copies in my code outweigh the penalties caused by uncoalesced access.

@bcurl3ss alignment was not the issue, but it is indeed important to ensure, as stated here: CUDA C++ Programming Guide and here: CUDA C++ Programming Guide.

Rather, it seems that conditionals around the cp.async calls prevented them from being fully coalesced, even though only a single branch of these conditionals was taken at runtime. This appears to prevent even very simple access patterns over consecutive addresses from being coalesced. Removing the conditionals made Nsight Compute report no excessive accesses.
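For reference, a minimal sketch of the kind of pattern I mean (kernel and array names are hypothetical, not from my actual code): each thread guards its own element copy with a bounds check before issuing the async copy.

```cuda
#include <cuda_pipeline.h>

// Hypothetical reduction of the problematic pattern: the cp.async-backed
// copy sits inside a per-thread bounds conditional. Even when every thread
// in the warp takes the same branch at runtime, Nsight Compute reported
// excessive wavefronts for copies issued this way.
__global__ void staged_load(const float *g, float *out, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Branchy version: the conditional handles the array bounds.
    if (i < n)
        __pipeline_memcpy_async(&s[threadIdx.x], &g[i], sizeof(float));
    __pipeline_commit();
    __pipeline_wait_prior(0);
    __syncthreads();

    if (i < n)
        out[i] = s[threadIdx.x] * 2.0f;
}
```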

The conditionals were used for handling the bounds of the array, but it seems this should instead be done using the ability of cp.async to zero-fill. As far as I can tell, this is not exposed through the cuda::memcpy_async functions, but it can be done with __pipeline_memcpy_async (via its zfill argument) or inline PTX.
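Sketching the branchless alternative (again with hypothetical names): the zfill argument of __pipeline_memcpy_async specifies how many trailing bytes of the copy to fill with zeroes instead of reading from global memory, so out-of-bounds threads can issue the copy unconditionally with a source size of 0.

```cuda
#include <cuda_pipeline.h>

// Hypothetical branchless version: every thread issues the copy, and the
// array bounds are handled by zero-filling instead of a conditional.
// For a thread past the end of the array, zfill equals the full copy size,
// so nothing is read from the out-of-bounds source address.
__global__ void staged_load_zfill(const float *g, float *out, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Valid bytes for this thread's element: 4 in bounds, 0 past the end.
    size_t valid = (i < n) ? sizeof(float) : 0;
    // size_and_align = 4; the last (4 - valid) bytes are zero-filled.
    __pipeline_memcpy_async(&s[threadIdx.x], &g[i], sizeof(float),
                            sizeof(float) - valid);
    __pipeline_commit();
    __pipeline_wait_prior(0);
    __syncthreads();

    if (i < n)
        out[i] = s[threadIdx.x] * 2.0f;
}
```

The same effect is available in inline PTX, where cp.async takes an optional src-size operand and zero-fills the remainder of the cp-size bytes.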

If this is expected behavior, I feel like it should be better described in the documentation.